miniBen — Mini LLM Benchmarks

miniBen evaluates Large Language Models on small benchmarks through OpenRouter. Each benchmark runs several prompt variants, sends them to a model, parses replies, and aggregates a score.

Status: run_benchmark() and AIModel.ask() work end-to-end. Both benchmarks have parsers and scorers. Chess replies are parsed from the JSON block between the sentinels (JSON_START / JSON_END) when present; otherwise the parser falls back to a valid 6×8 Python list literal found in the reply text.


How it works

prompts.py  →  runner.run_benchmark()  →  model.AIModel.ask()  →  OpenRouter
                                                                      ↓
                                          scorers.score_*()  ←  parsers.parse_*()  

For cognitive flexibility and creativity, run_benchmark():

  1. Builds 5 prompts (one per variant — opening or word set).
  2. Calls the API 5 times via AIModel.ask().
  3. Parses each reply (errors are caught; failed parses become None).
  4. Passes the list of parses to the benchmark scorer, which returns the aggregate score.
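
The four steps above can be sketched as a plain loop. This is a minimal illustration and not the package's runner: run_pipeline, fake_ask, parse_reply, and score_all are hypothetical stand-ins so that the snippet runs without network access.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("pipeline_sketch")


def run_pipeline(
    prompts: list[str],
    ask: Callable[[str], str],
    parse: Callable[[str], object],
    score: Callable[[list], object],
) -> dict:
    """Send each prompt, parse each reply, then score all parses together."""
    parsed: list[Optional[object]] = []
    for i, prompt in enumerate(prompts, start=1):
        reply = ask(prompt)
        try:
            parsed.append(parse(reply))
        except Exception as exc:  # a failed parse must not stop the run
            logger.warning("call %d: parse failed: %s", i, exc)
            parsed.append(None)
    return {"num_calls": len(prompts), "parsed": parsed, "score": score(parsed)}


# Hypothetical stand-ins for the model call, parser, and scorer.
def fake_ask(prompt: str) -> str:
    return "" if "bad" in prompt else prompt.upper()


def parse_reply(reply: str) -> str:
    if not reply:
        raise ValueError("empty reply")
    return reply


def score_all(parsed: list) -> float:
    return sum(p is not None for p in parsed) / len(parsed)


result = run_pipeline(["a", "bad", "c"], fake_ask, parse_reply, score_all)
print(result["parsed"])  # the second entry is None because its parse failed
```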

Requirements

  • Python 3.9+
  • OpenRouter API key
  • Network access for live runs

Installation

git clone https://github.com/Programming-The-Next-Step-2026/mini_benchmarks.git
cd mini_benchmarks
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

Creativity scoring needs the spaCy English model (installed automatically with the package above). If you still see Can't find model 'en_core_web_sm', run:

python -m spacy download en_core_web_sm

Verify:

python -c "from miniBen import BENCHMARKS; print(list(BENCHMARKS))"
# ['cognitive flexibility', 'creativity']

API key

Set OPENROUTER_API_KEY (loaded from .env via python-dotenv on import).

.env in project root:

OPENROUTER_API_KEY=sk-or-v1-your-key-here

Shell:

export OPENROUTER_API_KEY="sk-or-v1-your-key-here"

Python:

from miniBen import check_openrouter_api_key_exist, put_openrouter_api_key_into_env

if not check_openrouter_api_key_exist():
    put_openrouter_api_key_into_env()

Keys: openrouter.ai/keys. Never commit secrets.


Quick start

Streamlit UI (web app)

Install the optional UI extra:

pip install -e ".[ui]"

Run the web app:

streamlit run src/miniBen/streamlit_app.py

Command line (example.py)

python -m miniBen.example --list
python -m miniBen.example --benchmark creativity --model openrouter/free
python -m miniBen.example --benchmark "cognitive flexibility" --reasoning

Full benchmark (5 API calls)

from miniBen import run_benchmark

results = run_benchmark(
    model_name="openrouter/free",
    benchmark_name="creativity",
    reasoning=False,
)

print(results["num_calls"])   # 5
print(results["outputs"][0])  # per-call detail
print(results["score"])       # aggregate benchmark score

Benchmarks

| benchmark_name | Display name | Calls | What changes each call |
| --- | --- | --- | --- |
| cognitive flexibility | Meta-Chess Game (Cognitive Flexibility) | 5 | Trial 1 opening |
| creativity | Creativity in story writing | 5 | Required three words |

Cognitive flexibility — openings

Defined in COG_FLEX_OPENINGS (prompts.py):

| # | White | Black |
| --- | --- | --- |
| 1 | e4 | e5 |
| 2 | e4 | c5 |
| 3 | d4 | d5 |
| 4 | e4 | e6 |
| 5 | d4 | c6 |

Each call uses build_cog_flex_prompt(white, black). The model is asked for one Meta-Chess game as JSON between <<<JSON_START>>> / <<<JSON_END>>> (see JSON_START, JSON_END in prompts.py). parse_chess_output() prefers that JSON block; if sentinels are missing but the reply still contains a valid 6×8 list-of-lists of SAN strings, the parser accepts it as a fallback.
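
The two-stage parsing described in the paragraph above could look roughly like the sketch below. It is not the code in parsers.py: parse_chess_sketch and is_valid_matrix are hypothetical names, the {"trials": [...]} payload shape is taken from the Troubleshooting table, and the regular expression used for the fallback is an assumption.

```python
import ast
import json
import re
from typing import Optional

# Sentinel values as written in the paragraph above.
JSON_START = "<<<JSON_START>>>"
JSON_END = "<<<JSON_END>>>"


def is_valid_matrix(obj: object, rows: int = 6, cols: int = 8) -> bool:
    """True if obj is a rows x cols list of lists of strings."""
    return (
        isinstance(obj, list)
        and len(obj) == rows
        and all(
            isinstance(row, list)
            and len(row) == cols
            and all(isinstance(move, str) for move in row)
            for row in obj
        )
    )


def parse_chess_sketch(raw: str) -> Optional[list]:
    """Prefer the sentinel-delimited JSON block; fall back to a list literal."""
    start = raw.find(JSON_START)
    end = raw.find(JSON_END, start + len(JSON_START)) if start != -1 else -1
    if start != -1 and end != -1:
        try:
            payload = json.loads(raw[start + len(JSON_START):end])
        except json.JSONDecodeError:
            payload = None
        # Assumed payload shape: {"trials": [[...], ...]}
        trials = payload.get("trials") if isinstance(payload, dict) else payload
        if is_valid_matrix(trials):
            return trials
    # Fallback: try every "[[ ... ]]" span in the reply as a Python literal.
    for match in re.finditer(r"\[\s*\[.*?\]\s*\]", raw, flags=re.DOTALL):
        try:
            candidate = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            continue
        if is_valid_matrix(candidate):
            return candidate
    return None
```

ast.literal_eval is used for the fallback because it only evaluates literals, so arbitrary code in a model reply is never executed.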

Creativity — word triplets

Defined in CREATIVITY_WORD_TRIPLETS:

| # | Words |
| --- | --- |
| 1 | stamp, letter, send |
| 2 | kitchen, cook, food |
| 3 | bike, wheel, ride |
| 4 | book, read, story |
| 5 | study, laptop, desk |

Each call uses build_creativity_prompt(word1, word2, word3) for one five-sentence story. score_creativity() computes surprise per story from sentence-level semantic shifts. Novelty (distinctiveness vs. the other stories in the run) is only defined when at least two stories parse successfully; the aggregate score then includes novelty_scores and avg_novelty.
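
To make the two measures concrete, here is a toy sketch. It assumes surprise is the mean cosine distance between consecutive sentences and novelty is the mean cosine distance from one story to the others; neither formula is confirmed by this README. It also uses bag-of-words counts in place of the spaCy vectors the package relies on, so the numbers are illustrative only.

```python
import math
import re
from collections import Counter
from typing import Optional


def bow(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for real sentence embeddings."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    # Clamp at 0 to absorb floating-point error for identical vectors.
    return max(0.0, 1.0 - dot / norm) if norm else 1.0


def surprise(story: str) -> float:
    """Mean distance between consecutive sentences (assumed definition)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    if len(sentences) < 2:
        return 0.0
    vecs = [bow(s) for s in sentences]
    shifts = [cosine_distance(x, y) for x, y in zip(vecs, vecs[1:])]
    return sum(shifts) / len(shifts)


def novelty(stories: list[str]) -> Optional[list[float]]:
    """Mean distance from each story to the others; None below two stories."""
    if len(stories) < 2:
        return None
    vecs = [bow(s) for s in stories]
    return [
        sum(cosine_distance(v, w) for j, w in enumerate(vecs) if j != i)
        / (len(vecs) - 1)
        for i, v in enumerate(vecs)
    ]
```

Returning None for a single story mirrors the rule that novelty is only defined when at least two stories parse.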


API reference

AIModel (model.py)

Constructor: AIModel(model: str), where model is an OpenRouter model slug
Method: ask(prompt: str, reasoning: bool = True) -> tuple[str, str | None]
Client: base_url="https://openrouter.ai/api/v1"

run_benchmark (runner.py)

| Argument | Description |
| --- | --- |
| model_name | OpenRouter model slug |
| benchmark_name | cognitive flexibility or creativity |
| reasoning | Passed to every ask() (default True) |

Raises KeyError for unknown benchmarks. Raises ValueError if prompt and variant counts differ.

Prompt helpers (prompts.py)

| Name | Role |
| --- | --- |
| cog_flex_prompts() | List of 5 Meta-Chess prompts |
| creativity_prompts() | List of 5 story prompts |
| build_cog_flex_prompt(white, black) | Single chess prompt |
| build_creativity_prompt(w1, w2, w3) | Single story prompt |
| JSON_START, JSON_END | Sentinel strings for chess JSON output |

Parsers (parsers.py)

| Function | Returns |
| --- | --- |
| parse_chess_output(raw) | 6×8 SAN matrix, or None (JSON sentinels first, then list literal) |
| parse_creativity_output(raw) | {"story": str, "sentence_count": int}, or None if empty after strip |
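
As an illustration of the creativity parser's return shape, the sketch below strips the reply, returns None when nothing is left, and counts sentences with a simple punctuation heuristic. parse_story_sketch is a hypothetical name, and the sentence-splitting rule is an assumption; the package may count sentences differently.

```python
import re
from typing import Optional


def parse_story_sketch(raw: str) -> Optional[dict]:
    """Return the stripped story and a rough sentence count, or None if empty."""
    story = raw.strip()
    if not story:
        return None
    # Rough heuristic: split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story) if s]
    return {"story": story, "sentence_count": len(sentences)}
```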

Scorers (scorers.py)

| Function | Aggregate score keys (high level) |
| --- | --- |
| score_chess(parsed_list) | compliance_rate, total_violations, total_moves, failed_games, scored_games, games, … |
| score_creativity(parsed_list) | stories_scored, surprise_scores, avg_surprise, novelty_scores, avg_novelty |

Auth (auth.py)

| Function | Returns |
| --- | --- |
| check_openrouter_api_key_exist() | True if key is set |
| put_openrouter_api_key_into_env() | Prompts and sets key if missing |
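
The behaviour of the two helpers can be approximated as follows. This is a sketch under assumptions, not the contents of auth.py: key_exists and put_key are hypothetical names, and the env and prompt_fn parameters exist only so the example can be exercised without touching the real environment or a terminal.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping

ENV_VAR = "OPENROUTER_API_KEY"


def key_exists(env: MutableMapping[str, str] = os.environ) -> bool:
    """True if the variable is present and not blank."""
    return bool(env.get(ENV_VAR, "").strip())


def put_key(
    env: MutableMapping[str, str] = os.environ,
    prompt_fn: Callable[[str], str] = getpass,
) -> None:
    """Ask for the key only when it is missing, then store it in env."""
    if not key_exists(env):
        env[ENV_VAR] = prompt_fn("OpenRouter API key: ").strip()
```

getpass keeps the key from being echoed to the terminal while it is typed.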

Registry (runner.py)

from miniBen import BENCHMARKS

# Each entry: name, prompts(), variants(), parser, scorer
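
One way to picture a registry entry and the lookup errors run_benchmark raises is the sketch below. BenchmarkEntry, REGISTRY, and lookup are hypothetical; the real BENCHMARKS mapping may use a different container, and only the five field names come from the comment above.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class BenchmarkEntry:
    """Hypothetical shape for one registry entry."""
    name: str
    prompts: Callable[[], list]
    variants: Callable[[], list]
    parser: Callable[[str], Any]
    scorer: Callable[[list], Any]


REGISTRY = {
    "echo": BenchmarkEntry(
        name="Echo (toy benchmark)",
        prompts=lambda: ["say a", "say b"],
        variants=lambda: [{"letter": "a"}, {"letter": "b"}],
        parser=str.strip,
        scorer=lambda parsed: sum(p is not None for p in parsed),
    ),
}


def lookup(benchmark_name: str) -> BenchmarkEntry:
    """Fetch an entry and check that prompts and variants line up."""
    try:
        entry = REGISTRY[benchmark_name]
    except KeyError:
        raise KeyError(f"Unknown benchmark: {benchmark_name!r}") from None
    if len(entry.prompts()) != len(entry.variants()):
        raise ValueError("prompt and variant counts differ")
    return entry
```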

Return values

AIModel.ask()

(content: str, reasoning_text: str | None)

run_benchmark()

{
    "model": str,
    "benchmark": str,
    "num_calls": int,           # len(prompts), typically 5
    "variants": list[dict],     # per-call metadata
    "parsed": list,             # one entry per call (None if parse fails)
    "outputs": list[dict],      # see below
    "score": Any,               # aggregate score from scorer(parsed)
}

Each outputs item:

{
    "call": int,              # 1 … num_calls
    "variant": dict,          # e.g. {"white": "e4", "black": "e5"}
                              # or {"words": ["stamp", "letter", "send"]}
    "content": str,           # model reply text
    "reasoning": str | None,
    "parsed": Any | None,
}

Parse failures are logged as warnings; that call’s parsed is None and the run continues.
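
A result dictionary in the shape documented above can be inspected with a few lines of plain Python. The helper below is a hypothetical convenience, and the sample is hand-written rather than real model output.

```python
def summarize(results: dict) -> dict:
    """Count parse failures and list the calls that failed."""
    total = results["num_calls"]
    failed = [item["call"] for item in results["outputs"] if item["parsed"] is None]
    return {
        "num_calls": total,
        "failed_calls": failed,
        "parse_rate": 1 - len(failed) / total if total else 0.0,
    }


# Hand-written sample in the documented shape (not real model output).
sample = {
    "model": "openrouter/free",
    "benchmark": "creativity",
    "num_calls": 2,
    "outputs": [
        {"call": 1, "variant": {"words": ["stamp", "letter", "send"]},
         "content": "A story.", "reasoning": None,
         "parsed": {"story": "A story.", "sentence_count": 1}},
        {"call": 2, "variant": {"words": ["kitchen", "cook", "food"]},
         "content": "", "reasoning": None, "parsed": None},
    ],
}

print(summarize(sample))
```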

Aggregate score (benchmark-specific)

Cognitive flexibility (score_chess):

{
    "compliance_rate": float,      # 0.0–1.0; parse-failed games excluded
    "total_violations": int,
    "total_moves": int,
    "failed_games": int,
    "scored_games": int,
    "per_trial_violations": list[int],
    "games": list[dict],           # per-game detail
}
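
The sketch below shows how parse-failed games can be excluded from the aggregate. It is illustrative only: the per-game keys violations and moves are hypothetical, and the formula for compliance_rate (share of moves without a violation) is an assumption that this README does not confirm.

```python
from typing import Optional


def aggregate_chess_sketch(games: list[Optional[dict]]) -> dict:
    """Aggregate per-game counts; None marks a game whose parse failed."""
    scored = [g for g in games if g is not None]
    total_violations = sum(g["violations"] for g in scored)
    total_moves = sum(g["moves"] for g in scored)
    return {
        # Assumed formula: share of moves without a violation.
        "compliance_rate": 1 - total_violations / total_moves if total_moves else 0.0,
        "total_violations": total_violations,
        "total_moves": total_moves,
        "failed_games": len(games) - len(scored),
        "scored_games": len(scored),
    }


# Two scored games of 6 x 8 = 48 moves each, plus one parse failure.
summary = aggregate_chess_sketch(
    [{"violations": 12, "moves": 48}, None, {"violations": 0, "moves": 48}]
)
```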

Creativity (score_creativity):

{
    "stories_scored": int,
    "surprise_scores": list[float],   # one per successfully parsed story
    "avg_surprise": float,
    "novelty_scores": list[float] | None,  # None if fewer than 2 stories
    "avg_novelty": float | None,
}

Project layout

mini_benchmarks/
├── .miniben/
│   └── run_history.json
├── docs/
│   ├── vignette.ipynb
│   ├── vignette.md
│   └── workflow.md
├── src/
│   ├── miniBen/
│   │   ├── __init__.py
│   │   ├── auth.py
│   │   ├── example.py
│   │   ├── job_runner.py
│   │   ├── model.py
│   │   ├── parsers.py
│   │   ├── prompts.py
│   │   ├── runner.py
│   │   ├── scorers.py
│   │   ├── streamlit_app.py
│   │   └── ui_helpers.py
│   └── help.py
├── tests/
│   ├── conftest.py
│   ├── test_auth.py
│   ├── test_job_runner.py
│   ├── test_model.py
│   ├── test_parsers.py
│   ├── test_runner.py
│   ├── test_scorers.py
│   └── test_ui_helpers.py
├── .env
├── LICENSE
├── pyproject.toml
└── README.md

Import from: from miniBen import …
CLI: python -m miniBen.example


Development

Tests

pytest

The tests mock the OpenAI client, so no API key or network access is needed.

Add a benchmark

  1. Add variant constants and a *_prompts() -> list[str] factory in prompts.py.
  2. Implement parse_* and score_*.
  3. Add a BENCHMARKS entry in runner.py with prompts, variants, parser, scorer.
  4. Export from __init__.py if public.
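
As a rough walk-through of steps 1 to 3, here is a toy addition benchmark. Every name in it (SUM_PAIRS, sum_prompts, sum_variants, parse_sum, score_sum, NEW_ENTRY) is hypothetical, and the dictionary form of the registry entry is an assumption based on the field names listed under Registry.

```python
# Step 1: variant constants and a prompt factory (toy example).
SUM_PAIRS = [(1, 2), (3, 4), (5, 6)]


def sum_prompts() -> list[str]:
    return [f"Reply with only the number: {a} + {b}" for a, b in SUM_PAIRS]


def sum_variants() -> list[dict]:
    return [{"a": a, "b": b} for a, b in SUM_PAIRS]


# Step 2: parser and scorer.
def parse_sum(raw: str) -> int:
    # Raises ValueError on junk; a runner that catches errors records None.
    return int(raw.strip())


def score_sum(parsed_list: list) -> dict:
    expected = [a + b for a, b in SUM_PAIRS]
    correct = sum(p == e for p, e in zip(parsed_list, expected))
    return {"accuracy": correct / len(expected)}


# Step 3: registry entry (field names assumed from the Registry section).
NEW_ENTRY = {
    "name": "Toy addition",
    "prompts": sum_prompts,
    "variants": sum_variants,
    "parser": parse_sum,
    "scorer": score_sum,
}
```

Keeping the prompt and variant factories driven by the same constant makes it hard for their lengths to drift apart, which is the condition run_benchmark checks before calling the model.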

Models

Use model slugs from the OpenRouter models list. If a model fails when reasoning is enabled, pass reasoning=False.


Troubleshooting

| Issue | Fix |
| --- | --- |
| 401 / missing key | Set OPENROUTER_API_KEY in .env or environment |
| KeyError: Unknown benchmark | Use cognitive flexibility or creativity exactly |
| ValueError: None content block | Try another model or reasoning=False |
| parsed is often None (chess) | Prefer JSON between JSON_START / JSON_END with {"trials": [[...], ...]} (6×8 strings); or a bare 6×8 list literal of SAN moves |
| avg_novelty is n/a (creativity) | Need at least two successfully parsed stories in the same run |
| [E050] Can't find model 'en_core_web_sm' | Run python -m spacy download en_core_web_sm or reinstall: pip install -e . |
| ImportError | pip install -e . from repo root |
| Rate limits | Wait or switch model on OpenRouter |

Citations

This package

@software{yang2026miniben,
  author = {Leyla Yang},
  title  = {miniBen: Mini LLM Benchmarks},
  year   = {2026},
  url    = {https://github.com/Programming-The-Next-Step-2026/mini_benchmarks}
}

Benchmark sources

| Task | Credit |
| --- | --- |
| Meta-Chess / cognitive flexibility | Original to this project |
| Creativity | Adapted from creative-story-gen (arXiv:2411.02316) |

@misc{ismayilzada2024evaluatingcreativeshortstory,
  title         = {Evaluating Creative Short Story Generation in Humans and Large Language Models},
  author        = {Mete Ismayilzada and Claire Stevenson and Lonneke van der Plas},
  year          = {2024},
  eprint        = {2411.02316},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2411.02316}
}

License

MIT; see LICENSE. Upstream benchmark repositories carry their own licenses, which apply if you integrate them.

