miniBen evaluates Large Language Models on small benchmarks through OpenRouter. Each benchmark runs several prompt variants, sends them to a model, parses replies, and aggregates a score.
Status:
`run_benchmark()` and `AIModel.ask()` work end-to-end. Both benchmarks have parsers and scorers. Chess replies are parsed from JSON sentinels (`JSON_START`/`JSON_END`) when present, with a fallback to a valid 6×8 Python list literal in the reply text.
- How it works
- Requirements
- Installation
- API key
- Quick start
- Benchmarks
- API reference
- Return values
- Project layout
- Development
- Troubleshooting
- Citations
- License

## How it works

```text
prompts.py → runner.run_benchmark() → model.AIModel.ask() → OpenRouter
                                              ↓
                  scorers.score_*() ← parsers.parse_*()
```

For cognitive flexibility and creativity, `run_benchmark()`:

- Builds 5 prompts (one per variant: opening or word set).
- Calls the API 5 times via `AIModel.ask()`.
- Parses each reply (errors are caught; failed parses become `None`).
- Scores each parse with the benchmark scorer.
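
The loop above can be sketched without any network access. This is a minimal illustration, not miniBen's implementation: `run_pipeline`, `fake_ask`, `strict_parse`, and `count_parsed` are hypothetical names, and the real `run_benchmark()` returns more fields.

```python
from typing import Any, Callable, Optional


def run_pipeline(
    prompts: list[str],
    ask: Callable[[str], str],
    parse: Callable[[str], Any],
    score: Callable[[list], Any],
) -> dict:
    """Hypothetical stand-in for the build -> ask -> parse -> score loop."""
    parsed: list[Optional[Any]] = []
    for prompt in prompts:
        reply = ask(prompt)
        try:
            parsed.append(parse(reply))
        except Exception:
            # A failed parse is recorded as None and the run continues.
            parsed.append(None)
    return {"num_calls": len(prompts), "parsed": parsed, "score": score(parsed)}


def fake_ask(prompt: str) -> str:
    """Offline replacement for a model call: empty reply for 'bad' prompts."""
    return "" if "bad" in prompt else prompt.upper()


def strict_parse(reply: str) -> str:
    """Toy parser that rejects empty replies."""
    if not reply:
        raise ValueError("empty reply")
    return reply


def count_parsed(parsed: list) -> int:
    """Toy scorer: number of replies that parsed successfully."""
    return sum(item is not None for item in parsed)


result = run_pipeline(["one", "bad", "three"], fake_ask, strict_parse, count_parsed)
print(result)
```

Passing the model call, parser, and scorer in as callables keeps the sketch testable offline; a live run would swap `fake_ask` for a real API call.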

## Requirements

- Python 3.9+
- OpenRouter API key
- Network access for live runs

## Installation

```bash
git clone https://github.com/Programming-The-Next-Step-2026/mini_benchmarks.git
cd mini_benchmarks
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

Creativity scoring needs the spaCy English model (installed automatically with the package above). If you still see `Can't find model 'en_core_web_sm'`, run:

```bash
python -m spacy download en_core_web_sm
```

Verify:

```bash
python -c "from miniBen import BENCHMARKS; print(list(BENCHMARKS))"
# ['cognitive flexibility', 'creativity']
```

## API key

Set `OPENROUTER_API_KEY` (loaded from `.env` via python-dotenv on import).

`.env` in project root:

```
OPENROUTER_API_KEY=sk-or-v1-your-key-here
```

Shell:

```bash
export OPENROUTER_API_KEY="sk-or-v1-your-key-here"
```

Python:

```python
from miniBen import check_openrouter_api_key_exist, put_openrouter_api_key_into_env

if not check_openrouter_api_key_exist():
    put_openrouter_api_key_into_env()
```

Keys: openrouter.ai/keys. Never commit secrets.
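
For scripts that should fail early when the key is missing, a standard-library check along these lines may help. It is only a sketch: `api_key_is_set` is a hypothetical helper, and it does not claim to mirror how `check_openrouter_api_key_exist()` is implemented.

```python
import os
from typing import Mapping, Optional


def api_key_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if OPENROUTER_API_KEY is present and not blank."""
    env = os.environ if env is None else env
    return bool(env.get("OPENROUTER_API_KEY", "").strip())


# Passing a plain dict makes the check easy to exercise without touching os.environ.
print(api_key_is_set({"OPENROUTER_API_KEY": "sk-or-v1-your-key-here"}))  # True
print(api_key_is_set({"OPENROUTER_API_KEY": "   "}))                     # False
print(api_key_is_set({}))                                                # False
```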

## Quick start

Install the optional UI extra:

```bash
pip install -e ".[ui]"
```

Run the web app:

```bash
streamlit run src/miniBen/streamlit_app.py
```

CLI:

```bash
python -m miniBen.example --list
python -m miniBen.example --benchmark creativity --model openrouter/free
python -m miniBen.example --benchmark "cognitive flexibility" --reasoning
```

Python:

```python
from miniBen import run_benchmark

results = run_benchmark(
    model_name="openrouter/free",
    benchmark_name="creativity",
    reasoning=False,
)
print(results["num_calls"])   # 5
print(results["outputs"][0])  # per-call detail
print(results["score"])       # aggregate benchmark score
```

## Benchmarks

| `benchmark_name` | Display name | Calls | What changes each call |
|---|---|---|---|
| `cognitive flexibility` | Meta-Chess Game (Cognitive Flexibility) | 5 | Trial 1 opening |
| `creativity` | Creativity in story writing | 5 | Required three words |


The cognitive flexibility openings are defined in `COG_FLEX_OPENINGS` (`prompts.py`):

| # | White | Black |
|---|---|---|
| 1 | e4 | e5 |
| 2 | e4 | c5 |
| 3 | d4 | d5 |
| 4 | e4 | e6 |
| 5 | d4 | c6 |

Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`). `parse_chess_output()` prefers that JSON block; if sentinels are missing but the reply still contains a valid 6×8 list-of-lists of SAN strings, the parser accepts it as a fallback.
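
The sentinel-first, list-literal-second strategy can be sketched as follows. This is an illustration under assumptions, not the code in `parsers.py`: `parse_trials` and `is_6x8_matrix` are hypothetical names, the `"trials"` key follows the JSON shape shown in the troubleshooting table, and the regular expression is just one way to locate a list literal.

```python
import ast
import json
import re
from typing import Optional

# Sentinel strings as quoted in the text above.
JSON_START = "<<<JSON_START>>>"
JSON_END = "<<<JSON_END>>>"


def is_6x8_matrix(value: object) -> bool:
    """True for a list of 6 lists, each holding 8 strings."""
    return (
        isinstance(value, list)
        and len(value) == 6
        and all(
            isinstance(row, list)
            and len(row) == 8
            and all(isinstance(move, str) for move in row)
            for row in value
        )
    )


def parse_trials(raw: str) -> Optional[list]:
    """Hypothetical parser: sentinel JSON first, then a bare list literal."""
    start, end = raw.find(JSON_START), raw.find(JSON_END)
    if start != -1 and end > start:
        try:
            payload = json.loads(raw[start + len(JSON_START):end])
        except json.JSONDecodeError:
            payload = None
        # Accept {"trials": [[...], ...]} or a bare matrix inside the sentinels.
        trials = payload.get("trials") if isinstance(payload, dict) else payload
        if is_6x8_matrix(trials):
            return trials
    # Fallback: the first "[[ ... ]]" span that evaluates to a 6x8 matrix.
    for match in re.finditer(r"\[\s*\[.*?\]\s*\]", raw, flags=re.DOTALL):
        try:
            candidate = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            continue
        if is_6x8_matrix(candidate):
            return candidate
    return None


game = [["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6"] for _ in range(6)]
payload_text = json.dumps({"trials": game})
sentinel_reply = "Here is the game.\n" + JSON_START + payload_text + JSON_END
print(parse_trials(sentinel_reply) == game)        # True
print(parse_trials("Moves: " + repr(game)) == game)  # True (fallback path)
print(parse_trials("I cannot play chess."))        # None
```

`ast.literal_eval` is used for the fallback because it evaluates only Python literals, so arbitrary text from a model reply is never executed.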

The creativity word sets are defined in `CREATIVITY_WORD_TRIPLETS`:

| # | Words |
|---|---|
| 1 | stamp, letter, send |
| 2 | kitchen, cook, food |
| 3 | bike, wheel, ride |
| 4 | book, read, story |
| 5 | study, laptop, desk |

Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story. `score_creativity()` computes surprise per story from sentence-level semantic shifts. Novelty (distinctiveness vs. the other stories in the run) is only defined when at least two stories parse successfully; the aggregate score then includes `novelty_scores` and `avg_novelty`.
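
To make the two measures concrete, here is a toy version that uses word-count vectors and cosine distance in place of real semantic embeddings. Everything in it is an assumption for illustration: the function names, the sentence splitter, and the distance measure are stand-ins, and `score_creativity()` (which relies on spaCy) may compute both quantities differently.

```python
import math
import re
from collections import Counter
from typing import Optional


def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity, clamped at 0; 1.0 if either vector is empty."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else max(0.0, 1.0 - dot / norm)


def surprise(story: str) -> float:
    """Mean distance between consecutive sentences (assumed definition)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    vectors = [bag_of_words(s) for s in sentences]
    shifts = [cosine_distance(x, y) for x, y in zip(vectors, vectors[1:])]
    return sum(shifts) / len(shifts) if shifts else 0.0


def novelty_scores(stories: list[str]) -> Optional[list[float]]:
    """Mean distance from each story to the others; None below two stories."""
    if len(stories) < 2:
        return None
    vectors = [bag_of_words(s) for s in stories]
    return [
        sum(cosine_distance(v, other) for j, other in enumerate(vectors) if j != i)
        / (len(vectors) - 1)
        for i, v in enumerate(vectors)
    ]


stories = ["The stamp was old. The stamp was old.", "Cats purr. Dogs bark."]
print([surprise(s) for s in stories])
print(novelty_scores(stories))
print(novelty_scores(stories[:1]))  # None
```

The last call shows why `novelty_scores` and `avg_novelty` can be `None`: with a single parsed story there is nothing to compare against.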
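
## API reference

`AIModel`:

| Item | Details |
|---|---|
| Constructor | `AIModel(model: str)`, where `model` is an OpenRouter model slug |
| Method | `ask(prompt: str, reasoning: bool = True) -> tuple[str, str \| None]`, returning `(content: str, reasoning_text: str \| None)` |
| Client | `base_url="https://openrouter.ai/api/v1"` |
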

`run_benchmark()` arguments:

| Argument | Description |
|---|---|
| `model_name` | OpenRouter model slug |
| `benchmark_name` | `cognitive flexibility` or `creativity` |
| `reasoning` | Passed to every `ask()` (default `True`) |

Raises `KeyError` for unknown benchmarks. Raises `ValueError` if prompt and variant counts differ.


Prompts (`prompts.py`):

| Name | Role |
|---|---|
| `cog_flex_prompts()` | List of 5 Meta-Chess prompts |
| `creativity_prompts()` | List of 5 story prompts |
| `build_cog_flex_prompt(white, black)` | Single chess prompt |
| `build_creativity_prompt(w1, w2, w3)` | Single story prompt |
| `JSON_START`, `JSON_END` | Sentinel strings for chess JSON output |


Parsers:

| Function | Returns |
|---|---|
| `parse_chess_output(raw)` | 6×8 SAN matrix, or `None` (JSON sentinels first, then list literal) |
| `parse_creativity_output(raw)` | `{"story": str, "sentence_count": int}`, or `None` if empty after strip |


Scorers:

| Function | Aggregate score keys (high level) |
|---|---|
| `score_chess(parsed_list)` | `compliance_rate`, `total_violations`, `total_moves`, `failed_games`, `scored_games`, `games`, … |
| `score_creativity(parsed_list)` | `stories_scored`, `surprise_scores`, `avg_surprise`, `novelty_scores`, `avg_novelty` |


API key helpers:

| Function | Returns |
|---|---|
| `check_openrouter_api_key_exist()` | `True` if key is set |
| `put_openrouter_api_key_into_env()` | Prompts and sets key if missing |


Benchmark registry:

```python
from miniBen import BENCHMARKS
# Each entry: name, prompts(), variants(), parser, scorer
```

## Return values

`run_benchmark()` returns a dict:

```python
{
    "model": str,
    "benchmark": str,
    "num_calls": int,        # len(prompts), typically 5
    "variants": list[dict],  # per-call metadata
    "parsed": list,          # one entry per call (None if parse fails)
    "outputs": list[dict],   # see below
    "score": Any,            # aggregate score from scorer(parsed)
}
```

Each `outputs` item:

```python
{
    "call": int,             # 1 … num_calls
    "variant": dict,         # e.g. {"white": "e4", "black": "e5"}
                             # or {"words": ["stamp", "letter", "send"]}
    "content": str,          # model reply text
    "reasoning": str | None,
    "parsed": Any | None,
}
```

Parse failures are logged as warnings; that call's `parsed` is `None` and the run continues.
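
Because failed parses stay in the result as `None`, it can be useful to check how many calls actually parsed before trusting the aggregate score. The snippet below works on a hand-written sample in the documented shape (not real model output); `failed_calls` and `parse_rate` are illustrative names, and the `score` field is left out of the sample on purpose.

```python
# Hand-written sample shaped like the run_benchmark() result described above.
results = {
    "model": "openrouter/free",
    "benchmark": "creativity",
    "num_calls": 2,
    "variants": [
        {"words": ["stamp", "letter", "send"]},
        {"words": ["kitchen", "cook", "food"]},
    ],
    "parsed": [{"story": "A letter arrived.", "sentence_count": 1}, None],
    "outputs": [
        {
            "call": 1,
            "variant": {"words": ["stamp", "letter", "send"]},
            "content": "A letter arrived.",
            "reasoning": None,
            "parsed": {"story": "A letter arrived.", "sentence_count": 1},
        },
        {
            "call": 2,
            "variant": {"words": ["kitchen", "cook", "food"]},
            "content": "",
            "reasoning": None,
            "parsed": None,  # parse failed for this call
        },
    ],
}

# Collect the 1-based call numbers whose reply could not be parsed.
failed_calls = [item["call"] for item in results["outputs"] if item["parsed"] is None]
parse_rate = 1 - len(failed_calls) / results["num_calls"]
print(f"failed calls: {failed_calls}, parse rate: {parse_rate:.0%}")
```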

Cognitive flexibility (`score_chess`):

```python
{
    "compliance_rate": float,           # 0.0–1.0; parse-failed games excluded
    "total_violations": int,
    "total_moves": int,
    "failed_games": int,
    "scored_games": int,
    "per_trial_violations": list[int],
    "games": list[dict],                # per-game detail
}
```

Creativity (`score_creativity`):

```python
{
    "stories_scored": int,
    "surprise_scores": list[float],        # one per successfully parsed story
    "avg_surprise": float,
    "novelty_scores": list[float] | None,  # None if fewer than 2 stories
    "avg_novelty": float | None,
}
```

## Project layout

```text
mini_benchmarks/
├── .miniben/
│   └── run_history.json
├── docs/
│   ├── vignette.ipynb
│   ├── vignette.md
│   └── workflow.md
├── src/
│   ├── miniBen/
│   │   ├── __init__.py
│   │   ├── auth.py
│   │   ├── example.py
│   │   ├── job_runner.py
│   │   ├── model.py
│   │   ├── parsers.py
│   │   ├── prompts.py
│   │   ├── runner.py
│   │   ├── scorers.py
│   │   ├── streamlit_app.py
│   │   └── ui_helpers.py
│   └── help.py
├── tests/
│   ├── conftest.py
│   ├── test_auth.py
│   ├── test_job_runner.py
│   ├── test_model.py
│   ├── test_parsers.py
│   ├── test_runner.py
│   ├── test_scorers.py
│   └── test_ui_helpers.py
├── .env
├── LICENSE
├── pyproject.toml
└── README.md
```

Import from: `from miniBen import …`

CLI: `python -m miniBen.example`

## Development

```bash
pytest
```

The tests mock OpenAI, so no API key or network is needed.

Adding a benchmark:

- Add variant constants and a `*_prompts() -> list[str]` factory in `prompts.py`.
- Implement `parse_*` and `score_*`.
- Add a `BENCHMARKS` entry in `runner.py` with `prompts`, `variants`, `parser`, `scorer`.
- Export from `__init__.py` if public.
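
As a rough picture of what those steps produce, the sketch below wires a toy "echo" benchmark into a local registry. It assumes registry entries are plain dicts keyed by the field names listed above; the actual layout of `BENCHMARKS` in `runner.py` may differ, and every name here (`ECHO_WORDS`, `echo_prompts`, `parse_echo`, `score_echo`, `EXAMPLE_BENCHMARKS`) is hypothetical.

```python
# Step 1: variant constants and a *_prompts() factory (would live in prompts.py).
ECHO_WORDS = ["alpha", "beta"]


def echo_prompts() -> list[str]:
    return [f"Repeat the word '{word}' exactly once." for word in ECHO_WORDS]


def echo_variants() -> list[dict]:
    return [{"word": word} for word in ECHO_WORDS]


# Step 2: a parser and a scorer.
def parse_echo(raw: str):
    """Return the normalised reply, or None if it is empty."""
    text = raw.strip().lower()
    return text or None


def score_echo(parsed_list: list) -> dict:
    """Fraction of replies that match the requested word."""
    hits = sum(p == v["word"] for p, v in zip(parsed_list, echo_variants()))
    return {"accuracy": hits / len(parsed_list) if parsed_list else 0.0}


# Step 3: a registry entry (a local dict here, standing in for BENCHMARKS).
EXAMPLE_BENCHMARKS = {
    "echo": {
        "name": "Echo (toy example)",
        "prompts": echo_prompts,
        "variants": echo_variants,
        "parser": parse_echo,
        "scorer": score_echo,
    }
}

entry = EXAMPLE_BENCHMARKS["echo"]
# Prompt and variant counts must agree, mirroring the documented ValueError check.
if len(entry["prompts"]()) != len(entry["variants"]()):
    raise ValueError("prompt and variant counts differ")

parsed = [entry["parser"](reply) for reply in ["Alpha", "  "]]
print(entry["scorer"](parsed))  # {'accuracy': 0.5}
```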

## Troubleshooting

Use model slugs from OpenRouter models. If reasoning breaks, pass `reasoning=False`.

| Issue | Fix |
|---|---|
| 401 / missing key | Set `OPENROUTER_API_KEY` in `.env` or environment |
| `KeyError: Unknown benchmark` | Use `cognitive flexibility` or `creativity` exactly |
| `ValueError: None content block` | Try another model or `reasoning=False` |
| `parsed` is often `None` (chess) | Prefer JSON between `JSON_START` / `JSON_END` with `{"trials": [[...], ...]}` (6×8 strings); or a bare 6×8 list literal of SAN moves |
| `avg_novelty` is n/a (creativity) | Need at least two successfully parsed stories in the same run |
| `[E050] Can't find model 'en_core_web_sm'` | Run `python -m spacy download en_core_web_sm` or reinstall: `pip install -e .` |
| `ImportError` | `pip install -e .` from repo root |
| Rate limits | Wait or switch model on OpenRouter |


## Citations

```bibtex
@software{yang2026miniben,
  author = {Leyla Yang},
  title  = {miniBen: Mini LLM Benchmarks},
  year   = {2026},
  url    = {https://github.com/Programming-The-Next-Step-2026/mini_benchmarks}
}
```

| Task | Credit |
|---|---|
| Meta-Chess / cognitive flexibility | Original to this project |
| Creativity | Adapted from creative-story-gen (arXiv:2411.02316) |

```bibtex
@misc{ismayilzada2024evaluatingcreativeshortstory,
  title         = {Evaluating Creative Short Story Generation in Humans and Large Language Models},
  author        = {Mete Ismayilzada and Claire Stevenson and Lonneke van der Plas},
  year          = {2024},
  eprint        = {2411.02316},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2411.02316}
}
```

## License

MIT (see LICENSE). Upstream benchmark repos have their own licenses when you integrate them.