miniBen — Mini LLM Benchmarks

miniBen evaluates Large Language Models on small benchmarks through OpenRouter. Each benchmark runs several prompt variants, sends them to a model, parses replies, and aggregates a score.

Status: run_benchmark() and AIModel.ask() work end-to-end. Both benchmarks have parsers and scorers. Chess replies are parsed from JSON sentinels (JSON_START / JSON_END) when present, with a fallback to a valid 6×8 Python list literal in the reply text.

How it works

prompts.py  →  runner.run_benchmark()  →  model.AIModel.ask()  →  OpenRouter
                                                                      ↓
                                          scorers.score_*()  ←  parsers.parse_*()

For cognitive flexibility and creativity, run_benchmark():

Builds 5 prompts (one per variant — opening or word set).
Calls the API 5 times via AIModel.ask().
Parses each reply (errors are caught; failed parses become None).
Scores each parse with the benchmark scorer.

Requirements

Python 3.9+
OpenRouter API key
Network access for live runs

Installation

git clone https://github.com/Programming-The-Next-Step-2026/mini_benchmarks.git
cd mini_benchmarks
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

Creativity scoring needs the spaCy English model (installed automatically with the package above). If you still see Can't find model 'en_core_web_sm', run:

python -m spacy download en_core_web_sm

Verify:

python -c "from miniBen import BENCHMARKS; print(list(BENCHMARKS))"
# ['cognitive flexibility', 'creativity']

API key

Set OPENROUTER_API_KEY (loaded from .env via python-dotenv on import).

.env in project root:

OPENROUTER_API_KEY=sk-or-v1-your-key-here

Shell:

export OPENROUTER_API_KEY="sk-or-v1-your-key-here"

Python:

from miniBen import check_openrouter_api_key_exist, put_openrouter_api_key_into_env

if not check_openrouter_api_key_exist():
    put_openrouter_api_key_into_env()

Keys: openrouter.ai/keys. Never commit secrets.

Quick start

Streamlit UI (web app)

Install the optional UI extra:

pip install -e ".[ui]"

Run the web app:

streamlit run src/miniBen/streamlit_app.py

Command line (`example.py`)

python -m miniBen.example --list
python -m miniBen.example --benchmark creativity --model openrouter/free
python -m miniBen.example --benchmark "cognitive flexibility" --reasoning

Full benchmark (5 API calls)

from miniBen import run_benchmark

results = run_benchmark(
    model_name="openrouter/free",
    benchmark_name="creativity",
    reasoning=False,
)

print(results["num_calls"])   # 5
print(results["outputs"][0])  # per-call detail
print(results["score"])       # aggregate benchmark score

Benchmarks

`benchmark_name`	Display name	Calls	What changes each call
`cognitive flexibility`	Meta-Chess Game (Cognitive Flexibility)	5	Trial 1 opening
`creativity`	Creativity in story writing	5	Required three words

Cognitive flexibility — openings

Defined in COG_FLEX_OPENINGS (prompts.py):

#	White	Black
1	e4	e5
2	e4	c5
3	d4	d5
4	e4	e6
5	d4	c6

Each call uses build_cog_flex_prompt(white, black). The model is asked for one Meta-Chess game as JSON between <<<JSON_START>>> / <<<JSON_END>>> (see JSON_START, JSON_END in prompts.py). parse_chess_output() prefers that JSON block; if sentinels are missing but the reply still contains a valid 6×8 list-of-lists of SAN strings, the parser accepts it as a fallback.

Creativity — word triplets

Defined in CREATIVITY_WORD_TRIPLETS:

#	Words
1	stamp, letter, send
2	kitchen, cook, food
3	bike, wheel, ride
4	book, read, story
5	study, laptop, desk

Each call uses build_creativity_prompt(word1, word2, word3) for one five-sentence story. score_creativity() computes surprise per story from sentence-level semantic shifts. Novelty (distinctiveness vs. the other stories in the run) is only defined when at least two stories parse successfully; the aggregate score then includes novelty_scores and avg_novelty.

API reference

`AIModel` (`model.py`)


Constructor	`AIModel(model: str)` — OpenRouter model slug
Method	`ask(prompt: str, reasoning: bool = True) -> tuple[str, str
Client	`base_url="https://openrouter.ai/api/v1"`

`run_benchmark` (`runner.py`)

Argument	Description
`model_name`	OpenRouter model slug
`benchmark_name`	`cognitive flexibility` or `creativity`
`reasoning`	Passed to every `ask()` (default `True`)

Raises KeyError for unknown benchmarks. Raises ValueError if prompt and variant counts differ.

Prompt helpers (`prompts.py`)

Name	Role
`cog_flex_prompts()`	List of 5 Meta-Chess prompts
`creativity_prompts()`	List of 5 story prompts
`build_cog_flex_prompt(white, black)`	Single chess prompt
`build_creativity_prompt(w1, w2, w3)`	Single story prompt
`JSON_START`, `JSON_END`	Sentinel strings for chess JSON output

Parsers (`parsers.py`)

Function	Returns
`parse_chess_output(raw)`	6×8 SAN matrix, or `None` (JSON sentinels first, then list literal)
`parse_creativity_output(raw)`	`{"story": str, "sentence_count": int}`, or `None` if empty after strip

Scorers (`scorers.py`)

Function	Aggregate score keys (high level)
`score_chess(parsed_list)`	`compliance_rate`, `total_violations`, `total_moves`, `failed_games`, `scored_games`, `games`, …
`score_creativity(parsed_list)`	`stories_scored`, `surprise_scores`, `avg_surprise`, `novelty_scores`, `avg_novelty`

Auth (`auth.py`)

Function	Returns
`check_openrouter_api_key_exist()`	`True` if key is set
`put_openrouter_api_key_into_env()`	Prompts and sets key if missing

Registry (`runner.py`)

from miniBen import BENCHMARKS

# Each entry: name, prompts(), variants(), parser, scorer

Return values

`AIModel.ask()`

(content: str, reasoning_text: str | None)

`run_benchmark()`

{
    "model": str,
    "benchmark": str,
    "num_calls": int,           # len(prompts), typically 5
    "variants": list[dict],     # per-call metadata
    "parsed": list,             # one entry per call (None if parse fails)
    "outputs": list[dict],      # see below
    "score": Any,               # aggregate score from scorer(parsed)
}

Each outputs item:

{
    "call": int,              # 1 … num_calls
    "variant": dict,          # e.g. {"white": "e4", "black": "e5"}
                              # or {"words": ["stamp", "letter", "send"]}
    "content": str,           # model reply text
    "reasoning": str | None,
    "parsed": Any | None,
}

Parse failures are logged as warnings; that call’s parsed is None and the run continues.

Aggregate `score` (benchmark-specific)

Cognitive flexibility (score_chess):

{
    "compliance_rate": float,      # 0.0–1.0; parse-failed games excluded
    "total_violations": int,
    "total_moves": int,
    "failed_games": int,
    "scored_games": int,
    "per_trial_violations": list[int],
    "games": list[dict],           # per-game detail
}

Creativity (score_creativity):

{
    "stories_scored": int,
    "surprise_scores": list[float],   # one per successfully parsed story
    "avg_surprise": float,
    "novelty_scores": list[float] | None,  # None if fewer than 2 stories
    "avg_novelty": float | None,
}

Project layout

mini_benchmarks/
├── .miniben/
│   └── run_history.json
├── docs/
│   ├── vignette.ipynb
│   ├── vignette.md
│   └── workflow.md
├── src/
│   ├── miniBen/
│   │   ├── __init__.py
│   │   ├── auth.py
│   │   ├── example.py
│   │   ├── job_runner.py
│   │   ├── model.py
│   │   ├── parsers.py
│   │   ├── prompts.py
│   │   ├── runner.py
│   │   ├── scorers.py
│   │   ├── streamlit_app.py
│   │   └── ui_helpers.py
│   └── help.py
├── tests/
│   ├── conftest.py
│   ├── test_auth.py
│   ├── test_job_runner.py
│   ├── test_model.py
│   ├── test_parsers.py
│   ├── test_runner.py
│   ├── test_scorers.py
│   └── test_ui_helpers.py
├── .env
├── LICENSE
├── pyproject.toml
└── README.md

Import from: from miniBen import …
CLI: python -m miniBen.example

Development

Tests

pytest

Mocks OpenAI — no API key or network needed.

Add a benchmark

Add variant constants and a *_prompts() -> list[str] factory in prompts.py.
Implement parse_* and score_*.
Add a BENCHMARKS entry in runner.py with prompts, variants, parser, scorer.
Export from __init__.py if public.

Models

Use slugs from OpenRouter models. If reasoning breaks, pass reasoning=False.

Troubleshooting

Issue	Fix
401 / missing key	Set `OPENROUTER_API_KEY` in `.env` or environment
`KeyError: Unknown benchmark`	Use `cognitive flexibility` or `creativity` exactly
`ValueError: None content block`	Try another model or `reasoning=False`
`parsed` is often `None` (chess)	Prefer JSON between `JSON_START` / `JSON_END` with `{"trials": [[...], ...]}` (6×8 strings); or a bare 6×8 list literal of SAN moves
`avg_novelty` is `n/a` (creativity)	Need at least two successfully parsed stories in the same run
`[E050] Can't find model 'en_core_web_sm'`	Run `python -m spacy download en_core_web_sm` or reinstall: `pip install -e .`
`ImportError`	`pip install -e .` from repo root
Rate limits	Wait or switch model on OpenRouter

Citations

This package

@software{yang2026miniben,
  author = {Leyla Yang},
  title  = {miniBen: Mini LLM Benchmarks},
  year   = {2026},
  url    = {https://github.com/Programming-The-Next-Step-2026/mini_benchmarks}
}

Benchmark sources

Task	Credit
Meta-Chess / cognitive flexibility	Original to this project
Creativity	Adapted from creative-story-gen — arXiv:2411.02316

@misc{ismayilzada2024evaluatingcreativeshortstory,
  title         = {Evaluating Creative Short Story Generation in Humans and Large Language Models},
  author        = {Mete Ismayilzada and Claire Stevenson and Lonneke van der Plas},
  year          = {2024},
  eprint        = {2411.02316},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2411.02316}
}

License

MIT — see LICENSE. Upstream benchmark repos have their own licenses when you integrate them.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.github/workflows		.github/workflows
.idea		.idea
.miniben		.miniben
.streamlit		.streamlit
docs		docs
src		src
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

miniBen — Mini LLM Benchmarks

Table of contents

How it works

Requirements

Installation

API key

Quick start

Streamlit UI (web app)

Command line (example.py)

Full benchmark (5 API calls)

Benchmarks

Cognitive flexibility — openings

Creativity — word triplets

API reference

AIModel (model.py)

run_benchmark (runner.py)

Prompt helpers (prompts.py)

Parsers (parsers.py)

Scorers (scorers.py)

Auth (auth.py)

Registry (runner.py)

Return values

AIModel.ask()

run_benchmark()

Aggregate score (benchmark-specific)

Project layout

Development

Tests

Add a benchmark

Models

Troubleshooting

Citations

This package

Benchmark sources

License

Links

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Command line (`example.py`)

`AIModel` (`model.py`)

`run_benchmark` (`runner.py`)

Prompt helpers (`prompts.py`)

Parsers (`parsers.py`)

Scorers (`scorers.py`)

Auth (`auth.py`)

Registry (`runner.py`)

`AIModel.ask()`

`run_benchmark()`

Aggregate `score` (benchmark-specific)

Packages