Overview
OpenBench 0.5.0 is our biggest release yet. We added 350+ evaluations, partnered with the ARC Prize Foundation to add ARC-AGI, introduced a plugin system for external benchmarks, improved provider routing (OpenRouter) and Groq support, shipped coding harnesses you can mix and match (Exercism alpha), added a tool-calling benchmark (LiveMCPBench), and made a raft of developer-experience upgrades.
Highlights
- ARC-AGI (with ARC Prize Foundation): non-linguistic, compositional reasoning
- 350+ new/expanded evals: Global-MMLU(42), GLUE/SuperGLUE, BLiMP, AGIEval, Arabic Exams(41), BBH(18), and more
- Plugins for external benchmarks (dash/underscore-insensitive; override built-ins)
- Coding (alpha): Exercism multi-language problems + multiple harnesses (aider/roo/claude/opencode)
- Tool-calling via LiveMCPBench (evaluate reliable structured tool use)
- Provider routing (OpenRouter); Groq provider with reasoning support + rich telemetry
- DevX: JSON logs, HF Hub export, cache/view commands, improved results panels, run_eval returns logs
- Benchmarks Catalog → /benchmarks/catalog
- Extending OpenBench (plugins) → /development/extending
- CLI Overview → /cli/overview
- Providers → /providers
ARC-AGI (with ARC Prize Foundation)
ARC-AGI focuses on “fluid intelligence”: abstraction, transformation, and rule discovery beyond textual prior knowledge.
Why it matters
- Complements text benchmarks (MMLU/BBH) with non-linguistic reasoning
- Helps surface true generalization capability
- Deterministic seeds and clear scoring for reproducibility
- Iterate quickly with --limit and control generation with --temperature, --top-p, and --seed
- Export JSON logs with --log-format json for downstream analysis
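To make the "clear scoring" point concrete: ARC tasks are conventionally graded by exact match between the predicted and expected output grids. The sketch below assumes grids are lists of lists of integers. The names `grid_exact_match` and `accuracy` are hypothetical and are not OpenBench's scorer.

```python
from typing import List

Grid = List[List[int]]


def grid_exact_match(predicted: Grid, expected: Grid) -> bool:
    """Return True only if both grids have the same shape and cell values."""
    return predicted == expected


def accuracy(predictions: List[Grid], targets: List[Grid]) -> float:
    """Fraction of tasks whose predicted grid matches the target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(grid_exact_match(p, t) for p, t in zip(predictions, targets))
    return correct / len(targets)


# Illustrative data: the first grid matches, the second does not.
predictions = [[[1, 0], [0, 1]], [[2]]]
targets = [[[1, 0], [0, 1]], [[3]]]
score = accuracy(predictions, targets)
```

Because the comparison is all-or-nothing per grid, the same predictions always give the same score.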
350+ New/Expanded Evaluations
Big expansion across breadth and languages:
- Global-MMLU (42 languages) + composite task
- GLUE/SuperGLUE MCQ tasks (COPA, RTE, WiC, WSC, CB, MultiRC); BoolQ remains separate
- BLiMP (68 linguistic tasks); AGIEval (18 tasks)
- Arabic Exams (41 subsets)
- Reading comprehension: RACE, QA4MRE, QASPER, DROP
- Knowledge QA: TruthfulQA, LogiQA, SciQ, MathQA
- Code: Exercism (Python/JS/Go/Java/Rust), HumanEval, MBPP
- Multimodal: MMMU + MMMU Pro (MCQ/open/vision), MMMLU (multilingual MMLU), MMStar
- Math: MathArena (AIME/HMMT/BRUMO 2023→2025), MATH/MATH-500, MGSM; Otis Mock AIME 2024/2025
- BBH (18 tasks); additional BigBench suite tasks
Plugins: External Benchmarks (Override-Capable)
Register third-party benchmarks via Python entry points declared in pyproject.toml, and optionally override built-ins; no forks required.
- The CLI is dash/underscore-insensitive (e.g., mmlu-pro == mmlu_pro)
- Entry points are merged after built-ins; keys differing only by - vs _ are treated as the same, and your entry point's spelling wins
- Best for internal extensions, patched datasets, or alternative scoring
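The merge rule above can be sketched in a few lines. This is a minimal illustration, not OpenBench's actual registry code. The function names, the dictionary-based registry, and the target strings are all assumptions made for the example.

```python
from typing import Dict


def normalize(name: str) -> str:
    """Treat dashes and underscores as the same character."""
    return name.replace("-", "_")


def merge_registries(builtins: Dict[str, str], plugins: Dict[str, str]) -> Dict[str, str]:
    """Merge plugin entries after built-ins; on a normalized-name clash the plugin wins."""
    merged: Dict[str, str] = dict(builtins)
    for name, target in plugins.items():
        key = normalize(name)
        # Remove any existing entry that differs only by '-' vs '_'.
        for existing in [k for k in merged if normalize(k) == key]:
            del merged[existing]
        # Keep the plugin's own spelling of the name.
        merged[name] = target
    return merged


def resolve(registry: Dict[str, str], requested: str) -> str:
    """Look up a benchmark by name, ignoring '-' vs '_' differences."""
    key = normalize(requested)
    for name, target in registry.items():
        if normalize(name) == key:
            return target
    raise KeyError(requested)


# Hypothetical targets, for illustration only.
builtins = {"mmlu_pro": "builtin:mmlu_pro"}
plugins = {"mmlu-pro": "my_pkg.evals:patched_mmlu_pro"}
registry = merge_registries(builtins, plugins)
```

With this registry, both `mmlu-pro` and `mmlu_pro` resolve to the plugin's target, and the stored key keeps the plugin's spelling.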
Coding Benchmark (Alpha): Exercism + Harness Mixing
Exercism evaluates code agents on real problems with unit tests across multiple languages: Python, JavaScript, Go, Java, and Rust.
Harnesses: aider, roo, claude, opencode
- Mix and match harnesses with any model to find the best combo
- Dockerized execution and unit tests for each submission
- Rich logs; export JSON (--log-format json) and push to Hub (--hub-repo)
- We'll publish research soon on harness × model interactions
Tool-Calling: LiveMCPBench
Evaluate how reliably a model plans and orchestrates tools end-to-end.
Run it
- Concurrency: --max-connections
- Timeouts: --timeout
- Fast iteration: --limit
- Caching: OpenBench prepares embeddings/data before the run and cleans up afterward; keep the root with --keep-livemcp-root
- Logs: Use --log-format json + --hub-repo username/openbench-logs to compare runs
Provider Routing & Groq Enhancements
OpenRouter provider
- Fine-grained routing controls: only, order, allow_fallbacks, ignore, sort, max_price, quantizations, require_parameters, data_collection
- Pass via model args; comma-separated values are accepted
Groq provider
- Adds reasoning_effort, request IDs, detailed usage/timing metadata, executed tools, and OpenBench user-agent tagging
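For the OpenRouter routing controls, the sketch below shows one way comma-separated model-arg values could be turned into a routing dictionary. It is an illustration under stated assumptions, not OpenBench's parser. The helper name `build_routing`, the choice of which keys are list-valued or boolean, and the example provider names are all hypothetical.

```python
from typing import Any, Dict

# Assumption: these controls take lists, so comma-separated strings are split.
LIST_KEYS = {"only", "order", "ignore", "quantizations"}
# Assumption: these controls are booleans passed as text.
BOOL_KEYS = {"allow_fallbacks", "require_parameters"}


def build_routing(model_args: Dict[str, str]) -> Dict[str, Any]:
    """Convert string-valued model args into typed routing controls."""
    routing: Dict[str, Any] = {}
    for key, raw in model_args.items():
        if key in LIST_KEYS:
            routing[key] = [item.strip() for item in raw.split(",") if item.strip()]
        elif key in BOOL_KEYS:
            routing[key] = raw.strip().lower() in {"1", "true", "yes"}
        else:
            # Other controls are passed through as trimmed strings in this sketch.
            routing[key] = raw.strip()
    return routing


# Example provider names are placeholders.
routing = build_routing(
    {"only": "groq, cerebras", "allow_fallbacks": "false", "sort": "price"}
)
```

Splitting on commas and trimming whitespace means `only=groq,cerebras` and `only="groq, cerebras"` are treated the same way here.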
Developer Experience
- JSON logs: --log-format json alongside .eval logs
- Push results/stats/samples to the Hub: --hub-repo (see /cli/eval)
- Cache utilities: bench cache info|ls|clear for ~/.openbench (see /cli/cache)
- bench view to browse logs (see /cli/view)
- Results panel: total time + sample duration (avg/p95/p50)
- Programmatic: from openbench import run_eval returns logs
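To show what the avg/p95/p50 figures in the results panel summarize, here is a small sketch that computes them from a list of per-sample durations. It assumes the nearest-rank percentile definition, which may differ from the method OpenBench uses. The function names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_durations(durations: List[float]) -> Dict[str, float]:
    """Return the average, median (p50), and p95 of sample durations in seconds."""
    return {
        "avg": mean(durations),
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
    }


# Illustrative durations in seconds, with one slow outlier.
summary = summarize_durations([1.0, 2.0, 3.0, 4.0, 10.0])
```

Comparing p50 with p95 makes slow outliers visible where the average alone would hide them.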
MultiChallenge (Judge-Based)
- Strict judge model (structured YES/NO verdicts), robust parsing
- Aggregates per-axis pass rates and an overall score
- Supports truncating conversations via max_turns
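The judge-and-aggregate flow can be sketched as follows. This is a minimal illustration, not OpenBench's scorer. It assumes the last YES/NO token in the judge output is the verdict, that unparseable outputs count as failures, and that the overall score is the unweighted mean of per-axis pass rates. The axis labels are placeholders.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

_VERDICT = re.compile(r"\b(YES|NO)\b", re.IGNORECASE)


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True for YES, False for NO, or None if no verdict token is found."""
    matches = _VERDICT.findall(judge_output)
    if not matches:
        return None
    # Assumption: the final verdict token is the judge's answer.
    return matches[-1].upper() == "YES"


def aggregate(results: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute per-axis pass rates plus an 'overall' mean across axes."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for axis, judge_output in results:
        total[axis] += 1
        if parse_verdict(judge_output) is True:  # unparseable counts as a fail
            passed[axis] += 1
    rates = {axis: passed[axis] / total[axis] for axis in total}
    rates["overall"] = sum(rates.values()) / len(rates) if rates else 0.0
    return rates


# Placeholder axes and judge outputs, for illustration only.
scores = aggregate([
    ("axis_a", "Verdict: YES"),
    ("axis_a", "NO"),
    ("axis_b", "yes, it passes"),
    ("axis_b", "???"),
])
```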
Breaking & Behavior Changes
- Default model is now groq/openai/gpt-oss-20b. If you relied on an older default, set --model explicitly
- Roo code-agent requires OpenRouter model IDs (openrouter/<vendor>/<model>)
- Benchmark names are dash/underscore-insensitive; external plugins can override built-ins by normalized name
- BigBench/BBH are run as individual tasks (no aggregator command in final state)
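If you script runs with the Roo harness, a pre-flight check on the model ID shape can catch mistakes early. The sketch below assumes the ID has exactly three slash-separated parts, as in openrouter/<vendor>/<model>. The function name is hypothetical, and the OpenRouter-prefixed ID is only an example string.

```python
def is_openrouter_model_id(model: str) -> bool:
    """Check that a model ID looks like openrouter/<vendor>/<model>."""
    parts = model.split("/")
    return len(parts) == 3 and parts[0] == "openrouter" and all(parts[1:])


# The new default model is a Groq ID, so it would not pass this check.
default_ok = is_openrouter_model_id("groq/openai/gpt-oss-20b")
example_ok = is_openrouter_model_id("openrouter/openai/gpt-oss-20b")
```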