Overview
OpenBench 0.5.0 is our biggest release yet. We added 350+ evaluations, partnered with the ARC Prize Foundation to add ARC-AGI, introduced a plugin system for external benchmarks, improved provider routing (OpenRouter) and Groq support, shipped coding harnesses you can mix and match (Exercism alpha), added a tool-calling benchmark (LiveMCPBench), and made a raft of developer-experience upgrades.
Highlights
- ARC-AGI (with ARC Prize Foundation) — non-linguistic, compositional reasoning
- 350+ new/expanded evals: Global-MMLU (42 languages), GLUE/SuperGLUE, BLiMP, AGIEval, Arabic Exams (41 subsets), BBH (18 tasks), and more
- Plugins for external benchmarks (dash/underscore-insensitive; override built-ins)
- Coding (alpha): Exercism multi-language problems + multiple harnesses (aider/roo/claude/opencode)
- Tool-calling via LiveMCPBench (evaluate reliable structured tool use)
- Provider routing (OpenRouter); Groq provider with reasoning support + rich telemetry
- DevX: JSON logs, HF Hub export, cache/view commands, improved results panels, and a `run_eval` API that returns logs
- Benchmarks Catalog → /benchmarks/catalog
- Extending OpenBench (plugins) → /development/extending
- CLI Overview → /cli/overview
- Providers → /providers
ARC-AGI (with ARC Prize Foundation)
ARC-AGI focuses on “fluid intelligence”: abstraction, transformation, and rule discovery beyond textual prior knowledge.
Why it matters
- Complements text benchmarks (MMLU/BBH) with non-linguistic reasoning
- Helps surface true generalization capability
- Deterministic seeds and clear scoring for reproducibility
- Iterate quickly with `--limit` and control generation with `--temperature`, `--top-p`, and `--seed` (see the run sketch after this list)
- Export JSON logs with `--log-format json` for downstream analysis
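A minimal run sketch, assuming the ARC-AGI task is registered as `arc_agi` (check /benchmarks/catalog for the exact name) and that `bench eval` is the eval entry point described in /cli/overview:

```bash
# Small, reproducible ARC-AGI run: capped samples, fixed seed, greedy decoding
bench eval arc_agi \
  --model groq/openai/gpt-oss-20b \
  --limit 20 \
  --temperature 0.0 \
  --seed 42 \
  --log-format json
```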
350+ New/Expanded Evaluations
Big expansion across breadth and languages:
- Global-MMLU (42 languages) + composite task
- GLUE/SuperGLUE MCQ tasks (COPA, RTE, WiC, WSC, CB, MultiRC); BoolQ remains separate
- BLiMP (68 linguistic tasks); AGIEval (18 tasks)
- Arabic Exams (41 subsets)
- Reading comprehension: RACE, QA4MRE, QASPER, DROP
- Knowledge QA: TruthfulQA, LogiQA, SciQ, MathQA
- Code: Exercism (Python/JS/Go/Java/Rust), HumanEval, MBPP
- Multimodal: MMMU + MMMU Pro (MCQ/open/vision), MMMLU (multilingual MMLU), MMStar
- Math: MathArena (AIME/HMMT/BRUMO 2023→2025), MATH/MATH-500, MGSM; Otis Mock AIME 2024/2025
- BBH (18 tasks); additional BigBench suite tasks
Plugins: External Benchmarks (Override-Capable)
Register third-party benchmarks via Python entry points in your package’s pyproject.toml — and optionally override built-ins, no forks required.
- CLI is dash/underscore-insensitive (e.g., `mmlu-pro` == `mmlu_pro`); see the sketch after this list
- Entry points are merged after built-ins; keys differing only by `-` vs `_` are treated as the same; your entry point’s spelling wins
- Best for internal extensions, patched datasets, or alternative scoring
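A sketch of the normalization and override behavior from the CLI side; the plugin package and task names below are placeholders, and the entry-point format itself is documented in /development/extending:

```bash
# Dash and underscore spellings resolve to the same benchmark
bench eval mmlu_pro --model groq/openai/gpt-oss-20b --limit 5
bench eval mmlu-pro --model groq/openai/gpt-oss-20b --limit 5

# Hypothetical plugin package: once installed, its entry points are merged
# after the built-ins, so a matching normalized name overrides the built-in task
pip install my-openbench-plugin   # placeholder name, for illustration only
bench eval my_eval --model groq/openai/gpt-oss-20b
```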
Coding Benchmark (Alpha): Exercism + Harness Mixing
Exercism evaluates code agents on real problems with unit tests (multi-language): Python, JavaScript, Go, Java, Rust.
Harnesses: `aider`, `roo`, `claude`, `opencode`
- Mix and match harnesses with any model to find the best combo (see the sketch after this list)
- Dockerized execution and unit tests for each submission
- Rich logs; export JSON (`--log-format json`) and push to the Hub (`--hub-repo`)
- We’ll publish research soon on harness × model interactions
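A hedged run sketch, assuming the task is exposed as `exercism` (see /benchmarks/catalog for the exact identifier); harness selection is configured as described in the coding docs and is deliberately not guessed here:

```bash
# Each submission runs inside Docker against the exercise's unit tests;
# the harness (aider/roo/claude/opencode) is chosen per the coding docs
bench eval exercism \
  --model groq/openai/gpt-oss-20b \
  --log-format json \
  --hub-repo username/openbench-logs
```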
Tool-Calling: LiveMCPBench
Evaluate how reliably a model plans and orchestrates tools end-to-end.
Run it (see the sketch after this list):
- Concurrency: `--max-connections`
- Timeouts: `--timeout`
- Fast iteration: `--limit`
- Caching: OpenBench prepares embeddings/data before the run and cleans up afterward; keep the root with `--keep-livemcp-root`
- Logs: use `--log-format json` + `--hub-repo username/openbench-logs` to compare runs
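A minimal sketch combining the flags above, assuming the benchmark is registered as `livemcpbench` (check /benchmarks/catalog for the exact name):

```bash
# Modest concurrency, short timeout, and a small sample cap for fast iteration;
# --keep-livemcp-root preserves the prepared embeddings/data root after the run
bench eval livemcpbench \
  --model groq/openai/gpt-oss-20b \
  --max-connections 8 \
  --timeout 300 \
  --limit 10 \
  --keep-livemcp-root \
  --log-format json \
  --hub-repo username/openbench-logs
```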
Provider Routing & Groq Enhancements
OpenRouter provider
- Fine-grained routing controls: `only`, `order`, `allow_fallbacks`, `ignore`, `sort`, `max_price`, `quantizations`, `require_parameters`, `data_collection`
- Pass via model args; comma-separated values are accepted (see the sketch after this list)
Groq provider
- Adds `reasoning_effort`, request IDs, detailed usage/timing metadata, executed tools, and OpenBench user-agent tagging
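A sketch of passing routing controls and Groq options as model args; the `-M key=value` syntax and the specific values are assumptions (Inspect-style model args), so check /providers for the exact invocation:

```bash
# OpenRouter with routing controls passed as model args (comma-separated values
# are accepted); the -M syntax and provider names here are illustrative
bench eval mmlu_pro \
  --model openrouter/openai/gpt-4o-mini \
  -M order=groq,fireworks \
  -M allow_fallbacks=false \
  -M require_parameters=true

# Groq provider with reasoning support; the reasoning_effort value is illustrative
bench eval mmlu_pro \
  --model groq/openai/gpt-oss-20b \
  -M reasoning_effort=high
```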
Developer Experience
- JSON logs: `--log-format json` alongside `.eval` logs (see the workflow sketch after this list)
- Push results/stats/samples to the Hub: `--hub-repo` (see /cli/eval)
- Cache utilities: `bench cache info|ls|clear` for `~/.openbench` (see /cli/cache)
- `bench view` to browse logs (see /cli/view)
- Results panel: total time + sample duration (avg/p95/p50)
- Programmatic: `from openbench import run_eval` returns logs
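A short end-to-end sketch of the logging, cache, and viewing workflow using the commands above; the Hub repo name is illustrative:

```bash
# Run with JSON logs alongside the .eval logs and push results to the Hub
bench eval mmlu_pro \
  --model groq/openai/gpt-oss-20b \
  --log-format json \
  --hub-repo username/openbench-logs

# Inspect and manage the local cache under ~/.openbench
bench cache info
bench cache ls
bench cache clear

# Browse logs interactively
bench view
```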
MultiChallenge (Judge-Based)
- Strict judge model (structured YES/NO verdicts), robust parsing
- Aggregates per-axis pass rates and an overall score
- Supports truncating conversations via `max_turns` (see the run sketch below)
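A minimal run sketch, assuming the benchmark is registered as `multichallenge`; how the judge model and `max_turns` are supplied is documented in /cli/eval and is not guessed here:

```bash
# Judge-based scoring runs automatically; max_turns and the judge model are
# configured as benchmark options (see /cli/eval for the exact syntax)
bench eval multichallenge \
  --model groq/openai/gpt-oss-20b \
  --limit 10 \
  --log-format json
```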
Breaking & Behavior Changes
- Default model is now `groq/openai/gpt-oss-20b`. If you relied on an older default, set `--model` explicitly (see the sketch after this list)
- Roo code-agent requires OpenRouter model IDs (`openrouter/<vendor>/<model>`)
- Benchmark names are dash/underscore-insensitive; external plugins can override built-ins by normalized name
- BigBench/BBH are run as individual tasks; there is no aggregator command
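For the default-model change, the safest migration is to pin `--model` explicitly; the second command sketches the Roo requirement with an illustrative OpenRouter vendor/model:

```bash
# Pin the model instead of relying on the new default
bench eval mmlu_pro --model groq/openai/gpt-oss-20b

# Roo code-agent requires an OpenRouter-style model ID (openrouter/<vendor>/<model>);
# the vendor/model below is illustrative, and harness setup follows the coding docs
bench eval exercism --model openrouter/anthropic/claude-sonnet-4
```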