Overview

OpenBench 0.5.0 is our biggest release yet. We added or expanded 350+ evaluations, partnered with the ARC Prize Foundation to add ARC-AGI, and introduced a plugin system for external benchmarks. We also improved OpenRouter provider routing and Groq support, shipped coding harnesses you can mix and match (Exercism, alpha), added a tool-calling benchmark (LiveMCPBench), and made a range of developer-experience upgrades.
Highlights
  • ARC-AGI (with ARC Prize Foundation) — non-linguistic, compositional reasoning
  • 350+ new/expanded evals: Global-MMLU (42), GLUE/SuperGLUE, BLiMP, AGIEval, Arabic Exams (41), BBH (18), and more
  • Plugins for external benchmarks (dash/underscore-insensitive; override built-ins)
  • Coding (alpha): Exercism multi-language problems + multiple harnesses (aider/roo/claude/opencode)
  • Tool-calling via LiveMCPBench (evaluate reliable structured tool use)
  • Provider routing (OpenRouter); Groq provider with reasoning support + rich telemetry
  • DevX: JSON logs, HF Hub export, cache/view commands, improved results panels, run_eval returns logs
Links
  • Benchmarks Catalog → /benchmarks/catalog
  • Extending OpenBench (plugins) → /development/extending
  • CLI Overview → /cli/overview
  • Providers → /providers

ARC-AGI (with ARC Prize Foundation)

ARC-AGI focuses on “fluid intelligence”: abstraction, transformation, and rule discovery beyond textual prior knowledge.
Why it matters
  • Complements text benchmarks (MMLU/BBH) with non-linguistic reasoning
  • Helps surface true generalization capability
  • Deterministic seeds and clear scoring for reproducibility
How to run
bench eval arc_agi --model groq/llama-3.3-70b
bench eval arc_agi_1 --model openai/gpt-4o
bench eval arc_agi_2 --model openrouter/deepseek/deepseek-chat-v3.1
Tips
  • Iterate quickly with --limit and control generation with --temperature, --top-p, and --seed
  • Export JSON logs with --log-format json for downstream analysis
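To make the "clear scoring" point concrete: ARC-style tasks are grids of small integers, and a prediction is conventionally counted as correct only when it matches the expected output grid cell for cell. The sketch below illustrates that exact-match idea. The function name grid_exact_match is hypothetical, and this is not necessarily how OpenBench's own scorer is implemented.

```python
from typing import List

Grid = List[List[int]]  # rows of color indices, as in the public ARC task format


def grid_exact_match(predicted: Grid, expected: Grid) -> bool:
    """Return True only if both grids have the same shape and identical cells.

    Hypothetical helper for illustration; not part of the OpenBench API.
    """
    if len(predicted) != len(expected):
        return False
    # Comparing rows as lists also catches rows of different widths.
    return all(p_row == e_row for p_row, e_row in zip(predicted, expected))


expected = [[0, 1], [1, 0]]
print(grid_exact_match([[0, 1], [1, 0]], expected))  # True: every cell matches
print(grid_exact_match([[0, 1], [1, 1]], expected))  # False: one cell differs
print(grid_exact_match([[0, 1]], expected))          # False: wrong shape
```

There is no partial credit in this sketch: a grid that is wrong in a single cell scores the same as an empty one.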

350+ New/Expanded Evaluations

Big expansion across breadth and languages:
  • Global-MMLU (42 languages) + composite task
  • GLUE/SuperGLUE MCQ tasks (COPA, RTE, WiC, WSC, CB, MultiRC); BoolQ remains separate
  • BLiMP (68 linguistic tasks); AGIEval (18 tasks)
  • Arabic Exams (41 subsets)
  • Reading comprehension: RACE, QA4MRE, QASPER, DROP
  • Knowledge QA: TruthfulQA, LogiQA, SciQ, MathQA
  • Code: Exercism (Python/JS/Go/Java/Rust), HumanEval, MBPP
  • Multimodal: MMMU + MMMU Pro (MCQ/open/vision), MMMLU (multilingual MMLU), MMStar
  • Math: MathArena (AIME/HMMT/BRUMO 2023→2025), MATH/MATH-500, MGSM; Otis Mock AIME 2024/2025
  • BBH (18 tasks); additional BigBench suite tasks
Browse everything in the Catalog → /benchmarks/catalog

Plugins: External Benchmarks (Override-Capable)

Register third-party benchmarks via Python entry points and, optionally, override built-ins without forking.
pyproject.toml
[project.entry-points."openbench.benchmarks"]
my_benchmark = "my_pkg.bench:get_benchmark_metadata"
Metadata function
from openbench.utils import BenchmarkMetadata

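def get_benchmark_metadata():
    # Called by OpenBench through the entry point declared in pyproject.toml.
    return BenchmarkMetadata(
        name="My Bench",
        description="A custom benchmark",
        category="community",
        tags=["custom", "reasoning"],
        module_path="my_pkg.bench_impl",  # module that defines the benchmark
        function_name="my_benchmark",  # function to load from that module
    )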
Behavior
  • CLI is dash/underscore-insensitive (e.g., mmlu-pro == mmlu_pro)
  • Entry points are merged after built-ins. Keys that differ only by - vs _ are treated as the same benchmark, and your entry point’s spelling wins
  • Best for internal extensions, patched datasets, or alternative scoring
Learn more → /development/extending
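The merge behavior described above can be sketched in a few lines. This is a minimal illustration and not OpenBench's registry code: the helpers normalize and merge_registries are hypothetical names, the registries are plain dicts, and only the dash/underscore rule from these notes is modeled (case handling is left alone because the notes do not specify it).

```python
from typing import Dict


def normalize(name: str) -> str:
    """Treat dashes and underscores as equivalent (hypothetical helper)."""
    return name.replace("-", "_")


def merge_registries(builtins: Dict[str, str], plugins: Dict[str, str]) -> Dict[str, str]:
    """Merge plugin entries after built-ins.

    A plugin replaces any built-in whose normalized name matches its own.
    """
    merged = dict(builtins)
    for plugin_name, target in plugins.items():
        key = normalize(plugin_name)
        # Drop built-in spellings that collide with this plugin.
        for existing in [k for k in merged if normalize(k) == key]:
            del merged[existing]
        merged[plugin_name] = target  # the entry point's spelling wins
    return merged


builtins = {"mmlu_pro": "openbench built-in", "arc_agi": "openbench built-in"}
plugins = {"mmlu-pro": "my_pkg.bench:get_benchmark_metadata"}
merged = merge_registries(builtins, plugins)
print(sorted(merged))  # ['arc_agi', 'mmlu-pro']
```

Lookups would go through the same normalize step, which is why mmlu-pro and mmlu_pro resolve to the same benchmark on the CLI.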

Coding Benchmark (Alpha): Exercism + Harness Mixing

Exercism evaluates code agents on real problems with unit tests across multiple languages: Python, JavaScript, Go, Java, and Rust.
Harnesses
  • aider, roo, claude, opencode
  • Mix and match harnesses with any model to find the best combo
Examples
# Python with codex (default)
bench eval exercism_python --code-agent codex --model openai/gpt-5

# Go with Roo (requires OpenRouter model IDs)
bench eval exercism_go --code-agent roo \
  --model openrouter/anthropic/claude-sonnet-4-20250514
Under the hood
  • Dockerized execution and unit tests for each submission
  • Rich logs; export JSON (--log-format json) and push to Hub (--hub-repo)
  • We’ll publish research soon on harness × model interactions
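One way to explore harness and model combinations is to generate the bench eval commands for a small grid and run them one by one. The sketch below only builds and prints command strings using flags that appear in these notes (--code-agent, --model, --limit). The model list and the build_commands helper are illustrative assumptions, and the sketch does not check whether every pair works in practice.

```python
import itertools
import shlex

# Harness names from the list above; the model IDs are examples from these notes.
harnesses = ["aider", "roo", "claude", "opencode"]
models = [
    "openai/gpt-5",
    "openrouter/anthropic/claude-sonnet-4-20250514",
]


def build_commands(task: str, limit: int = 5):
    """Yield one `bench eval` command per compatible harness/model pair."""
    for harness, model in itertools.product(harnesses, models):
        # Roo requires OpenRouter model IDs, so skip other providers for it.
        if harness == "roo" and not model.startswith("openrouter/"):
            continue
        yield shlex.join([
            "bench", "eval", task,
            "--code-agent", harness,
            "--model", model,
            "--limit", str(limit),
        ])


for command in build_commands("exercism_python"):
    print(command)
```

Keeping --limit small makes a first pass over the grid cheap. Raise it once a promising combination shows up.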

Tool-Calling: LiveMCPBench

Evaluate how reliably a model plans and orchestrates tools end-to-end.
Run it
bench eval livemcpbench --model openai/gpt-4o
Tips
  • Concurrency: --max-connections
  • Timeouts: --timeout
  • Fast iteration: --limit
  • Caching: OpenBench prepares embeddings/data before the run and cleans up afterward; keep the root with --keep-livemcp-root
  • Logs: Use --log-format json + --hub-repo username/openbench-logs to compare runs

Provider Routing & Groq Enhancements

OpenRouter provider
  • Fine-grained routing controls: only, order, allow_fallbacks, ignore, sort, max_price, quantizations, require_parameters, data_collection
  • Pass via model args; comma-separated values are accepted
Example
bench eval mmlu \
  --model openrouter/openai/gpt-4o \
  -M only=groq,openai -M order=openai,groq -M allow_fallbacks=true \
  -M sort=price -M quantizations=int8
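To show what "comma-separated values are accepted" can mean in practice, here is a simplified parser that turns -M key=value pairs into a dictionary and splits comma-separated values into lists. It is an illustration only: parse_model_args is a hypothetical helper, not OpenBench's actual argument handling, and values such as true are left as plain strings.

```python
from typing import Dict, List, Union

Value = Union[str, List[str]]


def parse_model_args(pairs: List[str]) -> Dict[str, Value]:
    """Turn ["only=groq,openai", "sort=price"] into a dict.

    Comma-separated values become lists; single values stay strings.
    """
    parsed: Dict[str, Value] = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parts = [part.strip() for part in raw.split(",")]
        parsed[key] = parts if len(parts) > 1 else parts[0]
    return parsed


args = parse_model_args([
    "only=groq,openai",
    "order=openai,groq",
    "allow_fallbacks=true",
    "sort=price",
    "quantizations=int8",
])
print(args["order"])  # ['openai', 'groq']
print(args["sort"])   # price
```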
Groq provider
  • Adds reasoning_effort, request IDs, detailed usage/timing metadata, executed tools, and OpenBench user-agent tagging

Developer Experience

  • JSON logs: --log-format json alongside .eval logs
  • Push results/stats/samples to the Hub: --hub-repo (see /cli/eval)
  • Cache utilities: bench cache info|ls|clear for ~/.openbench (see /cli/cache)
  • bench view to browse logs (see /cli/view)
  • Results panel: total time + sample duration (avg/p95/p50)
  • Programmatic use: run_eval (from openbench import run_eval) returns the eval logs
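For readers who want to reproduce the avg/p95/p50 figures from their own logs, the sketch below computes them from a list of per-sample durations. The nearest-rank percentile method is an assumption, since these notes do not say which method the results panel uses. The function names and the sample data are hypothetical.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_durations(durations: List[float]) -> Dict[str, float]:
    """Compute avg/p50/p95 sample duration, in the same unit as the input."""
    if not durations:
        raise ValueError("need at least one sample duration")
    ordered = sorted(durations)
    return {
        "avg": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
    }


# Hypothetical per-sample durations in seconds.
summary = summarize_durations([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(summary)  # {'avg': 5.5, 'p50': 5.0, 'p95': 10.0}
```

A p95 far above the average usually points to a few slow samples, which are worth inspecting individually.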

MultiChallenge (Judge-Based)

  • Strict judge model (structured YES/NO verdicts), robust parsing
  • Aggregates per-axis pass rates and an overall score
  • Supports truncating conversations via max_turns
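The scoring flow above (parse a YES/NO verdict, compute per-axis pass rates, then an overall score) can be sketched as follows. Several details are assumptions rather than documented behavior: the last YES/NO token in the judge output is taken as the verdict, unparseable outputs count as failures, the overall score is the unweighted mean of the axis pass rates, and the axis names are placeholders.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True for YES, False for NO, None if no verdict is found."""
    matches = re.findall(r"\b(YES|NO)\b", judge_output.upper())
    if not matches:
        return None
    return matches[-1] == "YES"  # assumption: the last verdict wins


def aggregate(results: List[Tuple[str, str]]) -> Tuple[Dict[str, float], float]:
    """Compute per-axis pass rates and their unweighted mean."""
    per_axis: Dict[str, List[bool]] = defaultdict(list)
    for axis, judge_output in results:
        # Unparseable outputs (None) are counted as failures here.
        per_axis[axis].append(parse_verdict(judge_output) is True)
    rates = {axis: sum(passed) / len(passed) for axis, passed in per_axis.items()}
    overall = sum(rates.values()) / len(rates) if rates else 0.0
    return rates, overall


# Placeholder axis names and judge outputs.
results = [
    ("axis_a", "Verdict: YES"),
    ("axis_a", "NO"),
    ("axis_b", "yes"),
    ("axis_b", "YES."),
]
rates, overall = aggregate(results)
print(rates)    # {'axis_a': 0.5, 'axis_b': 1.0}
print(overall)  # 0.75
```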

Breaking & Behavior Changes

  • Default model is now groq/openai/gpt-oss-20b. If you relied on an older default, set --model explicitly
  • Roo code-agent requires OpenRouter model IDs (openrouter/<vendor>/<model>)
  • Benchmark names are dash/underscore-insensitive; external plugins can override built-ins by normalized name
  • BigBench/BBH are run as individual tasks; there is no aggregator command in this release

Quickstarts

ARC-AGI
bench eval arc_agi --model groq/llama-3.3-70b
Global-MMLU / Arabic Exams
bench eval global_mmlu_english --model openai/gpt-4o
bench eval arabic_exams_general_knowledge --model groq/llama-3.3-70b
Exercism (alpha)
bench eval exercism_python --code-agent codex --model openai/gpt-5
MultiChallenge
bench eval multichallenge --model openai/gpt-4o --limit 50
Push logs to Hub
bench eval mmlu --model groq/llama-3.3-70b --hub-repo username/openbench-logs

Thanks

Huge thanks to the ARC Prize Foundation and our community of contributors and partners. 0.5.0 aims to help you measure what matters, with clarity, breadth, and speed.
Star us on GitHub → https://github.com/groq/openbench