Changelog - openbench

0.4.1 (2025-08-29)

Bug Fixes

rootly_gmcq: handle both string and list content types in scorer (#129) (376624d)
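To illustrate the kind of fix described above, here is a minimal sketch of normalizing message content that may arrive either as a plain string or as a list of parts. This is not the openbench scorer itself: the function name `content_to_text` and the assumed part shapes (a dict with a `"text"` key, or an object with a `.text` attribute) are hypothetical and chosen only for illustration.

```python
from typing import Any, List, Union


def content_to_text(content: Union[str, List[Any]]) -> str:
    """Flatten content that is either a string or a list of content parts."""
    if isinstance(content, str):
        # Plain string content needs no conversion.
        return content

    pieces: List[str] = []
    for part in content:
        if isinstance(part, str):
            pieces.append(part)
        elif isinstance(part, dict):
            # Assumed (hypothetical) shape: {"type": "text", "text": "..."}
            text = part.get("text")
            if isinstance(text, str):
                pieces.append(text)
        else:
            # Assumed (hypothetical) shape: an object exposing a .text attribute
            text = getattr(part, "text", None)
            if isinstance(text, str):
                pieces.append(text)
    # Parts without text (for example images) are skipped.
    return "".join(pieces)
```

A scorer written this way can compare the extracted answer against the target without caring which content shape the model provider returned.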

0.4.0 (2025-08-28)

Features

add boolq (#70) (edbd1cc)
add BrowseComp (#118) (498c706)
add CITATION.cff for software citation (#102) (16960de)
add CTI-Bench cybersecurity benchmark suite (#96) (8465075)
add GitHub issue and PR templates (#103) (68f0ef0)
add gmcq (#114) (bb3c89d)
add MuSR variants and grouped metrics (#107) (10ae935)
add robust answer extraction scorers from gpt-oss to MathArena benchmarks and gpqa_diamond (#97) (251ba66)
add Vercel AI Gateway inference provider (#98) (38e211a)
jsonschemabench (#95) (e3d842d)
mmmu: added support for mmmu benchmark and all of its subdomains (#121) (801bceb)

Bug Fixes

format mmlu_pro.py dataset file (2a9ee65)
handle skipped integration tests in CI (#120) (dae9378)
hle: added multimodal support for hle (#128) (8c3f212)
jsonschemaeval: match paper methodology and add openai subset (#113) (1b6470b)
make claude-code-review job optional to prevent PR blocking (#100) (6aad080)

Documentation

emphasize pre-commit hooks installation requirement (#106) (e765464)
refresh CONTRIBUTING.md and update README references (#105) (bf66747)
update installation instructions and clarify dependency architecture in CLAUDE.md and CONTRIBUTING.md (#126) (cd962fd)
update README citation to match CITATION.cff (#104) (6219e8c)

Chores

bump Inspect-AI to 0.3.125 (#124) (d728cbb)
unpin dependencies except inspect-ai (#108) (50cf90f)
update uv.lock package version (3583d71)

CI

add automated PyPI publishing to release workflow (#99) (eddbf70)

0.3.0 (2025-08-14)

Features

add --debug flag to eval-retry command (b26afaa)
add -M and -T flags for model and task arguments (#75) (46a6ba6)
add ‘openbench’ as alternative CLI entry point (#48) (68b3c5b)
add AI21 Labs inference provider (#86) (db7bde7)
add Baseten inference provider (#79) (696e2aa)
add Cerebras and SambaNova model providers (1c61f59)
add Cohere inference provider (#90) (8e6e838)
add Crusoe inference provider (#84) (3d0c794)
add DeepInfra inference provider (#85) (6fedf53)
add Friendli inference provider (#88) (7e2b258)
Add huggingface inference provider (#54) (f479703)
add Hyperbolic inference provider (#80) (4ebf723)
add initial GraphWalks benchmark implementation (#58) (1aefd07)
add Lambda AI inference provider (#81) (b78c346)
add MiniMax inference provider (#87) (09fd27b)
add Moonshot inference provider (#91) (e5743cb)
add Nebius model provider (#47) (ba2ec19)
add Nous Research model provider (#49) (32dd815)
add Novita AI inference provider (#82) (6f5874a)
add Parasail inference provider (#83) (973c7b3)
add Reka inference provider (#89) (1ab9c53)
add SciCode (#63) (3650bfa)
add support for alpha benchmarks in evaluation commands (#92) (e2ccfaa)
push eval data to huggingface repo (#65) (acc600f)

Bug Fixes

add missing newline at end of novita.py (ef0fa4b)
remove default sampling parameters from CLI (#72) (978638a)

Documentation

docs for 0.3.0 (#93) (fe358bb)
fix directory structure documentation in CONTRIBUTING.md (#78) (41f8ed9)

Chores

fix GraphWalks: Split into three separate benchmarks (#76) (d1ed96e)
update version (8b7bbe7)

Refactor

move task loading from registry to config and update imports (de6eea2)

CI

Enhance Claude code review workflow with updated prompts and model specification (#71) (b605ed2)

0.2.0 (2025-08-11)

Features

add DROP (simple-evals) (#20) (f85bf19)
add Humanity’s Last Exam (HLE) benchmark (#23) (6f10fb7)
add MATH and MATH-500 benchmarks for mathematical problem solving (#22) (9c6843b)
add MGSM (#18) (bec1a7c)
add openai MRCR benchmark for long context recall (#24) (1b09ebd)
HealthBench (#16) (2caa47d)

Documentation

update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)

Chores

GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)

0.1.1 (2025-07-31)

Bug Fixes

add missing __init__.py files and fix package discovery for PyPI (#10) (29fcdf6)
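Package discovery problems of this kind often come from directories that contain Python modules but lack an `__init__.py` marker file. The sketch below shows one way to spot such directories. It is an illustrative helper, not part of openbench: the name `dirs_missing_init` is hypothetical, and it assumes a conventional (non-namespace) package layout.

```python
from pathlib import Path
from typing import List


def dirs_missing_init(package_root: Path) -> List[Path]:
    """Return directories under package_root that hold .py files but no __init__.py."""
    missing: List[Path] = []
    for directory in [package_root, *package_root.rglob("*")]:
        if not directory.is_dir() or directory.name == "__pycache__":
            continue
        # Only directories that actually contain Python modules are of interest.
        has_py = any(p.is_file() and p.suffix == ".py" for p in directory.iterdir())
        if has_py and not (directory / "__init__.py").exists():
            missing.append(directory)
    return sorted(missing)
```

Running such a check against the source tree before building a distribution can catch subpackages that would otherwise be left out of the published package.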

Documentation

update README to streamline setup instructions for OpenBench, use pypi (16e08a0)

0.1.0 (2025-07-31)

Features

openbench (3265bb0)

Chores

ci: update release-please workflow to allow label management (b70db16)
drop versions for release (58ce995)
GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (555658a)
update project metadata for version 0.1.0, add license, readme, and repository links (9ea2102)
