0.4.1 (2025-08-29)
Bug Fixes
0.4.0 (2025-08-28)
Features
- add boolq (#70) (edbd1cc)
- add BrowseComp (#118) (498c706)
- add CITATION.cff for software citation (#102) (16960de)
- add CTI-Bench cybersecurity benchmark suite (#96) (8465075)
- add GitHub issue and PR templates (#103) (68f0ef0)
- add gmcq (#114) (bb3c89d)
- add MuSR variants and grouped metrics (#107) (10ae935)
- add robust answer extraction scorers from gpt-oss to MathArena benchmarks and gpqa_diamond (#97) (251ba66)
- add Vercel AI Gateway inference provider (#98) (38e211a)
- jsonschemabench (#95) (e3d842d)
- mmmu: added support for mmmu benchmark and all of its subdomains (#121) (801bceb)
Bug Fixes
- format mmlu_pro.py dataset file (2a9ee65)
- handle skipped integration tests in CI (#120) (dae9378)
- hle: added multimodal support for hle (#128) (8c3f212)
- jsonschemaeval: match paper methodology and add openai subset (#113) (1b6470b)
- make claude-code-review job optional to prevent PR blocking (#100) (6aad080)
Documentation
- emphasize pre-commit hooks installation requirement (#106) (e765464)
- refresh CONTRIBUTING.md and update README references (#105) (bf66747)
- update installation instructions and clarify dependency architecture in CLAUDE.md and CONTRIBUTING.md (#126) (cd962fd)
- update README citation to match CITATION.cff (#104) (6219e8c)
Chores
- bump Inspect-AI to 0.3.125 (#124) (d728cbb)
- unpin dependencies except inspect-ai (#108) (50cf90f)
- update uv.lock package version (3583d71)
CI
0.3.0 (2025-08-14)
Features
- add --debug flag to eval-retry command (b26afaa)
- add -M and -T flags for model and task arguments (#75) (46a6ba6)
- add ‘openbench’ as alternative CLI entry point (#48) (68b3c5b)
- add AI21 Labs inference provider (#86) (db7bde7)
- add Baseten inference provider (#79) (696e2aa)
- add Cerebras and SambaNova model providers (1c61f59)
- add Cohere inference provider (#90) (8e6e838)
- add Crusoe inference provider (#84) (3d0c794)
- add DeepInfra inference provider (#85) (6fedf53)
- add Friendli inference provider (#88) (7e2b258)
- add Hugging Face inference provider (#54) (f479703)
- add Hyperbolic inference provider (#80) (4ebf723)
- add initial GraphWalks benchmark implementation (#58) (1aefd07)
- add Lambda AI inference provider (#81) (b78c346)
- add MiniMax inference provider (#87) (09fd27b)
- add Moonshot inference provider (#91) (e5743cb)
- add Nebius model provider (#47) (ba2ec19)
- add Nous Research model provider (#49) (32dd815)
- add Novita AI inference provider (#82) (6f5874a)
- add Parasail inference provider (#83) (973c7b3)
- add Reka inference provider (#89) (1ab9c53)
- add SciCode (#63) (3650bfa)
- add support for alpha benchmarks in evaluation commands (#92) (e2ccfaa)
- push eval data to Hugging Face repo (#65) (acc600f)
Bug Fixes
- add missing newline at end of novita.py (ef0fa4b)
- remove default sampling parameters from CLI (#72) (978638a)
Documentation
- docs for 0.3.0 (#93) (fe358bb)
- fix directory structure documentation in CONTRIBUTING.md (#78) (41f8ed9)
Chores
Refactor
- move task loading from registry to config and update imports (de6eea2)
CI
0.2.0 (2025-08-11)
Features
- add DROP (simple-evals) (#20) (f85bf19)
- add Humanity’s Last Exam (HLE) benchmark (#23) (6f10fb7)
- add MATH and MATH-500 benchmarks for mathematical problem solving (#22) (9c6843b)
- add MGSM (#18) (bec1a7c)
- add OpenAI MRCR benchmark for long context recall (#24) (1b09ebd)
- HealthBench (#16) (2caa47d)
Documentation
- update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)
Chores
- GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)
0.1.1 (2025-07-31)
Bug Fixes
Documentation
- update README to streamline setup instructions for OpenBench, use PyPI (16e08a0)
0.1.0 (2025-07-31)
Features
- openbench (3265bb0)