0.4.1 (2025-08-29)

Bug Fixes

  • rootly_gmcq: handle both string and list content types in scorer (#129) (376624d)

0.4.0 (2025-08-28)

Features

  • add boolq (#70) (edbd1cc)
  • add BrowseComp (#118) (498c706)
  • add CITATION.cff for software citation (#102) (16960de)
  • add CTI-Bench cybersecurity benchmark suite (#96) (8465075)
  • add GitHub issue and PR templates (#103) (68f0ef0)
  • add gmcq (#114) (bb3c89d)
  • add MuSR variants and grouped metrics (#107) (10ae935)
  • add robust answer extraction scorers from gpt-oss to MathArena benchmarks and gpqa_diamond (#97) (251ba66)
  • add Vercel AI Gateway inference provider (#98) (38e211a)
  • jsonschemabench (#95) (e3d842d)
  • mmmu: added support for mmmu benchmark and all of its subdomains (#121) (801bceb)

Bug Fixes

  • format mmlu_pro.py dataset file (2a9ee65)
  • handle skipped integration tests in CI (#120) (dae9378)
  • hle: added multimodal support for hle (#128) (8c3f212)
  • jsonschemaeval: match paper methodology and add openai subset (#113) (1b6470b)
  • make claude-code-review job optional to prevent PR blocking (#100) (6aad080)

Documentation

  • emphasize pre-commit hooks installation requirement (#106) (e765464)
  • refresh CONTRIBUTING.md and update README references (#105) (bf66747)
  • update installation instructions and clarify dependency architecture in CLAUDE.md and CONTRIBUTING.md (#126) (cd962fd)
  • update README citation to match CITATION.cff (#104) (6219e8c)

CI

  • add automated PyPI publishing to release workflow (#99) (eddbf70)

0.3.0 (2025-08-14)

Features

  • add --debug flag to eval-retry command (b26afaa)
  • add -M and -T flags for model and task arguments (#75) (46a6ba6)
  • add ‘openbench’ as alternative CLI entry point (#48) (68b3c5b)
  • add AI21 Labs inference provider (#86) (db7bde7)
  • add Baseten inference provider (#79) (696e2aa)
  • add Cerebras and SambaNova model providers (1c61f59)
  • add Cohere inference provider (#90) (8e6e838)
  • add Crusoe inference provider (#84) (3d0c794)
  • add DeepInfra inference provider (#85) (6fedf53)
  • add Friendli inference provider (#88) (7e2b258)
  • Add huggingface inference provider (#54) (f479703)
  • add Hyperbolic inference provider (#80) (4ebf723)
  • add initial GraphWalks benchmark implementation (#58) (1aefd07)
  • add Lambda AI inference provider (#81) (b78c346)
  • add MiniMax inference provider (#87) (09fd27b)
  • add Moonshot inference provider (#91) (e5743cb)
  • add Nebius model provider (#47) (ba2ec19)
  • add Nous Research model provider (#49) (32dd815)
  • add Novita AI inference provider (#82) (6f5874a)
  • add Parasail inference provider (#83) (973c7b3)
  • add Reka inference provider (#89) (1ab9c53)
  • add SciCode (#63) (3650bfa)
  • add support for alpha benchmarks in evaluation commands (#92) (e2ccfaa)
  • push eval data to huggingface repo (#65) (acc600f)

Bug Fixes

  • add missing newline at end of novita.py (ef0fa4b)
  • remove default sampling parameters from CLI (#72) (978638a)

Documentation

  • docs for 0.3.0 (#93) (fe358bb)
  • fix directory structure documentation in CONTRIBUTING.md (#78) (41f8ed9)

Chores

  • fix GraphWalks: Split into three separate benchmarks (#76) (d1ed96e)
  • update version (8b7bbe7)

Refactor

  • move task loading from registry to config and update imports (de6eea2)

CI

  • Enhance Claude code review workflow with updated prompts and model specification (#71) (b605ed2)

0.2.0 (2025-08-11)

Features

  • add DROP (simple-evals) (#20) (f85bf19)
  • add Humanity’s Last Exam (HLE) benchmark (#23) (6f10fb7)
  • add MATH and MATH-500 benchmarks for mathematical problem solving (#22) (9c6843b)
  • add MGSM (#18) (bec1a7c)
  • add openai MRCR benchmark for long context recall (#24) (1b09ebd)
  • HealthBench (#16) (2caa47d)

Documentation

  • update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)

Chores

  • GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)

0.1.1 (2025-07-31)

Bug Fixes

  • add missing __init__.py files and fix package discovery for PyPI (#10) (29fcdf6)

Documentation

  • update README to streamline setup instructions for OpenBench, use pypi (16e08a0)

0.1.0 (2025-07-31)

Chores

  • ci: update release-please workflow to allow label management (b70db16)
  • drop versions for release (58ce995)
  • GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (555658a)
  • update project metadata for version 0.1.0, add license, readme, and repository links (9ea2102)