## Overview
openbench supports a plugin system via Python entry points, allowing you to:

- Distribute custom benchmarks as standalone Python packages
- Override built-in benchmarks with patched or enhanced versions
- Share benchmarks across teams without modifying openbench source
- Version control benchmark implementations independently
Plugin benchmarks are standard Python packages that register themselves through pyproject.toml entry points. Once installed, they appear in `bench list` and work seamlessly with all CLI commands.
## Quick Start
### 1. Create Your Package Structure
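A typical layout (the package and module names here are illustrative):

```text
my-benchmark-package/
├── pyproject.toml
└── src/
    └── my_benchmarks/
        ├── __init__.py
        ├── my_eval.py      # @task definition
        └── metadata.py     # metadata entry point
```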
### 2. Define Your Benchmark
`src/my_benchmarks/my_eval.py`:
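A minimal sketch of this module, assuming the Inspect AI `Task`/`@task` API that openbench builds on; the dataset, solver, and scorer are placeholders:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def my_eval() -> Task:
    """A toy benchmark with a single hard-coded sample."""
    return Task(
        dataset=MemoryDataset([Sample(input="What is 2 + 2?", target="4")]),
        solver=generate(),
        scorer=exact(),
    )
```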
### 3. Create Benchmark Metadata
`src/my_benchmarks/metadata.py`:
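A sketch of the metadata function; the `BenchmarkMetadata` import path and exact field names are assumptions here — verify them against your installed openbench version:

```python
from openbench.config import BenchmarkMetadata  # import path assumed

def get_metadata() -> BenchmarkMetadata:
    return BenchmarkMetadata(
        name="My Eval",
        description="A toy benchmark that checks basic arithmetic.",
        category="community",
        tags=["demo", "arithmetic"],
        module_path="my_benchmarks.my_eval",  # module containing the task
        function_name="my_eval",              # the @task-decorated function
    )
```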
### 4. Register Entry Point
`pyproject.toml`:
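Point the entry point at the metadata function. The entry point group name shown below is an assumption — use whatever group openbench documents:

```toml
[project.entry-points."openbench.benchmarks"]
my-eval = "my_benchmarks.metadata:get_metadata"
```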
### 5. Install and Use
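Install the package and the benchmark becomes visible to the CLI; the model argument below is illustrative:

```bash
pip install -e .
bench list                 # my-eval now appears alongside built-ins
bench describe my-eval
bench eval my-eval --model groq/llama-3.1-8b-instant
```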
## Entry Point Specifications
### Single Benchmark Registration
Register one benchmark with the entry point name:

`pyproject.toml`:
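For example (entry point group name assumed, as above):

```toml
[project.entry-points."openbench.benchmarks"]
my-benchmark = "my_pkg.metadata:get_metadata"
```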
`my_pkg/metadata.py`:
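A matching metadata function might look like this sketch (field names assumed, as above):

```python
from openbench.config import BenchmarkMetadata  # import path assumed

def get_metadata() -> BenchmarkMetadata:
    # A single BenchmarkMetadata registers one benchmark under the
    # entry point name ("my-benchmark" above).
    return BenchmarkMetadata(
        name="My Benchmark",
        description="What this benchmark evaluates.",
        category="community",
        tags=["example"],
        module_path="my_pkg.evals",
        function_name="my_benchmark",
    )
```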
Entry points should reference a callable function that returns `BenchmarkMetadata` or `dict[str, BenchmarkMetadata]`. The function is called automatically when entry points are loaded.

### Multiple Benchmarks Registration
Register multiple benchmarks from one entry point:

`pyproject.toml`:
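For example (group name assumed, as above):

```toml
[project.entry-points."openbench.benchmarks"]
my-suite = "my_pkg.metadata:get_all_benchmarks"
```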
`my_pkg/metadata.py`:
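A sketch of a function returning several benchmarks keyed by name:

```python
from openbench.config import BenchmarkMetadata  # import path assumed

def get_all_benchmarks() -> dict[str, BenchmarkMetadata]:
    # Dictionary keys become the benchmark names, so one entry
    # point can register a whole suite.
    return {
        "my-eval-easy": BenchmarkMetadata(
            name="My Eval (Easy)",
            description="Easy split of the suite.",
            category="community",
            tags=["demo"],
            module_path="my_pkg.evals.easy",
            function_name="my_eval_easy",
        ),
        "my-eval-hard": BenchmarkMetadata(
            name="My Eval (Hard)",
            description="Hard split of the suite.",
            category="community",
            tags=["demo"],
            module_path="my_pkg.evals.hard",
            function_name="my_eval_hard",
        ),
    }
```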
## BenchmarkMetadata Fields
- `name`: Human-readable display name shown in `bench list` and `bench describe`
- `description`: Detailed description of what the benchmark evaluates
- `category`: Category for grouping. Common values: `"core"`, `"community"`, `"math"`, `"cybersecurity"`
- `tags`: Tags for searchability (e.g., `["multiple-choice", "reasoning", "knowledge"]`)
- `module_path`: Python import path to the module containing the task function (e.g., `"my_pkg.evals.mmlu"`)
- `function_name`: Name of the `@task`-decorated function to load (e.g., `"mmlu"`)
- `is_alpha`: Marks the benchmark as experimental/alpha; it then requires the `--alpha` flag to run
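Putting the fields together, a fully populated sketch (field names as listed above; verify against your installed version):

```python
from openbench.config import BenchmarkMetadata  # import path assumed

metadata = BenchmarkMetadata(
    name="My Eval",
    description="Evaluates multi-step arithmetic reasoning.",
    category="community",
    tags=["multiple-choice", "reasoning"],
    module_path="my_pkg.evals.my_eval",
    function_name="my_eval",
    is_alpha=True,  # requires the --alpha flag to run
)
```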
## Overriding Built-in Benchmarks

Entry points are merged after built-in benchmarks, allowing you to override them. This is useful for:

- Fixing dataset bugs discovered in production
- Adding custom splits or subsets
- Swapping scoring implementations (e.g., using a different grader model)
- Patching behavior without waiting for upstream fixes
### Example: Override MMLU with Custom Version
`pyproject.toml`:
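Name the entry point exactly `mmlu` so it shadows the built-in (group name assumed, as before):

```toml
[project.entry-points."openbench.benchmarks"]
mmlu = "my_pkg.custom_mmlu:get_metadata"
```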
`my_pkg/custom_mmlu.py`:
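A sketch of the override module; the one-sample dataset stands in for your patched MMLU data, and the `BenchmarkMetadata` import path and field names are assumptions as above:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

from openbench.config import BenchmarkMetadata  # import path assumed

@task
def mmlu() -> Task:
    # Placeholder -- load your patched MMLU dataset here.
    dataset = MemoryDataset(
        [Sample(input="Question?", choices=["A", "B", "C", "D"], target="A")]
    )
    return Task(dataset=dataset, solver=multiple_choice(), scorer=choice())

def get_metadata() -> BenchmarkMetadata:
    return BenchmarkMetadata(
        name="MMLU (Custom)",
        description="MMLU with a patched dataset.",
        category="community",
        tags=["multiple-choice", "knowledge"],
        module_path="my_pkg.custom_mmlu",
        function_name="mmlu",
    )
```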
Now `bench eval mmlu` will use your custom version.
## Troubleshooting
### Benchmark Not Appearing
If your benchmark doesn’t show up in `bench list`:
- Verify installation: `pip list | grep my-benchmark-package`
- Check entry point registration (see the snippet after this list)
- Check for errors: look for warnings in CLI output
- Verify the metadata function: ensure it returns `BenchmarkMetadata` or `dict[str, BenchmarkMetadata]`
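To inspect what is registered, you can query `importlib.metadata` directly (Python 3.10+; group name assumed, as above):

```python
from importlib.metadata import entry_points

# List everything registered under the (assumed) openbench group.
for ep in entry_points(group="openbench.benchmarks"):
    print(ep.name, "->", ep.value)
```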
### Import Errors
If you see `ModuleNotFoundError`:
- Ensure `module_path` in `BenchmarkMetadata` is correct
- Check that your package is properly installed
- Verify that the task function exists and is decorated with `@task`
### Override Not Working
If your override isn’t taking effect:

- Entry points are loaded at package import time
- Restart your Python session or reinstall the package
- Check that the entry point name exactly matches the built-in name