Quick Tips

- Start Small: Always test with --limit 10 before running a full benchmark (several of these tips are combined in the example after this list).
- Use Model & Task Flags: Use -M for model-specific arguments and -T for benchmark-specific arguments.
- Debug Mode: Use --debug for full stack traces when troubleshooting.
- Detailed Breakdown: Use bench view for a detailed sample-by-sample evaluation breakdown.
- Global Help: Use --help on any command to see all available options.
- Use Groq for Testing: The free tier offers fast inference, which is perfect for development.
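
Putting a few of these tips together (a sketch; the mmlu benchmark and the groq/llama-3.3-70b model are borrowed from examples later on this page and are purely illustrative):

# Quick smoke test: 10 samples, full stack traces, Groq's free tier
bench eval mmlu --model groq/llama-3.3-70b --limit 10 --debug

# Inspect the results sample by sample
bench view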

Common Issues & Solutions

If the bench command isn't found, the package may not be properly installed; try:
# Reinstall with pip
pip install --upgrade openbench

# Or if using UV
uv sync --dev
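
After reinstalling, it's worth confirming that the executable resolves (a quick check, using the Global Help tip above):

# Verify bench is on your PATH and responds
which bench
bench --help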
Environment variables must be exported so that bench can see them:

export BENCH_MODEL="groq/llama-3.3-70b"  # ✓ Correct
BENCH_MODEL="groq/llama-3.3-70b"         # ✗ Wrong: not exported, so invisible to bench

Remember: command-line arguments override environment variables:
export BENCH_MODEL="model-a"
bench eval mmlu --model model-b  # Uses model-b, not model-a
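
To see what bench will actually inherit, inspect the shell environment directly (plain shell, nothing openbench-specific):

# Only exported variables appear in a child process's environment
env | grep BENCH_MODEL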
The reasoning_effort parameter is now a first-class CLI flag (--reasoning-effort) rather than a -M pass-through:
# Correct (for models that support reasoning effort)
bench eval simpleqa --model openai/o3-2025-04-16 --reasoning-effort high

# Deprecated example
bench eval simpleqa --model openai/o3-2025-04-16 -M reasoning_effort=high
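If --reasoning-effort is rejected, your installed version may predate the flag; per the Global Help tip, the command's help text lists what it supports:

# List every option bench eval accepts on your installed version
bench eval --help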

Runtime Errors

| Error | Cause | Solution |
| --- | --- | --- |
| API key not found | Missing credentials | Set OPENAI_API_KEY or the relevant env var |
| Rate limit exceeded | Too many parallel requests | Reduce --max-connections |
| Model not found | Invalid model name | Check the provider's documentation |
| Timeout | Slow model responses | Increase --timeout |
| Out of memory | Large benchmark/batch | Use --limit to reduce size |
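
The throughput-related errors above can usually be fixed by tuning the same run (a sketch; the flag values are illustrative starting points, not recommendations):

# Fewer parallel requests, a longer per-request timeout, and a smaller run
bench eval mmlu --model groq/llama-3.3-70b --max-connections 5 --timeout 300 --limit 100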

Still Need Help?

GitHub Issues: Report bugs or ask questions