Quick Tips

- Start Small: Always test with --limit 10 before running a full benchmark (several of these tips are combined in the example after this list).
- Use Model & Task Flags: Use -M for model-specific arguments and -T for benchmark-specific arguments.
- Debug Mode: Use --debug for full stack traces when troubleshooting.
- Detailed Breakdown: Use bench view for a detailed sample-by-sample evaluation breakdown.
- Global Help: Use --help on any command to see all available options.
- Use Groq for Testing: The free tier offers fast inference, which is perfect for development.
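
Putting a few of these tips together (a sketch; the mmlu benchmark and the groq/llama-3.3-70b model are borrowed from examples later on this page and are purely illustrative):

# Quick smoke test: 10 samples, full stack traces, Groq's free tier
bench eval mmlu --model groq/llama-3.3-70b --limit 10 --debug

# Inspect the results sample by sample
bench view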

Common Issues & Solutions

If the bench command isn't found, the package may not be properly installed; try:
# Reinstall with pip
pip install --upgrade openbench

# Or if using UV
uv sync --dev
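
After reinstalling, it's worth confirming that the executable resolves (a quick check, using the Global Help tip above):

# Verify bench is on your PATH and responds
which bench
bench --help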
Environment variables must be exported so that bench can see them:

export BENCH_MODEL="groq/llama-3.3-70b"  # ✓ Correct
BENCH_MODEL="groq/llama-3.3-70b"         # ✗ Wrong: not exported, so invisible to bench

Remember: command-line arguments override environment variables:
export BENCH_MODEL="model-a"
bench eval mmlu --model model-b  # Uses model-b, not model-a
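
To see what bench will actually inherit, inspect the shell environment directly (plain shell, nothing openbench-specific):

# Only exported variables appear in a child process's environment
env | grep BENCH_MODEL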
The reasoning_effort parameter is now a first-class CLI flag (--reasoning-effort) rather than a -M pass-through:
# Correct (for models that support reasoning effort)
bench eval simpleqa --model openai/o3-2025-04-16 --reasoning-effort high

# Deprecated example
bench eval simpleqa --model openai/o3-2025-04-16 -M reasoning_effort=high
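If --reasoning-effort is rejected, your installed version may predate the flag; per the Global Help tip, the command's help text lists what it supports:

# List every option bench eval accepts on your installed version
bench eval --help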

Runtime Errors

| Error | Cause | Solution |
| --- | --- | --- |
| API key not found | Missing credentials | Set OPENAI_API_KEY or the relevant env var |
| Rate limit exceeded | Too many parallel requests | Reduce --max-connections |
| Model not found | Invalid model name | Check the provider's documentation |
| Timeout | Slow model responses | Increase --timeout |
| Out of memory | Large benchmark/batch | Use --limit to reduce size |
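
The throughput-related errors above can usually be fixed by tuning the same run (a sketch; the flag values are illustrative starting points, not recommendations):

# Fewer parallel requests, a longer per-request timeout, and a smaller run
bench eval mmlu --model groq/llama-3.3-70b --max-connections 5 --timeout 300 --limit 100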

Still Need Help?

GitHub Issues: Report bugs or ask questions