Your First Evaluation
Let’s run a complete benchmark evaluation from start to finish. We’ll evaluate a model on MMLU, one of the most popular benchmarks for testing general knowledge.
Step 1: Install openbench
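A typical installation looks like the following, assuming the package is published to PyPI as `openbench` (installing with `uv tool install openbench` works as well):

```bash
# Install openbench and its bench CLI
pip install openbench

# Confirm the CLI is on your PATH
bench --help
```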
Step 2: Set Your API Key
Set the environment variable for your provider (see the example below):
- Groq
- OpenAI
- Anthropic
- OpenRouter
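For instance, in your shell (the key values are placeholders; the variable names follow each provider’s standard convention):

```bash
# Set whichever key matches your provider
export GROQ_API_KEY=your_groq_key
export OPENAI_API_KEY=your_openai_key
export ANTHROPIC_API_KEY=your_anthropic_key
export OPENROUTER_API_KEY=your_openrouter_key
```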
Or create a .env file in your project directory:
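For example (contents of `.env`; use the variable name for your provider):

```bash
# .env
GROQ_API_KEY=your_groq_key
```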
Step 3: Run Your First Benchmark
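A minimal first run might look like this (the model name is illustrative; substitute any model your provider key supports):

```bash
# Evaluate 10 MMLU questions as a quick smoke test
bench eval mmlu --model groq/llama-3.1-8b-instant --limit 10
```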
The --limit 10 flag runs only 10 questions for a quick test. Remove it to run the full benchmark (14,042 questions).
Learn more about configuration flags.
Understanding the Output
When you run a benchmark, you’ll see:
- Progress Bar: Shows evaluation progress in real time
- Live Metrics: Accuracy updates as questions are answered
- Final Results: Overall accuracy and benchmark-specific metrics
- Log Location: Where detailed results are saved
Viewing and Analyzing Results
Interactive Viewer
Launch the web-based viewer for detailed sample-by-sample results with:
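For example, from the directory where you ran the evaluation (this assumes logs were written to the default location):

```bash
# Opens the interactive log viewer in your browser
bench view
```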
Log Files
Results are saved as .eval files in ./logs/. The exact file path is printed as part of the eval output.
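If you want to locate a log directly, you can also list the directory yourself:

```bash
# One .eval file is written per run
ls ./logs/
```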