Your First Evaluation

Let’s run a complete benchmark evaluation from start to finish. We’ll evaluate a model on MMLU, one of the most popular benchmarks for testing general knowledge.

Step 1: Install openbench

# Install UV if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install openbench
uv venv
source .venv/bin/activate
uv pip install openbench
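
If the install succeeded, the bench command should now be available inside the activated environment. A quick sanity check (assuming the virtual environment from above is still active):

# Confirm the package is installed and the CLI responds
uv pip show openbench
bench --help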

Step 2: Set Your API Key

Set the API key for your provider as an environment variable. Groq, OpenAI, Anthropic, and OpenRouter are all supported. For example, with Groq:
export GROQ_API_KEY="gsk_..."
Or create a .env file in your project directory:
# Model Provider API Keys
GROQ_API_KEY="gsk_..."
OPENAI_API_KEY="sk_..."
ANTHROPIC_API_KEY="sk-ant-..."
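OpenRouter is also supported; by convention its key is read from OPENROUTER_API_KEY (the variable name and key prefix below are assumptions, so double-check the provider page if it isn't picked up):

# OpenRouter (variable name assumed; verify against the provider docs)
OPENROUTER_API_KEY="sk-or-..."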
See supported providers.

Step 3: Run Your First Benchmark

# Quick test with 10 questions
bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10
The --limit 10 flag runs only 10 questions for a quick test. Remove it to run the full benchmark (14,042 questions). Learn more about configuration flags.
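
Once the quick test looks good, you can swap in any supported provider/model pair and drop the limit for a full run. For example, a full MMLU pass against the model shown in the sample output below (this will take a while and consume API credits):

# Full MMLU run (all 14,042 questions) against a different provider
bench eval mmlu --model openai/gpt-4.1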

Understanding the Output

When you run a benchmark, you’ll see:
  1. Progress Bar: Shows evaluation progress in real-time
  2. Live Metrics: Accuracy updates as questions are answered
  3. Final Results: Overall accuracy and benchmark-specific metrics
  4. Log Location: Where detailed results are saved
Example Results:
╭───────────────────────────────────╮
│ mmlu (30 samples): openai/gpt-4.1 │
╰───────────────────────────────────╯
timeout: 10000, max_connections: 10, temperature: 0.5, log_buffer: 10, dataset: mmlu_simple_eval

total time:                0:00:11
openai/gpt-4.1             12,220 tokens [I: 3,242, CW: 0, CR: 0, O: 8,978, R: 0]

mcq_scorer
accuracy       0.900
stderr         0.056
std            0.305
stem_accuracy  0.900
stem_stderr    0.056
stem_std       0.305

Log: logs/2025-09-12T18-44-17-04-00_mmlu_KnXAbsNXcoQapiDi6PndUR.eval

Evaluation complete!

Viewing and Analyzing Results

Interactive Viewer

Launch the web-based viewer for detailed sample-by-sample results with:
bench view

Log Files

Results are saved as .eval files in ./logs/. The exact file path is printed at the end of the eval output:
Log: logs/2025-09-17T21-39-21-04-00_mmlu_kikHACevKMYey8r5iWEG96.eval
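
Because each run writes a new timestamped file, a plain shell one-liner (not an openbench command) is handy for grabbing the latest log, for example to open it in the viewer or attach it to a report:

# Print the path of the most recent .eval log
ls -t logs/*.eval | head -n 1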