
Configuration Methods

openbench can be configured through multiple methods, listed in order of precedence (highest first):
  1. Command-Line Arguments
# Run an eval, passing options as CLI flags
bench eval mmlu --max-tokens 10000 --temperature 0.5
See what else you can do with the openbench CLI.
  2. Environment Variables
# Set configuration defaults
export BENCH_MAX_TOKENS=10000
export BENCH_TEMPERATURE=0.5
  3. Configuration Files (.env files)
.env
# Write default values to a config file
BENCH_MAX_TOKENS=10000
BENCH_TEMPERATURE=0.5
# Automatically loaded if present in the current directory
bench eval mmlu

# Or source manually
source .env
bench eval mmlu
If a parameter is not set through any of the above methods, it falls back to its default value.
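
Because command-line arguments sit highest in the precedence order, a flag overrides an environment variable set for the same parameter. A minimal sketch:

# The environment variable sets a default...
export BENCH_TEMPERATURE=0.5
# ...but the CLI flag takes precedence, so this run uses 0.9
bench eval mmlu --temperature 0.9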

Configuration Parameters

Model Configuration

CLI Flag | Environment Variable | Description
--model | BENCH_MODEL | Model(s) to evaluate.
--model-base-url | BENCH_MODEL_BASE_URL | Base URL for model(s).
--model-role | BENCH_MODEL_ROLE | Map role(s) to specific models.
Some providers may support additional parameters. See provider-specific model configuration for extended features.
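
As an illustration, these flags can point an eval at a self-hosted OpenAI-compatible endpoint; the model name and URL below are placeholders:

# Placeholder model name and endpoint URL
bench eval mmlu \
  --model openai/my-local-model \
  --model-base-url http://localhost:8000/v1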

Task Configuration

CLI Flag | Environment Variable | Description
--epochs | BENCH_EPOCHS | Number of epochs to run each evaluation.
--epochs-reducer | BENCH_EPOCHS_REDUCER | Reducer(s) applied when aggregating epoch scores.
--limit | BENCH_LIMIT | Limit evaluated samples (single number or range).
--message-limit | BENCH_MESSAGE_LIMIT | Maximum number of messages per sample.
--score / --no-score | BENCH_SCORE | Grade the benchmark, or skip scoring (you can score later).
Use the -T flag for other task-specific arguments
(e.g. bench eval graphwalks --model ollama/llama3.1:70b -T task=parents)
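
For instance, a quick smoke-test run might cap the sample count, repeat each sample over a few epochs, and defer grading; mean is assumed here to be an available reducer:

bench eval mmlu \
  --limit 100 \
  --epochs 3 \
  --epochs-reducer mean \
  --no-score  # score the logs later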

Generation Settings

CLI Flag | Environment Variable | Description
--temperature | BENCH_TEMPERATURE | Model sampling temperature.
--top-p | BENCH_TOP_P | Nucleus sampling probability (top-p).
--max-tokens | BENCH_MAX_TOKENS | Maximum tokens for model response.
--seed | BENCH_SEED | Random seed for deterministic generation.
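
Combining these, a low-temperature run pinned to a seed could look like the sketch below (whether the seed yields fully deterministic output depends on the provider):

bench eval mmlu \
  --temperature 0.0 \
  --top-p 1.0 \
  --max-tokens 2048 \
  --seed 42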

Performance & Concurrency

CLI Flag | Environment Variable | Description
--max-connections | BENCH_MAX_CONNECTIONS | Maximum number of parallel requests.
--max-subprocesses | BENCH_MAX_SUBPROCESSES | Maximum number of parallel subprocesses.
--max-tasks | BENCH_MAX_TASKS | Maximum number of tasks to run concurrently.
--timeout | BENCH_TIMEOUT | Timeout per model API request (seconds).
--max-retries | BENCH_MAX_RETRIES | Maximum number of retries for API requests.
--retry-on-error | BENCH_RETRY_ON_ERROR | Retry samples on errors (optionally set the number of retries).
--fail-on-error | BENCH_FAIL_ON_ERROR | Failure threshold for sample errors (percentage or count).
--no-fail-on-error | BENCH_NO_FAIL_ON_ERROR | Do not fail the evaluation if errors occur.
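
Against a rate-limited provider, you might lower concurrency and lean on retries. The values below are illustrative, and the fractional --fail-on-error threshold assumes values under 1 are read as a percentage of samples:

bench eval mmlu \
  --max-connections 4 \
  --timeout 120 \
  --max-retries 5 \
  --fail-on-error 0.1  # assumed: fail if more than 10% of samples error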

Logging & Output

CLI Flag | Environment Variable | Description
--logfile | BENCH_OUTPUT | Output file for results.
--log-format | BENCH_LOG_FORMAT | Output logging format (eval / json).
--display | BENCH_DISPLAY | Display type for evaluation progress.
--log-samples / --no-log-samples | BENCH_LOG_SAMPLES | Include or exclude detailed samples in logs.
--log-images / --no-log-images | BENCH_LOG_IMAGES | Include or exclude base64-encoded images in logs.
--log-buffer | BENCH_LOG_BUFFER | Number of samples to buffer before writing logs.
--log-dir | BENCH_LOG_DIR | Directory for log files.
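
For example, to write JSON-format logs to a custom directory while omitting per-sample detail:

bench eval mmlu \
  --log-format json \
  --log-dir ./logs/mmlu \
  --no-log-samples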

Sandbox & Execution

CLI Flag | Environment Variable | Description
--sandbox | BENCH_SANDBOX | Environment to run the evaluation in (local or docker).
--sandbox-cleanup / --no-sandbox-cleanup | BENCH_SANDBOX_CLEANUP | Clean up sandbox environments after tasks (or skip cleanup).
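
Code-execution benchmarks such as humaneval typically run in Docker; skipping cleanup keeps the containers around for post-run inspection:

bench eval humaneval \
  --sandbox docker \
  --no-sandbox-cleanup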

Hub Integration

CLI Flag | Environment Variable | Description
--hub-repo | BENCH_HUB_REPO | Target Hub dataset repo for logs.
--hub-private | BENCH_HUB_PRIVATE | Push the Hub dataset as private.
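
For example, to push evaluation logs to a private Hub dataset:

# The repo name below is a placeholder
bench eval mmlu \
  --hub-repo your-username/openbench-logs \
  --hub-private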

Debugging & Inspection

CLI Flag | Environment Variable | Description
--debug | BENCH_DEBUG | Enable debug mode with full stack traces.
--debug-errors | BENCH_DEBUG_ERRORS | Enable debug mode for errors only.
--trace | BENCH_TRACE | Trace message interactions with the model.
--alpha | BENCH_ALPHA | Allow running experimental/alpha benchmarks.
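
For example, to get full stack traces and a trace of every message exchanged with the model:

bench eval mmlu --debug --trace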

Configuration Examples

Simple Custom Configuration

bench eval humaneval \
  --model openai/gpt-4o \
  --temperature 0.5 \
  --max-tokens 512 \
  --epochs 5 \
  --sandbox docker

Model Comparison

bench eval mmlu --limit 500 \
    --model groq/llama-3.3-70b \
    --model openai/gpt-4o \
    --model anthropic/claude-3-5-sonnet

Model Role Assignment

bench eval simpleqa \
  --model-role candidate=groq/llama-3.3-70b \
  --model-role grader=openai/gpt-4o

Batch Configuration

bench eval --model openai/gpt-4.1 \
  math --limit 20 \
  mmlu --limit 20

Retrying an Interrupted Evaluation

bench eval-retry logs/incomplete.json \
  --max-retries 3 \
  --retry-on-error

Display Configuration

# Display modes
export BENCH_DISPLAY="rich"           # Rich terminal output (default)
export BENCH_DISPLAY="plain"          # Simple text output
export BENCH_DISPLAY="none"           # No output (logs only)
export BENCH_DISPLAY="conversation"   # Show full conversations

bench eval mmlu --display conversation