
Configuration Methods

openbench can be configured through multiple methods, listed in order of precedence (highest first):
  1. Command-Line Arguments
# Run an eval, passing options as CLI flags
bench eval mmlu --max-tokens 10000 --temperature 0.5
See what else you can do with the openbench CLI.
  2. Environment Variables
# Set configuration defaults
export BENCH_MAX_TOKENS=10000
export BENCH_TEMPERATURE=0.5
  3. Configuration Files (.env files)
.env
# Write default values to a config file
BENCH_MAX_TOKENS=10000
BENCH_TEMPERATURE=0.5
# Automatically loaded if present in the current directory
bench eval mmlu

# Or source manually
source .env
bench eval mmlu
If a parameter is not set through any of the above methods, it falls back to its default value.
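
Because command-line arguments sit highest in the precedence order, a flag overrides an environment variable set for the same parameter. A minimal sketch:

# The environment variable sets a default...
export BENCH_TEMPERATURE=0.5
# ...but the CLI flag takes precedence, so this run uses 0.9
bench eval mmlu --temperature 0.9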

Configuration Parameters

Model Configuration

CLI Flag | Environment Variable | Description
--model | BENCH_MODEL | Model(s) to evaluate.
--model-base-url | BENCH_MODEL_BASE_URL | Base URL for model(s).
--model-role | BENCH_MODEL_ROLE | Map role(s) to specific models.
Some providers may support additional parameters. See provider-specific model configuration for extended features.
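
As an illustration, these flags can point an eval at a self-hosted OpenAI-compatible endpoint; the model name and URL below are placeholders:

# Placeholder model name and endpoint URL
bench eval mmlu \
  --model openai/my-local-model \
  --model-base-url http://localhost:8000/v1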

Task Configuration

CLI Flag | Environment Variable | Description
--epochs | BENCH_EPOCHS | Number of epochs to run each evaluation.
--epochs-reducer | BENCH_EPOCHS_REDUCER | Reducer(s) applied when aggregating epoch scores.
--limit | BENCH_LIMIT | Limit evaluated samples (single number or range).
--message-limit | BENCH_MESSAGE_LIMIT | Maximum number of messages per sample.
--score / --no-score | BENCH_SCORE | Grade the benchmark, or skip scoring (you can score later).
Use the -T flag for other task-specific arguments
(e.g. bench eval graphwalks --model ollama/llama3.1:70b -T task=parents)
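
For instance, a quick smoke-test run might cap the sample count, repeat each sample over a few epochs, and defer grading; mean is assumed here to be an available reducer:

bench eval mmlu \
  --limit 100 \
  --epochs 3 \
  --epochs-reducer mean \
  --no-score  # score the logs later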

Generation Settings

CLI Flag | Environment Variable | Description
--temperature | BENCH_TEMPERATURE | Model sampling temperature.
--top-p | BENCH_TOP_P | Nucleus sampling probability (top-p).
--max-tokens | BENCH_MAX_TOKENS | Maximum tokens for model response.
--seed | BENCH_SEED | Random seed for deterministic generation.
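
Combining these, a low-temperature run pinned to a seed could look like the sketch below (whether the seed yields fully deterministic output depends on the provider):

bench eval mmlu \
  --temperature 0.0 \
  --top-p 1.0 \
  --max-tokens 2048 \
  --seed 42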

Performance & Concurrency

CLI Flag | Environment Variable | Description
--max-connections | BENCH_MAX_CONNECTIONS | Maximum number of parallel requests.
--max-subprocesses | BENCH_MAX_SUBPROCESSES | Maximum number of parallel subprocesses.
--max-tasks | BENCH_MAX_TASKS | Maximum number of tasks to run concurrently.
--timeout | BENCH_TIMEOUT | Timeout per model API request (seconds).
--max-retries | BENCH_MAX_RETRIES | Maximum number of retries for API requests.
--retry-on-error | BENCH_RETRY_ON_ERROR | Retry samples on errors (optionally set the number of retries).
--fail-on-error | BENCH_FAIL_ON_ERROR | Failure threshold for sample errors (percentage or count).
--no-fail-on-error | BENCH_NO_FAIL_ON_ERROR | Do not fail the evaluation if errors occur.
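
Against a rate-limited provider, you might lower concurrency and lean on retries. The values below are illustrative, and the fractional --fail-on-error threshold assumes values under 1 are read as a percentage of samples:

bench eval mmlu \
  --max-connections 4 \
  --timeout 120 \
  --max-retries 5 \
  --fail-on-error 0.1  # assumed: fail if more than 10% of samples error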

Logging & Output

CLI Flag | Environment Variable | Description
--logfile | BENCH_OUTPUT | Output file for results.
--log-format | BENCH_LOG_FORMAT | Output logging format (eval / json).
--display | BENCH_DISPLAY | Display type for evaluation progress.
--log-samples / --no-log-samples | BENCH_LOG_SAMPLES | Include or exclude detailed samples in logs.
--log-images / --no-log-images | BENCH_LOG_IMAGES | Include or exclude base64-encoded images in logs.
--log-buffer | BENCH_LOG_BUFFER | Number of samples to buffer before writing logs.
--log-dir | BENCH_LOG_DIR | Directory for log files.
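
For example, to write JSON-format logs to a custom directory while omitting per-sample detail:

bench eval mmlu \
  --log-format json \
  --log-dir ./logs/mmlu \
  --no-log-samples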

Sandbox & Execution

CLI Flag | Environment Variable | Description
--sandbox | BENCH_SANDBOX | Environment to run the evaluation in (local or docker).
--sandbox-cleanup / --no-sandbox-cleanup | BENCH_SANDBOX_CLEANUP | Clean up sandbox environments after tasks (or skip cleanup).
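
Code-execution benchmarks such as humaneval typically run in Docker; skipping cleanup keeps the containers around for post-run inspection:

bench eval humaneval \
  --sandbox docker \
  --no-sandbox-cleanup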

Hub Integration

CLI Flag | Environment Variable | Description
--hub-repo | BENCH_HUB_REPO | Target Hub dataset repo for logs.
--hub-private | BENCH_HUB_PRIVATE | Push the Hub dataset as private.
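
For example, to push evaluation logs to a private Hub dataset:

# The repo name below is a placeholder
bench eval mmlu \
  --hub-repo your-username/openbench-logs \
  --hub-private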

Debugging & Inspection

CLI Flag | Environment Variable | Description
--debug | BENCH_DEBUG | Enable debug mode with full stack traces.
--debug-errors | BENCH_DEBUG_ERRORS | Enable debug mode for errors only.
--trace | BENCH_TRACE | Trace message interactions with the model.
--alpha | BENCH_ALPHA | Allow running experimental/alpha benchmarks.
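
For example, to get full stack traces and a trace of every message exchanged with the model:

bench eval mmlu --debug --trace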

Configuration Examples

Simple Custom Configuration

bench eval humaneval \
  --model openai/gpt-4o \
  --temperature 0.5 \
  --max-tokens 512 \
  --epochs 5 \
  --sandbox docker

Model Comparison

bench eval mmlu --limit 500 \
    --model groq/llama-3.3-70b \
    --model openai/gpt-4o \
    --model anthropic/claude-3-5-sonnet

Model Role Assignment

bench eval simpleqa \
  --model-role candidate=groq/llama-3.3-70b \
  --model-role grader=openai/gpt-4o

Batch Configuration

bench eval --model openai/gpt-4.1 \
  math --limit 20 \
  mmlu --limit 20

Retrying an Interrupted Evaluation

bench eval-retry logs/incomplete.json \
  --max-retries 3 \
  --retry-on-error

Display Configuration

# Display modes
export BENCH_DISPLAY="rich"           # Rich terminal output (default)
export BENCH_DISPLAY="plain"          # Simple text output
export BENCH_DISPLAY="none"           # No output (logs only)
export BENCH_DISPLAY="conversation"   # Show full conversations

bench eval mmlu --display conversation