Configuration Methods
openbench can be configured through multiple methods, listed here from highest to lowest precedence:
- Command-Line Arguments
- Environment Variables
- Configuration Files (.env files)
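
The precedence order above can be illustrated with a minimal Python sketch. This is not openbench's actual implementation: `parse_dotenv` and `resolve_setting` are hypothetical helpers written for this example, and the simple `KEY=VALUE` parsing is an assumption about what a .env file contains.

```python
from typing import Mapping, Optional


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(
    env_name: str,
    cli_value: Optional[str],
    environ: Mapping[str, str],
    dotenv: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found: CLI argument, then environment, then .env."""
    if cli_value is not None:
        return cli_value
    if env_name in environ:
        return environ[env_name]
    return dotenv.get(env_name)


# Hypothetical values, chosen only to show which source wins.
dotenv = parse_dotenv("BENCH_MODEL=from-dotenv\nBENCH_EPOCHS=3\n")
environ = {"BENCH_MODEL": "from-environment"}

print(resolve_setting("BENCH_MODEL", "from-cli", environ, dotenv))  # from-cli
print(resolve_setting("BENCH_MODEL", None, environ, dotenv))        # from-environment
print(resolve_setting("BENCH_EPOCHS", None, environ, dotenv))       # 3
```

A value given on the command line wins, an environment variable is used only when no flag is passed, and the .env file acts as the fallback.
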
Configuration Parameters
Model Configuration
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --model | BENCH_MODEL | Model(s) to evaluate. |
| --model-base-url | BENCH_MODEL_BASE_URL | Base URL for model(s). |
| --model-role | BENCH_MODEL_ROLE | Map role(s) to specific models. |
Some providers may support additional parameters. See provider-specific model configuration for extended features.
Task Configuration
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --epochs | BENCH_EPOCHS | Number of epochs to run each evaluation. |
| --epochs-reducer | BENCH_EPOCHS_REDUCER | Reducer(s) applied when aggregating epoch scores. |
| --limit | BENCH_LIMIT | Limit evaluated samples (single number or range). |
| --message-limit | BENCH_MESSAGE_LIMIT | Maximum number of messages per sample. |
| --score / --no-score | BENCH_SCORE | Grade the benchmark, or skip scoring (can score later). |
Use the -T flag to pass other task-specific arguments (e.g. bench eval graphwalks --model ollama/llama3.1:70b -T task=parents).
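
To make the `key=value` shape of -T arguments concrete, here is a small Python sketch of how such pairs could be collected and parsed. It is an illustration only, not openbench's parser: `parse_task_args` and the `bench-sketch` program name are hypothetical, and allowing -T to be repeated is an assumption.

```python
import argparse


def parse_task_args(pairs: list[str]) -> dict[str, str]:
    """Turn ["task=parents", ...] into {"task": "parents", ...}."""
    args: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        args[key] = value
    return args


# Hypothetical stand-in for the real CLI; only the -T option is modeled.
parser = argparse.ArgumentParser(prog="bench-sketch")
parser.add_argument(
    "-T", dest="task_args", action="append", default=[], metavar="KEY=VALUE"
)

namespace = parser.parse_args(["-T", "task=parents"])
task_args = parse_task_args(namespace.task_args)
print(task_args)  # {'task': 'parents'}
```

Values stay as strings in this sketch; any conversion to numbers or booleans would be up to the task that receives them.
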
Generation Settings
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --temperature | BENCH_TEMPERATURE | Model sampling temperature. |
| --top-p | BENCH_TOP_P | Nucleus sampling. |
| --max-tokens | BENCH_MAX_TOKENS | Maximum tokens for model response. |
| --seed | BENCH_SEED | Random seed for deterministic generation. |
Performance & Concurrency
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --max-connections | BENCH_MAX_CONNECTIONS | Maximum number of parallel requests. |
| --max-subprocesses | BENCH_MAX_SUBPROCESSES | Maximum number of parallel subprocesses. |
| --max-tasks | BENCH_MAX_TASKS | Maximum number of tasks to run concurrently. |
| --timeout | BENCH_TIMEOUT | Timeout per model API request (seconds). |
| --max-retries | BENCH_MAX_RETRIES | Maximum number of retries for API requests. |
| --retry-on-error | BENCH_RETRY_ON_ERROR | Retry samples on errors (set number of retries). |
| --fail-on-error | BENCH_FAIL_ON_ERROR | Failure threshold for sample errors (percentage or count). |
| --no-fail-on-error | BENCH_NO_FAIL_ON_ERROR | Do not fail evaluation if errors occur. |
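
The "percentage or count" threshold for --fail-on-error can be read in more than one way, so the following Python sketch states its assumptions explicitly: a value below 1 is treated as a fraction of all samples, a value of 1 or more as an absolute count, and the evaluation fails only when errors exceed the threshold. `should_fail` is a hypothetical helper, not an openbench function, and the actual semantics may differ.

```python
def should_fail(threshold: float, errors: int, total: int) -> bool:
    """Decide whether an evaluation should fail, under the assumed semantics.

    threshold < 1  -> fraction of total samples allowed to error
    threshold >= 1 -> absolute number of samples allowed to error
    """
    if threshold < 1:
        return total > 0 and errors / total > threshold
    return errors > threshold


print(should_fail(0.1, 1, 20))   # False: 5% of samples errored, limit is 10%
print(should_fail(0.1, 3, 20))   # True: 15% of samples errored
print(should_fail(5, 5, 100))    # False: exactly 5 errors does not exceed 5
print(should_fail(5, 6, 100))    # True: 6 errors exceeds 5
```
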
Logging & Output
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --logfile | BENCH_OUTPUT | Output file for results. |
| --log-format | BENCH_LOG_FORMAT | Output logging format (eval / json). |
| --display | BENCH_DISPLAY | Display type for evaluation progress. |
| --log-samples / --no-log-samples | BENCH_LOG_SAMPLES | Include or exclude detailed samples in logs. |
| --log-images / --no-log-images | BENCH_LOG_IMAGES | Include or exclude base64 encoded images in logs. |
| --log-buffer | BENCH_LOG_BUFFER | Number of samples to buffer before writing logs. |
| --log-dir | BENCH_LOG_DIR | Directory for log files. |
Sandbox & Execution
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --sandbox | BENCH_SANDBOX | Environment in which to run the evaluation (local or docker). |
| --sandbox-cleanup / --no-sandbox-cleanup | BENCH_SANDBOX_CLEANUP | Clean up sandbox environments after tasks (or skip cleanup). |
Hub Integration
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --hub-repo | BENCH_HUB_REPO | Target Hub dataset repo for logs. |
| --hub-private | BENCH_HUB_PRIVATE | Push Hub dataset as private. |
Debugging & Inspection
| CLI Flag | Environment Variable | Description |
|---|---|---|
| --debug | BENCH_DEBUG | Enable debug mode with full stack traces. |
| --debug-errors | BENCH_DEBUG_ERRORS | Enable debug mode for errors only. |
| --trace | BENCH_TRACE | Trace message interactions with model. |
| --alpha | BENCH_ALPHA | Allow running experimental/alpha benchmarks. |
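
Most rows in the tables above follow one naming pattern: drop the leading dashes, replace the remaining dashes with underscores, uppercase the result, and prefix it with `BENCH_`. The Python sketch below encodes that pattern as observed in the tables. `flag_to_env_var` and `OVERRIDES` are hypothetical names for this illustration, not part of openbench.

```python
# Exceptions to the naming pattern, taken from the tables above.
OVERRIDES = {"--logfile": "BENCH_OUTPUT"}


def flag_to_env_var(flag: str) -> str:
    """Map a CLI flag such as --max-connections to its environment variable."""
    if flag in OVERRIDES:
        return OVERRIDES[flag]
    return "BENCH_" + flag.lstrip("-").replace("-", "_").upper()


print(flag_to_env_var("--max-connections"))  # BENCH_MAX_CONNECTIONS
print(flag_to_env_var("--top-p"))            # BENCH_TOP_P
print(flag_to_env_var("--logfile"))          # BENCH_OUTPUT
```

Paired flags such as --score / --no-score share a single variable (BENCH_SCORE), so only the positive form should be passed to this helper.
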