Overview

The bench eval command is the core of openbench, allowing you to evaluate any supported model on any available benchmark.

Usage

bench eval <benchmark> [options]

Arguments

Argument     Description                      Required
benchmark    Name of the benchmark to run     Yes
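For example, to run the humaneval benchmark against a specific model (the model option is described below):

bench eval humaneval --model openai/gpt-4o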

Basic Configuration Options

Commonly used configuration options for model selection, evaluation control, and performance optimization:
Option               Description
--model              Model to evaluate
--limit              Number of questions to evaluate
--epochs             Number of evaluation rounds
--temperature        Sampling temperature
--top-p              Nucleus sampling
--seed               Random seed for reproducibility
--message-limit      Max messages per sample
--max-tokens         Maximum response tokens
--max-connections    Concurrent API calls
--max-subprocesses   Parallel subprocesses
--max-tasks          Maximum number of tasks to run concurrently
--fail-on-error      Failure threshold for sample errors
--timeout            Request timeout (seconds)
--sandbox            Container for running generated code
For the full list of eval configuration parameters, see Configuration.
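For a quick sanity check before a full run, a small --limit can be combined with a single epoch. The benchmark and model names below are illustrative; substitute any supported values:

bench eval mmlu \
  --model openai/gpt-4o \
  --limit 10 \
  --epochs 1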

Example: Configured Evaluation

bench eval humaneval \
  --model openai/gpt-4o \
  --temperature 0.5 \
  --max-tokens 512 \
  --epochs 5 \
  --max-connections 5 \
  --timeout 120 \
  --sandbox docker
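
This run evaluates HumanEval with GPT-4o at temperature 0.5, caps responses at 512 tokens, repeats the evaluation for 5 epochs, allows up to 5 concurrent API calls, times out requests after 120 seconds, and executes generated code in a Docker sandbox.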