Overview

The bench eval command is the core of openbench, allowing you to evaluate any supported model on any available benchmark.

Usage

bench eval <benchmark> [options]

Arguments

| Argument | Description | Required |
| --- | --- | --- |
| benchmark | Name of the benchmark to run | Yes |
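
As a minimal sketch of the usage pattern above, the following fills in the one required argument plus a single option, reusing the benchmark and model names from the example later on this page. The command is only assembled and printed here; actually running it assumes openbench is installed and that credentials for the model provider are already configured.

```shell
# Positional argument: the benchmark to run (name reused from the example on this page)
benchmark="humaneval"
# Option value: the model to evaluate (also reused from the example on this page)
model="openai/gpt-4o"

# Assemble the command as positional parameters so each word stays a separate argument
set -- bench eval "$benchmark" --model "$model"

# Print the assembled command for inspection
echo "$*"

# To actually run it (assumption: openbench installed, provider credentials set):
# "$@"
```

Printing the command first makes it easy to check the argument order before spending API calls on a real run.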

Basic Configuration Options

Commonly used configuration options for model selection, evaluation control, and performance optimization:

| Option | Description |
| --- | --- |
| --model | Model to evaluate |
| --limit | Number of questions to evaluate |
| --epochs | Number of evaluation rounds |
| --temperature | Sampling temperature |
| --top-p | Nucleus sampling |
| --seed | Random seed for reproducibility |
| --message-limit | Max messages per sample |
| --max-tokens | Maximum response tokens |
| --max-connections | Concurrent API calls |
| --max-subprocesses | Parallel subprocesses |
| --max-tasks | Maximum number of tasks to run concurrently |
| --fail-on-error | Failure threshold for sample errors |
| --timeout | Request timeout (seconds) |
| --sandbox | Container for running generated code |

For the full collection of eval configuration parameters, see Configuration.
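
One way these options might be combined is a short trial run before a full evaluation. The sketch below uses --limit, --seed, and --temperature from the options listed above; the values 10, 42, and 0 are illustrative assumptions, not documented defaults or recommendations. As before, the command is assembled and printed rather than executed.

```shell
# Illustrative values only; none of these are documented defaults
limit=10        # number of questions to evaluate
seed=42         # random seed for reproducibility
temperature=0   # sampling temperature

# Assemble the command as positional parameters, one word per argument
set -- bench eval humaneval \
  --model openai/gpt-4o \
  --limit "$limit" \
  --seed "$seed" \
  --temperature "$temperature"

# Print the assembled command for inspection
echo "$*"

# To actually run it (assumption: openbench installed, provider credentials set):
# "$@"
```

Keeping the values in named variables makes it simple to raise the limit or drop it entirely once the trial run looks right.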

Example Configured Evaluation

Basic Evaluation

bench eval humaneval \
  --model openai/gpt-4o \
  --temperature 0.5 \
  --max-tokens 512 \
  --epochs 5 \
  --max-connections 5 \
  --timeout 120 \
  --sandbox docker