openbench provides a diverse set of coding benchmarks to assess model capabilities in code generation, problem solving, and software engineering tasks across multiple programming languages.
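Every benchmark below is launched with the bench eval command. In practice you will usually also select which model to evaluate; the invocation below is a minimal sketch that assumes the --model flag accepts a provider/model identifier and that the model named here is just a placeholder you would swap for one configured in your environment.

# Evaluate HumanEval with an explicitly selected model (model ID is a placeholder)
bench eval humaneval --model groq/llama-3.3-70b-versatile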

Available Benchmarks

HumanEval

164 hand-written Python programming problems testing function-level code generation.
bench eval humaneval

MBPP

Mostly Basic Programming Problems - entry-level Python programming challenges.
bench eval mbpp

SciCode

Scientific computing problems requiring domain knowledge and programming skills. (Alpha)
bench eval scicode --alpha

GMCQ

Graduate-level multiple-choice questions on computer science fundamentals.
bench eval gmcq

JSONSchemaBench

Tests a model's ability to generate valid JSON output that conforms to a given schema.
bench eval jsonschemabench

Exercism

Real-world coding tasks run as an agentic evaluation across five programming languages.
bench eval exercism
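
To smoke-test any of these benchmarks before committing to a full run, the number of evaluated samples can typically be capped. The command below is a sketch under the assumption that openbench exposes a --limit option for this; consult the CLI help (bench eval --help) to confirm the options your installed version supports.

# Run only the first 5 Exercism tasks as a quick smoke test
# (assumes a --limit option is available; model ID is a placeholder)
bench eval exercism --model groq/llama-3.3-70b-versatile --limit 5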