Available Benchmarks
HumanEval
164 hand-written programming problems testing function-level code generation capabilities.
MBPP
Mostly Basic Programming Problems - entry-level Python programming challenges.
SciCode
Scientific computing problems requiring domain knowledge and programming skills. (Alpha)
GMCQ
Graduate-level multiple-choice questions on computer science fundamentals.
JSONSchemaBench
Tests ability to generate valid JSON outputs conforming to specific schemas.
Exercism
Real-world coding tasks across 5 programming languages, run as an agent evaluation.
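
As a rough illustration of what the JSONSchemaBench entry above means by "valid JSON outputs conforming to specific schemas", the sketch below checks a model's text output in two steps: it must parse as JSON, and the parsed value must satisfy the schema. This is not the benchmark's own harness. The schema, the sample outputs, and the helper names `conforms` and `is_valid_output` are hypothetical, and the checker covers only a small subset of JSON Schema (`type`, `required`, `properties`), so it is not a full validator.

```python
import json

# Hypothetical schema, invented for illustration only.
SCHEMA = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
}

# Maps JSON Schema type names to Python types (subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "null": type(None),
}


def conforms(value, schema):
    """Check `value` against a small subset of JSON Schema keywords."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and expected in ("integer", "number"):
            return False
        if not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, dict):
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        # Each declared property that is present must match its subschema.
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], subschema):
                return False
    return True


def is_valid_output(text, schema):
    """Return True if `text` parses as JSON and conforms to `schema`."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False
    return conforms(value, schema)


if __name__ == "__main__":
    print(is_valid_output('{"name": "Ada", "age": 36}', SCHEMA))
    print(is_valid_output('{"name": "Ada"}', SCHEMA))
```

An output can fail either step: text that is not JSON at all is rejected before the schema is consulted, and well-formed JSON is still rejected if a required key is missing or a value has the wrong type.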