
MMLU

57 subjects · Multiple choice · Knowledge & reasoning
Massive Multitask Language Understanding - 57 academic subjects from the cais/mmlu dataset

GPQA Diamond

Graduate-level · Science · Multiple choice
Graduate-level Google-Proof Q&A in biology, chemistry, and physics

LiveMCPBench

MCP · Agent · Tool Calling
Evaluates how effectively language models can navigate and use the Model Context Protocol (MCP) ecosystem

SimpleQA

Factuality · Model-Graded
Measuring short-form factuality in large language models with simple Q&A pairs

Exercism

Agent · Code generation
Real-world code-agent programming challenges across 5 languages

GraphWalks

Graphs · Long-context · Reasoning
Multi-hop reasoning on graphs - both BFS and parent-finding tasks

Complete Benchmark Catalog

Grader Information

Some benchmarks use a grader model to score the evaluated model's responses. This requires an additional API key for the grader model. To run these benchmarks, export your OPENAI_API_KEY:
export OPENAI_API_KEY=your_openai_key
The following benchmarks use a grader model:
| Benchmark | Default Grader Model |
| --- | --- |
| simpleqa | openai/gpt-4.1-2025-04-14 |
| hle | openai/o3-mini-2025-01-31 |
| hle_text | openai/o3-mini-2025-01-31 |
| browsecomp | openai/gpt-4.1-2025-04-14 |
| healthbench | openai/gpt-4.1-2025-04-14 |
| math | openai/gpt-4-turbo-preview |
| math_500 | openai/gpt-4-turbo-preview |
| detailbench | gpt-5-mini-2025-08-07 |
| livemcpbench | openai/gpt-4.1-mini-2025-04-14 |
| otis_mock_aime | openai/gpt-4.1-mini-2025-04-14 |
| political_evenhandedness | openai/gpt-4.1-2025-04-14 |
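Because a graded run fails partway through if the grader key is missing, it can help to check the environment before launching. A minimal sketch, assuming only standard shell builtins (the placeholder key value is illustrative, not a real credential):

```shell
# Set the grader key (replace the placeholder with your real key)
export OPENAI_API_KEY=your_openai_key

# Fail fast if the key is missing or empty before starting a graded run
if [ -z "${OPENAI_API_KEY}" ]; then
  echo "OPENAI_API_KEY is not set; graded benchmarks will fail" >&2
  exit 1
fi
echo "grader key detected"
```

Running this snippet before a graded benchmark surfaces a missing key immediately instead of mid-evaluation.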