Featured
MMLU
57 subjects · Multiple choice · Knowledge & reasoning
Massive Multitask Language Understanding - 57 academic subjects from the cais/mmlu dataset
GPQA Diamond
Graduate-level · Science · Multiple choice
Graduate-level Google-Proof Q&A in biology, chemistry, and physics
LiveMCPBench
MCP · Agent · Tool Calling
Evaluates how effectively language models can navigate and utilize the Model Context Protocol (MCP) ecosystem
SimpleQA
Factuality · Model-Graded
Measuring short-form factuality in large language models with simple Q&A pairs
Exercism
Agent · Code generation
Real-world code-agent programming challenges across 5 languages
GraphWalks
Graphs · Long-context · Reasoning
Multi-hop reasoning on graphs - both BFS and parent-finding tasks
Complete Benchmark Catalog
Grader Information
Some benchmarks use a grader model to score the model's responses, which requires an additional API key for the grader. To run these benchmarks, export your OPENAI_API_KEY:
| Benchmark | Default Grader Model |
|---|---|
| simpleqa | openai/gpt-4.1-2025-04-14 |
| hle | openai/o3-mini-2025-01-31 |
| hle_text | openai/o3-mini-2025-01-31 |
| browsecomp | openai/gpt-4.1-2025-04-14 |
| healthbench | openai/gpt-4.1-2025-04-14 |
| math | openai/gpt-4-turbo-preview |
| math_500 | openai/gpt-4-turbo-preview |
| detailbench | gpt-5-mini-2025-08-07 |
| livemcpbench | openai/gpt-4.1-mini-2025-04-14 |
| otis_mock_aime | openai/gpt-4.1-mini-2025-04-14 |
| political_evenhandedness | openai/gpt-4.1-2025-04-14 |
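For any of the model-graded benchmarks in the table above, the grader key must be visible to the process that runs the benchmark. A minimal sketch, using a placeholder key value (substitute your real key):

```shell
# Export the API key used by the grader model; the value here is a placeholder.
export OPENAI_API_KEY="sk-your-key-here"

# Sanity-check that the variable is set before launching a benchmark run.
if [ -n "$OPENAI_API_KEY" ]; then
  echo "grader key is set"
else
  echo "grader key is missing" >&2
fi
```

Adding the `export` line to your shell profile keeps the key available across sessions, though a secrets manager is preferable for shared machines.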