Available Benchmarks
MMLU
Massive Multitask Language Understanding across 57 subjects including STEM, humanities, and social sciences.
MMLU-Pro
Harder successor to MMLU with more reasoning-focused questions and ten answer options per question instead of four.
GPQA Diamond
PhD-level science questions in physics, chemistry, and biology.
SuperGPQA
Extended graduate-level question answering spanning 285 academic disciplines.
TUMLU
Multitask language understanding benchmark for Turkic languages, covering 9 languages.
OpenBookQA
Question answering requiring multi-step reasoning with elementary science knowledge.
HLE
Humanity’s Last Exam: 2,500 questions written by more than 1,000 domain experts across diverse fields.
HLE Text
Text-only subset of Humanity’s Last Exam, excluding questions that include images.