Available Benchmarks
SimpleQA
Tests factual accuracy on short, straightforward questions with verifiable answers.
MuSR
Multistep Soft Reasoning benchmark with murder mysteries, object placements, and team allocation problems.
DROP
Discrete Reasoning Over Paragraphs: reading comprehension that requires numerical and span-based reasoning over text.
GraphWalks
Multi-hop reasoning through graph structures to test navigation and inference.
BrowseComp
Web browsing agent tasks requiring navigation and information synthesis.
MMMU
Massive Multi-discipline Multimodal Understanding across college-level subjects.
MMMU Pro
Enhanced version of MMMU with more challenging multimodal problems.