openbench includes reasoning benchmarks that cover factuality, logical reasoning, multi-hop inference, and multimodal understanding. Each benchmark below is listed with the command that runs it.

Available Benchmarks

SimpleQA

Tests factual accuracy on short, straightforward questions with verifiable answers.
bench eval simpleqa

MuSR

Multistep Soft Reasoning benchmark with murder mysteries, object placements, and team allocation problems.
bench eval musr

DROP

Discrete Reasoning Over Paragraphs: reading comprehension questions that require numerical and span-based reasoning over text.
bench eval drop

GraphWalks

Multi-hop reasoning through graph structures to test navigation and inference.
bench eval graphwalks

BrowseComp

Web browsing agent tasks requiring navigation and information synthesis.
bench eval browsecomp

MMMU

Massive Multi-discipline Multimodal Understanding across college-level subjects.
bench eval mmmu

MMMU Pro

Enhanced version of MMMU with more challenging multimodal problems.
bench eval mmmu_pro
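
Running all benchmarks

To run every benchmark on this page in sequence, a small shell loop is usually enough. The sketch below is an illustration, not an openbench feature: the `run_all` function and the `DRY_RUN` variable are hypothetical names introduced here, and it assumes `bench` is on your PATH with a model already configured. By default it only prints the commands it would run.

```shell
# Benchmark names as they appear in the commands on this page.
benchmarks="simpleqa musr drop graphwalks browsecomp mmmu mmmu_pro"

# run_all is a hypothetical helper, not part of openbench.
# With DRY_RUN unset or set to 1 it prints each command;
# with DRY_RUN=0 it actually invokes bench for each benchmark.
run_all() {
  for name in $benchmarks; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "bench eval $name"
    else
      bench eval "$name"
    fi
  done
}
```

Call `run_all` to preview the commands, then `DRY_RUN=0 run_all` to execute them. Keeping the dry run as the default avoids starting a long evaluation by accident.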