Evaluating models and agents equipped with search tools is challenging because real-time retrieval is inherently dynamic. To perform well, search-enabled AI needs to parse ambiguous queries, answer both single-hop and multi-hop questions, and weigh conflicting sources against one another. OpenBench provides standardized, reproducible implementations of leading search evaluation suites spanning adversarial multi-hop reasoning, large-scale factual recall, human-curated verification, and end-to-end research workflows.

Search Benchmarks

SealQA

Evaluates search-augmented language models on fact-seeking questions with noisy web results. Seal-0 contains the most challenging queries, on which current models score near-zero accuracy.
bench eval sealqa -T subset=seal_0
Seal-Hard expands Seal-0 into a broader test of factual accuracy and reasoning.
bench eval sealqa -T subset=seal_hard
LongSeal tests sustained retrieval and reasoning over extended, many-document contexts.
bench eval sealqa -T subset=longseal
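Like any OpenBench task, SealQA runs can target a specific model with --model. The model string below is only an illustrative placeholder; substitute whichever provider/model identifier your environment is configured for.
bench eval sealqa -T subset=seal_0 --model groq/llama-3.3-70b-versatile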

SimpleQA

A collection of over 4,000 fact-seeking questions on topics ranging from TV show trivia to the history of science.
bench eval simpleqa
SimpleQA Verified is Google DeepMind’s manually verified upgrade to SimpleQA, providing a more topically balanced, de-duplicated, and accurately labeled dataset for assessing factual accuracy in search tasks.
bench eval simpleqa_verified
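For a quick smoke test before committing to a full run, you can cap the number of graded samples. This assumes your OpenBench version passes Inspect AI's --limit option through bench eval; check bench eval --help if unsure.
bench eval simpleqa_verified --limit 50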

DeepResearch Bench

End-to-end research missions that grade planning, browsing, note-taking, and citation hygiene.
bench eval deepresearch

BrowseComp

Challenging queries that require persistent web browsing and navigation to track down obscure, entangled information.
bench eval browsecomp
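To sweep several of these benchmarks in one pass, a plain shell loop over the task names above is enough. This is just a convenience sketch, not a built-in OpenBench feature; SealQA is run separately because its subset is selected explicitly with -T.
for task in simpleqa simpleqa_verified deepresearch browsecomp; do
  bench eval "$task"
done
bench eval sealqa -T subset=seal_0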