Search Benchmarks
SealQA
Evaluates search-augmented language models on fact-seeking questions with noisy web results. Seal_0 includes the most challenging queries (near-zero accuracy). Seal_Hard expands Seal_0 to test factual accuracy and reasoning more broadly. Long Seal tests sustained retrieval and reasoning over extended contexts.
SimpleQA
A collection of over 4,000 factual questions spanning topics ranging from TV show trivia to scientific history. SimpleQA Verified is Google DeepMind’s manually verified upgrade to SimpleQA, providing a more topically balanced, de-duplicated, and accurately labeled dataset for assessing factual accuracy in search tasks.
DeepResearch Bench
End-to-end research missions graded on planning, browsing, note-taking, and citation hygiene.
BrowseComp
Challenging queries requiring persistent web browsing and navigation to find obscure and entangled information.