Evaluating models and agents equipped with search tools is challenging because real-time data retrieval is inherently dynamic. To perform well, search-enabled AI must parse ambiguous queries, answer both single-hop and multi-hop questions, and adjudicate between conflicting sources. Openbench provides standardized, reproducible implementations of leading search evaluation suites spanning adversarial multi-hop reasoning, large-scale factual recall, human-curated verification, and end-to-end research workflows.
Search Benchmarks
SealQA
Evaluates search-augmented language models on fact-seeking questions with noisy web results. Seal_0 includes the most challenging queries (near-zero accuracy). Seal_Hard expands Seal_0 to more broadly test factual accuracy and reasoning. Long Seal tests sustained retrieval and reasoning over extended contexts.
SimpleQA
A collection of over 4,000 factual questions spanning topics ranging from TV show trivia to scientific history. SimpleQA Verified is Google DeepMind’s manually verified upgrade to SimpleQA, providing a more topically balanced, de-duplicated, and accurately labeled dataset for assessing factual accuracy in search tasks.
DeepResearch Bench
End-to-end research missions graded on planning, browsing, note-taking, and citation hygiene.
BrowseComp
Challenging queries requiring persistent web browsing/navigation to find obscure and entangled information.