Welcome to openbench!

openbench is an open-source framework for standardized, reproducible benchmarking of large language models (LLMs). Our goal is to make evaluation both rigorous and accessible:
  • Run industry-standard benchmarks easily on any model, wherever it’s hosted.
  • Design and run evaluations tailored to your specific needs.
  • Choose from 30+ evaluation suites spanning knowledge, reasoning, coding, mathematics, and more.
With openbench, you can build trust in model performance through transparent, reproducible, and domain-relevant evaluation.

What’s New in v0.5

ARC-AGI (with ARC Prize), plugins for external benchmarks, OpenRouter routing, code agents with Exercism, LiveMCPBench tool calling, MultiChallenge, and JSON logs. See the release notes for details.

Quick Start

Start Using openbench →

Install openbench and run your first benchmark in < 60 seconds.
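
A minimal first run looks roughly like the sketch below (assuming the CLI is installed from PyPI as openbench and a Groq API key is available; the quick-start page has the authoritative commands):

```bash
# Install the openbench CLI (pip install openbench also works)
uv tool install openbench

# Credentials for whichever provider you plan to evaluate against
export GROQ_API_KEY=your_key_here

# Run an industry-standard benchmark against a hosted model
bench eval mmlu --model groq/llama-3.3-70b
```

In recent releases, bench list prints the available evaluation suites, so the same pattern applies to any of the 30+ benchmarks.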

Key Features

Works with Any Model Provider

openbench supports 15+ model providers out of the box; switching between them is just a change to the model identifier, as sketched after the list below.

  • Groq: blazing-fast inference (groq/llama-3.3-70b)
  • OpenAI: GPT-4, o3, and more (openai/gpt-4o)
  • Anthropic: Claude Sonnet and Opus (anthropic/claude-3-5-sonnet)
  • Google: Gemini models (google/gemini-2.5-pro)
  • OpenRouter: unified LLM interface (openrouter/deepseek/deepseek-chat-v3.1)
  • 15+ more: AWS Bedrock, Azure, Cohere, Together, and more. See a complete list of supported model providers.
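
Switching providers only changes the model identifier passed on the command line. A sketch, assuming the usual provider API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY, and so on) are set in the environment:

```bash
# Same benchmark, different providers: only the --model identifier changes
bench eval mmlu --model groq/llama-3.3-70b
bench eval mmlu --model openai/gpt-4o
bench eval mmlu --model anthropic/claude-3-5-sonnet
bench eval mmlu --model google/gemini-2.5-pro
bench eval mmlu --model openrouter/deepseek/deepseek-chat-v3.1
```

The benchmark, prompts, and scoring stay identical across runs; only the backend serving the model changes, which keeps results comparable.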

Join the Community

GitHub Repository

Star us on GitHub and contribute to the project!

Report Issues

Found a bug or have a feature request? Let us know!

Stay Updated

We are rapidly iterating! Sign up below to receive updates about the latest openbench features.