Implement MCQEval
Our built-in MCQEval framework serves as a factory for multiple-choice evaluation tasks. To implement one, define two components and let openbench take care of the rest!
Component 1: Map Your Dataset to MCQ Samples
Define a function that maps each dataset record to the expected MCQSample format:
class MCQSample(Sample):
    input: str | list[ChatMessage]
    target: str  # must be a single uppercase letter (e.g. "A")

def record_to_mcq_sample(record) -> MCQSample:
    """Convert an OpenBookQA record to an openbench MCQSample."""
    question = record["question_stem"]
    options = list(record["choices"]["text"])
    prompt = create_dynamic_multiple_choice_prompt(question, options)
    return MCQSample(
        input=prompt,
        target=record["answerKey"],
        id=record.get("id"),
        metadata={"choice_labels": record["choices"]["label"]},
    )
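To make the mapping concrete without installing anything, the sketch below runs the same conversion on a hand-written record laid out like the one used above. `SketchSample` and `build_prompt` are hypothetical stand-ins for `MCQSample` and `create_dynamic_multiple_choice_prompt`; the real helper's prompt wording may well differ, and the record contents are made up for illustration.

```python
from dataclasses import dataclass, field
from string import ascii_uppercase
from typing import Any, Optional


@dataclass
class SketchSample:
    """Hypothetical stand-in for openbench's MCQSample."""

    input: str
    target: str
    id: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Mirror the documented constraint: target is a single uppercase letter.
        if len(self.target) != 1 or self.target not in ascii_uppercase:
            raise ValueError(f"target must be a single uppercase letter, got {self.target!r}")


def build_prompt(question: str, options: list[str]) -> str:
    """Hypothetical prompt builder: labels the options A, B, C, ... in order."""
    lines = [question, ""]
    lines += [f"{letter}) {text}" for letter, text in zip(ascii_uppercase, options)]
    return "\n".join(lines)


def record_to_sketch_sample(record: dict[str, Any]) -> SketchSample:
    """Same field mapping as record_to_mcq_sample, using the stand-ins above."""
    return SketchSample(
        input=build_prompt(record["question_stem"], list(record["choices"]["text"])),
        target=record["answerKey"],
        id=record.get("id"),
        metadata={"choice_labels": record["choices"]["label"]},
    )


# Made-up record; only the field layout follows the example above.
example_record = {
    "id": "example-1",
    "question_stem": "Which object is attracted to a magnet?",
    "choices": {
        "text": ["a wooden spoon", "an iron nail", "a glass cup", "a rubber band"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

sample = record_to_sketch_sample(example_record)
print(sample.input)
```

Validating the target at construction time is a design choice of this sketch: it surfaces datasets whose answer keys are numeric or lowercase before any model call is made.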
Component 2: Define Your Eval Task
Given a record_to_mcq_sample function, MCQEval will build an Inspect AI Task:
# Task and the @task decorator come from Inspect AI
from inspect_ai import Task, task

@task
def openbookqa(split: str = "validation") -> Task:
    """OpenBookQA multiple choice science question evaluation (MCQ Abstracted)."""
    valid_splits = ["train", "validation", "test"]
    if split not in valid_splits:
        raise ValueError(f"Invalid split '{split}'. Must be one of {valid_splits}")
    return MCQEval(
        # required fields
        name="openbookqa",
        dataset_path="allenai/openbookqa",
        record_to_mcq_sample=record_to_mcq_sample,
        split=split,
    )
And that’s it! After registering your new MCQ task in the registry, you can run it with bench eval!
Additional Info
MCQEval accepts additional configuration parameters:
def MCQEval(
    *,
    name: str,  # Task name
    dataset_type: str = "hf",  # "hf", "csv", or "json"
    dataset_path: str,  # Hugging Face dataset path/name
    record_to_mcq_sample,  # Function converting a raw record into an `MCQSample`
    split: str,  # HF dataset split (e.g., "train", "validation", "test")
    auto_id: bool = True,  # Auto-generate IDs for samples when true
    subset_name: Optional[str] = None,  # Dataset subset to load
    group_keys: Optional[List[str]] = None,  # Optional metadata keys to group reported metrics
    additional_metrics: Optional[List[Any]] = None,  # Optional additional metrics
    prompt_template: Optional[str] = None,  # Optional system prompt
    config: Optional[GenerateConfig] = None,  # Optional model `GenerateConfig`
    epochs: Optional[Epochs] = None,  # Optional `Epochs` to repeat samples and reduce scores
    dataset_kwargs: Optional[dict[str, Any]] = None,  # Additional dataset-specific parameters
) -> Task: ...
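Two of these parameters, group_keys and epochs, change how scores are aggregated rather than how samples are loaded. The sketch below illustrates the idea with plain Python: accuracy reported per value of a metadata key, and repeated scores for one sample collapsed into a single number. Both functions and all the data are hypothetical, and the mean is an assumed reducer; this is not openbench's or Inspect AI's implementation.

```python
from collections import defaultdict
from statistics import mean
from typing import Any


def grouped_accuracy(results: list[dict[str, Any]], group_key: str) -> dict[str, float]:
    """Accuracy per distinct value of metadata[group_key] (illustrative only)."""
    hits: dict[str, list[int]] = defaultdict(list)
    for result in results:
        group = result["metadata"].get(group_key, "unknown")
        hits[group].append(int(result["answer"] == result["target"]))
    return {group: sum(values) / len(values) for group, values in hits.items()}


def reduce_epochs(scores_by_sample: dict[str, list[float]]) -> dict[str, float]:
    """Collapse the repeated scores of each sample into one value using the mean."""
    return {sample_id: mean(scores) for sample_id, scores in scores_by_sample.items()}


# Made-up graded results: model answer, expected target, and sample metadata.
results = [
    {"answer": "A", "target": "A", "metadata": {"subject": "physics"}},
    {"answer": "B", "target": "C", "metadata": {"subject": "physics"}},
    {"answer": "D", "target": "D", "metadata": {"subject": "biology"}},
]
per_subject = grouped_accuracy(results, "subject")

# Made-up per-epoch scores: q1 was run four times, q2 twice.
per_sample = reduce_epochs({"q1": [1.0, 0.0, 1.0, 1.0], "q2": [0.0, 0.0]})
```

In this picture, a group key only makes sense if record_to_mcq_sample stores that key in each sample's metadata.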
When called, MCQEval loads the dataset of the given dataset_type (default "hf"), converting each record with the provided mapping function, e.g.,
dataset = hf_dataset(
    dataset_path,
    split=split,
    sample_fields=record_to_mcq_sample,
    auto_id=auto_id,
    name=subset_name,  # subset name
    **(dataset_kwargs or {}),
)
… defines a basic generation solver,
solver = [generate()]
if prompt_template:
    solver = [system_message(prompt_template), generate()]
… creates a dynamic MCQ scorer,
scorer = create_mcq_scorer(
    group_keys=group_keys,
    additional_metrics=additional_metrics,
)()
… and returns a packaged evaluation task.
return Task(
    name=name,
    dataset=dataset,
    solver=solver,
    scorer=scorer,
    config=config if config else GenerateConfig(),
    epochs=epochs,
)
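The control flow of these steps can be traced with a dependency-free sketch. Everything here is a stand-in: `assemble_sketch` is a hypothetical function that returns a plain dict rather than a Task, solver steps are represented by string labels, and `limit` is an assumed example of an extra loader option passed through dataset_kwargs.

```python
from typing import Any, Callable, Optional


def assemble_sketch(
    name: str,
    rows: list[dict[str, Any]],
    to_sample: Callable[[dict[str, Any]], Any],
    prompt_template: Optional[str] = None,
    dataset_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical stand-in for the assembly steps; returns a dict, not a Task."""
    # `dataset_kwargs or {}` turns None into an empty dict, so lookups here
    # (or ** unpacking, as in the real call) are always safe.
    extras = dataset_kwargs or {}
    limit = extras.get("limit")  # assumed loader option; None keeps every row
    dataset = [to_sample(row) for row in rows[:limit]]

    # A system message step is prepended only when a template is supplied.
    solver = ["generate"]
    if prompt_template:
        solver = ["system_message", "generate"]

    return {"name": name, "dataset": dataset, "solver": solver}


# Made-up rows and a trivial conversion function for demonstration.
rows = [{"q": "first"}, {"q": "second"}, {"q": "third"}]
plain = assemble_sketch("demo", rows, lambda row: row["q"].upper())
prompted = assemble_sketch(
    "demo",
    rows,
    lambda row: row["q"].upper(),
    prompt_template="Answer with a single letter.",
    dataset_kwargs={"limit": 2},
)
```

Without a prompt template the solver chain is a single generation step; with one, the system message comes first so the model sees it before the question.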