A dataset is a collection of examples. Each example has an input (what you send to the model), an expected output (what you want back), and optional metadata. Datasets are the foundation of every eval run — they determine what gets tested and what “correct” means.
The most common way to define a dataset is in a YAML file:
version: 1
name: summarization-v2
description: "Summarization eval covering science, policy, and health domains."
examples:
  - id: ex-001
    input: |
      Summarize in one sentence:
      Researchers developed a method that reduces neural network
      training compute by 40% using structured pruning.
    expected: "Researchers reduced neural network training compute by 40% with structured pruning."
    metadata:
      source: "arxiv:2024.12345"
      difficulty: easy
    tags:
      - science
      - ml
  - id: ex-002
    input: |
      Summarize in one sentence:
      The city council voted 7-2 to approve 8-story residential
      buildings downtown, reversing a 1987 height cap of 4 stories.
    expected: "The city council approved taller downtown buildings, reversing a decades-old restriction."
    tags:
      - policy
  - id: ex-003
    input: |
      Summarize in one sentence:
      A trial of 10,000 patients found the new drug cut readmissions
      by 23% with no significant adverse events.
    expected: "A large trial found the new drug reduced hospital readmissions by 23% safely."
    tags:
      - health
The top-level fields of a dataset file are:

| Field | Type | Required | Description |
|---|---|---|---|
| version | integer | yes | Schema version. Must be 1. |
| name | string | yes | Identifier for this dataset. Used in reports. |
| description | string | no | Human-readable description. |
| examples | array | yes | List of examples. |

Each entry in examples has the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique identifier within the dataset. Used in per-example results. |
| input | string | yes | The input to send to the model. Can be a prompt, a JSON payload, or any string. |
| expected | string | yes | The expected model output. Graders compare the actual output against this value. |
| metadata | object | no | Arbitrary key-value pairs attached to the example. Available to graders and in results. |
| tags | array of strings | no | Labels for filtering. Run evals only on examples with specific tags using --tags. |
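
The rules in the two tables can be checked mechanically before a run. The following is a minimal sketch, not matchspec's own validation: `Dataset`, `Example`, and `validate` are hypothetical stand-ins that mirror the documented fields, and treating an empty string as "missing" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Example and Dataset are hypothetical stand-ins that mirror the documented
// fields. They are not the matchspec types.
type Example struct {
	ID       string
	Input    string
	Expected string
	Metadata map[string]any
	Tags     []string
}

type Dataset struct {
	Version  int
	Name     string
	Examples []Example
}

// validate applies the rules from the field tables: version must be 1, name
// is required, and every example needs a unique id plus input and expected.
// Assumption: an empty string counts as a missing value.
func validate(ds Dataset) error {
	if ds.Version != 1 {
		return fmt.Errorf("version must be 1, got %d", ds.Version)
	}
	if ds.Name == "" {
		return errors.New("name is required")
	}
	seen := make(map[string]bool)
	for i, ex := range ds.Examples {
		switch {
		case ex.ID == "":
			return fmt.Errorf("example %d: id is required", i)
		case seen[ex.ID]:
			return fmt.Errorf("example %d: duplicate id %q", i, ex.ID)
		case ex.Input == "":
			return fmt.Errorf("example %q: input is required", ex.ID)
		case ex.Expected == "":
			return fmt.Errorf("example %q: expected is required", ex.ID)
		}
		seen[ex.ID] = true
	}
	return nil
}

func main() {
	ds := Dataset{Version: 1, Name: "summarization-v2", Examples: []Example{
		{ID: "ex-001", Input: "Summarize: ...", Expected: "..."},
		{ID: "ex-001", Input: "Summarize: ...", Expected: "..."},
	}}
	fmt.Println(validate(ds)) // prints: example 1: duplicate id "ex-001"
}
```

A check like this is cheap to run in CI, so a duplicated id or a missing expected value is caught before it skews per-example results.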
You can also define datasets entirely in Go:
package evals

import "github.com/greynewell/matchspec"

var SummarizationDataset = matchspec.Dataset{
	Name:        "summarization-v2",
	Description: "Summarization eval covering science, policy, and health.",
	Examples: []matchspec.Example{
		{
			ID:       "ex-001",
			Input:    "Summarize in one sentence: Researchers reduced neural network training compute by 40% using structured pruning.",
			Expected: "Researchers cut neural network training compute by 40% with structured pruning.",
			Tags:     []string{"science", "ml"},
			Metadata: map[string]any{
				"source":     "arxiv:2024.12345",
				"difficulty": "easy",
			},
		},
		{
			ID:       "ex-002",
			Input:    "Summarize in one sentence: The city council approved 8-story buildings downtown, reversing a 1987 cap.",
			Expected: "The city council approved taller buildings downtown, overturning a decades-old height limit.",
			Tags:     []string{"policy"},
		},
	},
}
Go-defined datasets can be used directly in harnesses without any file I/O:
harness := matchspec.Harness{
	Name:    "summarization-v2",
	Dataset: evals.SummarizationDataset,
	// ...
}
This is useful when you want to co-locate datasets with the Go code that uses them, or when you want dataset examples to reference non-string inputs that YAML cannot express cleanly.
When configuring harnesses via YAML, reference datasets by path:
# harness.yml
dataset: ./dataset.yml
Relative paths are resolved from the directory containing the harness file. You can also use absolute paths or paths relative to the root of the project (where matchspec.yml lives).
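
The first two resolution rules can be sketched with the standard library alone. This is an illustration, not matchspec's resolver: `resolveDatasetPath` is a hypothetical helper, and it deliberately does not cover project-root-relative paths, because how matchspec distinguishes those from harness-relative paths is not specified here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveDatasetPath is a hypothetical helper. Absolute references are
// returned cleaned; relative references are resolved against the directory
// that contains the harness file.
func resolveDatasetPath(harnessFile, datasetRef string) string {
	if filepath.IsAbs(datasetRef) {
		return filepath.Clean(datasetRef)
	}
	return filepath.Join(filepath.Dir(harnessFile), datasetRef)
}

func main() {
	// On a Unix-like system this prints evals/summarization/dataset.yml.
	fmt.Println(resolveDatasetPath("evals/summarization/harness.yml", "./dataset.yml"))
}
```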
To load a dataset from Go code:
ds, err := matchspec.LoadDatasetFile("./evals/summarization/dataset.yml")
if err != nil {
	log.Fatal(err)
}
Datasets should be versioned alongside your prompts and model configs. Best practices:
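- Use versioned names such as summarization-v1 and summarization-v2. When you add or modify examples in ways that change the comparison baseline, increment the version.
- Use go:embed to include dataset files in your binary so that CI workers don't need to fetch them:

import (
	_ "embed"

	"github.com/greynewell/matchspec"
)

// The embed path is relative to the directory of this Go source file.
//go:embed evals/summarization/dataset.yml
var datasetYAML []byte

// loadDataset parses the embedded YAML without touching the filesystem.
func loadDataset() (matchspec.Dataset, error) {
	return matchspec.ParseDatasetYAML(datasetYAML)
}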
Run a suite against a subset of examples using --tags:
# Only run examples tagged "science"
matchspec run --tags science
# Run examples tagged "science" OR "ml"
matchspec run --tags science,ml
Tag filtering applies across all harnesses in the suite. Examples without matching tags are skipped, and the pass rate is computed only over the matching examples.
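
The OR semantics and the pass-rate denominator described above can be illustrated as follows. This is a sketch under stated assumptions rather than matchspec's implementation: `Example`, `matchesAny`, and `passRate` are hypothetical names, and the pass/fail results are made up for the illustration.

```go
package main

import "fmt"

// Example is a hypothetical stand-in with only the fields this sketch reads.
type Example struct {
	ID   string
	Tags []string
}

// matchesAny reports whether the example carries at least one of the wanted
// tags, which is the OR behavior of a comma-separated --tags value.
func matchesAny(ex Example, wanted []string) bool {
	for _, t := range ex.Tags {
		for _, w := range wanted {
			if t == w {
				return true
			}
		}
	}
	return false
}

// passRate counts only examples that match the tag filter. passed maps an
// example ID to its (hypothetical) grading result.
func passRate(examples []Example, wanted []string, passed map[string]bool) (rate float64, matched int) {
	hits := 0
	for _, ex := range examples {
		if !matchesAny(ex, wanted) {
			continue // skipped examples do not enter the denominator
		}
		matched++
		if passed[ex.ID] {
			hits++
		}
	}
	if matched == 0 {
		return 0, 0
	}
	return float64(hits) / float64(matched), matched
}

func main() {
	examples := []Example{
		{ID: "ex-001", Tags: []string{"science", "ml"}},
		{ID: "ex-002", Tags: []string{"policy"}},
		{ID: "ex-003", Tags: []string{"health"}},
	}
	passed := map[string]bool{"ex-001": true, "ex-002": false, "ex-003": true}
	rate, n := passRate(examples, []string{"science", "ml"}, passed)
	fmt.Printf("%d matching, pass rate %.2f\n", n, rate) // prints: 1 matching, pass rate 1.00
}
```

Note that the failing ex-002 has no effect on the result, because it never matched the filter.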
If you have logs of real model inputs and human-labeled outputs, you can seed a dataset from them. matchspec provides a matchspec.DatasetBuilder for this pattern:
builder := matchspec.NewDatasetBuilder("production-sample-v1")

for _, logEntry := range productionLogs {
	if logEntry.HumanLabel != "" {
		builder.Add(matchspec.Example{
			ID:       logEntry.RequestID,
			Input:    logEntry.Prompt,
			Expected: logEntry.HumanLabel,
			Metadata: map[string]any{
				"timestamp":    logEntry.Timestamp,
				"user_segment": logEntry.UserSegment,
			},
		})
	}
}

dataset := builder.Build()

// Optionally write to YAML for review and version control.
if err := matchspec.WriteDatasetFile(dataset, "./evals/production-sample-v1.yml"); err != nil {
	log.Fatal(err)
}
Seeding from production logs is a powerful way to build representative datasets, but review the examples before committing them — production data may contain sensitive information or adversarial inputs that should not live in source control unmodified.
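
Part of that review can be assisted by a simple scan. The sketch below flags examples that contain email-like strings so a human can look at them first. It is an assumption-laden aid, not a matchspec feature: `Example`, `emailLike`, and `flagForReview` are hypothetical, the pattern is deliberately loose, and it does not detect other kinds of sensitive data.

```go
package main

import (
	"fmt"
	"regexp"
)

// Example is a hypothetical stand-in with only the fields this sketch reads.
type Example struct {
	ID       string
	Input    string
	Expected string
}

// emailLike is a loose pattern for email-shaped strings. It is a review aid,
// not a complete detector of sensitive data.
var emailLike = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// flagForReview returns the IDs of examples whose input or expected output
// contains an email-like string, in dataset order.
func flagForReview(examples []Example) []string {
	var ids []string
	for _, ex := range examples {
		if emailLike.MatchString(ex.Input) || emailLike.MatchString(ex.Expected) {
			ids = append(ids, ex.ID)
		}
	}
	return ids
}

func main() {
	examples := []Example{
		{ID: "req-1", Input: "Summarize: contact jane@example.com for access.", Expected: "..."},
		{ID: "req-2", Input: "Summarize: the council approved taller buildings.", Expected: "..."},
	}
	fmt.Println(flagForReview(examples)) // prints [req-1]
}
```

A scan like this narrows down where to look; it does not replace reading the examples before they are committed.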
For datasets with thousands of examples, YAML files become unwieldy. matchspec supports JSON Lines (.jsonl) format, where each line is a JSON object representing one example:
{"id":"ex-001","input":"Summarize: ...","expected":"...","tags":["science"]}
{"id":"ex-002","input":"Summarize: ...","expected":"...","tags":["policy"]}
Load a JSONL dataset the same way:
ds, err := matchspec.LoadDatasetFile("./evals/large-dataset.jsonl")
matchspec detects the format from the file extension (.yml/.yaml for YAML, .jsonl for JSON Lines).
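
To make the JSON Lines layout and the extension-based detection concrete, here is a minimal standard-library sketch. It is not matchspec's loader: `Example`, `detectFormat`, and `parseJSONL` are hypothetical, and both the case-insensitive extension match and the skipping of blank lines are assumptions.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"path/filepath"
	"strings"
)

// Example is a hypothetical stand-in that mirrors the documented fields.
type Example struct {
	ID       string         `json:"id"`
	Input    string         `json:"input"`
	Expected string         `json:"expected"`
	Metadata map[string]any `json:"metadata,omitempty"`
	Tags     []string       `json:"tags,omitempty"`
}

// detectFormat maps a file extension to a format name.
// Assumption: the match is case-insensitive.
func detectFormat(path string) (string, error) {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".yml", ".yaml":
		return "yaml", nil
	case ".jsonl":
		return "jsonl", nil
	}
	return "", fmt.Errorf("unsupported dataset extension %q", filepath.Ext(path))
}

// parseJSONL decodes one example per non-empty line. bufio.Scanner rejects
// lines longer than 64 KiB by default; call Scanner.Buffer to raise the limit.
func parseJSONL(data string) ([]Example, error) {
	var out []Example
	sc := bufio.NewScanner(strings.NewReader(data))
	line := 0
	for sc.Scan() {
		line++
		text := strings.TrimSpace(sc.Text())
		if text == "" {
			continue // assumption: blank lines are ignored
		}
		var ex Example
		if err := json.Unmarshal([]byte(text), &ex); err != nil {
			return nil, fmt.Errorf("line %d: %w", line, err)
		}
		out = append(out, ex)
	}
	return out, sc.Err()
}

func main() {
	format, err := detectFormat("./evals/large-dataset.jsonl")
	if err != nil {
		panic(err)
	}
	data := `{"id":"ex-001","input":"Summarize: ...","expected":"...","tags":["science"]}
{"id":"ex-002","input":"Summarize: ...","expected":"...","tags":["policy"]}`
	examples, err := parseJSONL(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(format, len(examples), examples[1].Tags) // prints: jsonl 2 [policy]
}
```

Because each line is decoded on its own, a malformed record can be reported with its line number, which is harder to do with one large YAML document.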