# Run Your First Eval Suite
This tutorial walks through building a working eval suite from scratch: install matchspec, write a dataset, add a grader, run the suite, and interpret the output. You’ll end with a project you can build on.
By the end you’ll have:
- A `matchspec.yml` config file
- A 5-example dataset in YAML
- A harness with an `exact_match` grader
- A passing eval run
## Prerequisites
- Go 1.21 or later installed
- A terminal
You do not need an API key or external service for this tutorial — we’ll use a stub model.
## Create a project directory

```bash
mkdir my-eval-project && cd my-eval-project
go mod init my-eval-project
```
Install matchspec:
```bash
go get github.com/greynewell/matchspec
go install github.com/greynewell/matchspec/cmd/matchspec@latest
```
Verify:
```bash
matchspec --version
# matchspec v0.3.1
```
## Write a dataset
Create the directory structure:
```bash
mkdir -p evals/qa
```
Create `evals/qa/dataset.yml`:

```yaml
version: 1
name: qa-basic
description: "Simple factual Q&A dataset for testing."
examples:
  - id: q1
    input: "What is the capital of France?"
    expected: "Paris"
    tags: [geography]
  - id: q2
    input: "What is the chemical symbol for water?"
    expected: "H2O"
    tags: [science]
  - id: q3
    input: "How many days are in a standard year?"
    expected: "365"
    tags: [general]
  - id: q4
    input: "What programming language was Go created at?"
    expected: "Google"
    tags: [technology]
  - id: q5
    input: "What is the square root of 144?"
    expected: "12"
    tags: [math]
```
Each example has:
- `id` — unique identifier; appears in reports when an example fails
- `input` — what you send to the model
- `expected` — the correct answer you want the model to produce
- `tags` — optional labels for filtering
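To see how a runner might represent these fields internally, here is an illustrative Go sketch — the struct and validation rules below are assumptions for teaching purposes, not matchspec’s actual types:

```go
package main

import "fmt"

// Example mirrors the fields of a dataset entry. This is an illustrative
// sketch of how a runner might model examples, not matchspec's internals.
type Example struct {
	ID       string   // unique identifier, surfaced in failure reports
	Input    string   // what gets sent to the model
	Expected string   // the answer the grader compares against
	Tags     []string // optional labels for filtering
}

// validate enforces the two properties the tutorial relies on:
// every example has an id and an expected answer, and ids are unique.
func validate(examples []Example) error {
	seen := map[string]bool{}
	for _, ex := range examples {
		if ex.ID == "" || ex.Expected == "" {
			return fmt.Errorf("example %q: id and expected are required", ex.ID)
		}
		if seen[ex.ID] {
			return fmt.Errorf("duplicate id: %s", ex.ID)
		}
		seen[ex.ID] = true
	}
	return nil
}

func main() {
	examples := []Example{
		{ID: "q1", Input: "What is the capital of France?", Expected: "Paris", Tags: []string{"geography"}},
		{ID: "q2", Input: "What is the chemical symbol for water?", Expected: "H2O", Tags: []string{"science"}},
	}
	fmt.Println(validate(examples)) // <nil> for a well-formed dataset
}
```

Unique `id`s matter because they are the only stable handle a failure report has for pointing you back at a specific example.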
## Write a harness config
Create `evals/qa/harness.yml`:

```yaml
version: 1
name: qa-basic
description: "Factual QA with exact match grader."
dataset: ./dataset.yml
model:
  type: command
  command: ["go", "run", "../../cmd/stub-model/main.go"]
  input_via: stdin
graders:
  - type: exact_match
    name: exact_match
    threshold: 0.80
    config:
      trim_whitespace: true
      case_sensitive: false
```
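The `trim_whitespace` and `case_sensitive` options imply a normalization step before comparison. A minimal sketch of how such a grader could score one example — this is an assumed reading of those options, not matchspec’s actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch returns 1.0 on a match and 0.0 otherwise, applying the
// normalization options the harness config exposes. Assumed behavior,
// not matchspec's actual code.
func exactMatch(output, expected string, trimWhitespace, caseSensitive bool) float64 {
	if trimWhitespace {
		output = strings.TrimSpace(output)
		expected = strings.TrimSpace(expected)
	}
	if !caseSensitive {
		output = strings.ToLower(output)
		expected = strings.ToLower(expected)
	}
	if output == expected {
		return 1.0
	}
	return 0.0
}

func main() {
	fmt.Println(exactMatch("  paris\n", "Paris", true, false)) // 1
	fmt.Println(exactMatch("H2O", "H20", true, false))         // 0
}
```

With `case_sensitive: false`, a model that answers `paris` still scores 1.0 — a sensible default for factual QA where capitalization is noise.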
We’re using a `command` model type that runs a local Go program. Let’s create that stub:

```bash
mkdir -p cmd/stub-model
```

Create `cmd/stub-model/main.go`:
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Stub model: looks up known answers, returns a wrong answer for anything else.
var answers = map[string]string{
	"what is the capital of france?":               "Paris",
	"what is the chemical symbol for water?":       "H2O",
	"how many days are in a standard year?":        "365",
	"what programming language was go created at?": "Google",
	"what is the square root of 144?":              "12",
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	input := strings.TrimSpace(strings.Join(lines, "\n"))
	key := strings.ToLower(input)
	if answer, ok := answers[key]; ok {
		fmt.Print(answer)
	} else {
		// Unknown question — return something wrong.
		fmt.Print("I don't know")
	}
}
```
This is a deterministic stub. In a real project, the command model would call a script that invokes an actual LLM.
You can also use `model: {type: http}` to call an OpenAI-compatible endpoint. The `command` model is just the easiest way to experiment without an API key.
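For reference, an HTTP model block might look something like the following sketch — the field names (`url`, `headers`) are assumptions here, so check the matchspec reference for the real schema before using them:

```yaml
model:
  type: http
  # Field names below are illustrative, not matchspec's documented schema.
  url: https://api.example.com/v1/chat/completions
  headers:
    Authorization: "Bearer ${OPENAI_API_KEY}"
```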
## Create matchspec.yml

Create `matchspec.yml` in the project root:
```yaml
version: 1
suites:
  - name: qa
    description: "Basic factual QA eval suite."
    harnesses:
      - ./evals/qa/harness.yml
    thresholds:
      overall: 0.80
```
Your directory structure should now look like this:
```
my-eval-project/
├── go.mod
├── matchspec.yml
├── cmd/
│   └── stub-model/
│       └── main.go
└── evals/
    └── qa/
        ├── dataset.yml
        └── harness.yml
```
## Run the eval suite

```bash
matchspec run
```
You should see:
```
loading suite: qa
loading harness: qa-basic (5 examples)
running model (command): go run cmd/stub-model/main.go
scoring with: exact_match

suite: qa
──────────────────────────────
exact_match  1.00  ✓  (≥0.80)
──────────────────────────────
overall      PASS

results written to: .matchspec/results/qa-20260315-143022.json
```
All 5 examples passed because our stub model returns the correct answer for every known question.
Check the exit code:
```bash
echo $?
# 0
```
Exit code 0 means all thresholds passed. This is what CI systems check.
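Because the exit code carries the pass/fail signal, wiring this into CI needs no extra parsing. A minimal GitHub Actions sketch — the job layout and install step here are illustrative; see the CI/CD Integration guide for the recommended setup:

```yaml
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.21"
      - run: go install github.com/greynewell/matchspec/cmd/matchspec@latest
      - run: matchspec run  # a non-zero exit code fails the job
```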
## See a failure
Let’s make the suite fail so you can see what that looks like. Edit `evals/qa/harness.yml` and raise the threshold above our current score:
```yaml
graders:
  - type: exact_match
    name: exact_match
    threshold: 0.99  # changed from 0.80 to 0.99
```
Now add a hard question to `evals/qa/dataset.yml` that the stub won’t know:
```yaml
  - id: q6
    input: "Who wrote the Iliad?"
    expected: "Homer"
    tags: [literature]
```
Run again:
```bash
matchspec run
```
Output:
```
suite: qa
──────────────────────────────────────────
exact_match  0.83  ✗  (≥0.99)  DELTA: -0.16
──────────────────────────────────────────
overall      FAIL

Failed graders: exact_match
  Pass rate 0.83 is below threshold 0.99 (delta: -0.16).

Failing examples:
  q6: expected "Homer", got "I don't know"
```
```bash
echo $?
# 1
```
Exit code 1 means at least one threshold failed. Now you can see the structure of a failing report.
Reset the threshold back to 0.80 before continuing to see passing behavior.
## Run with verbose output

Use `--verbose` to see per-example scores during the run:

```bash
matchspec run --verbose
```

```
[qa-basic] q1: exact_match=1.00 ✓
[qa-basic] q2: exact_match=1.00 ✓
[qa-basic] q3: exact_match=1.00 ✓
[qa-basic] q4: exact_match=1.00 ✓
[qa-basic] q5: exact_match=1.00 ✓

suite: qa
──────────────────────────────
exact_match  1.00  ✓  (≥0.80)
──────────────────────────────
overall      PASS
```
Verbose mode is useful during development to spot which examples are failing before the aggregate report.
## What you built
You created a complete eval pipeline with:
- A versioned YAML dataset with 5 labeled examples
- A harness that wires the dataset to a model and a grader
- A suite config with a pass/fail threshold
- A working `matchspec run` command that exits 0 on pass, 1 on fail
## Next steps
- Replace the stub model with a real LLM using `model.type: http`
- Add a `semantic_similarity` grader for more forgiving scoring — see Graders
- Set up a GitHub Actions workflow to gate PRs on eval results — see CI/CD Integration
- Try the Gate Deployments with matchspec tutorial