A grader takes a model output and an expected output and returns a score between 0.0 and 1.0. Scores are aggregated across all examples in a dataset to produce a pass rate, which is compared against a threshold to determine pass or fail.
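As a rough sketch of that aggregation (assuming the pass rate is the mean of per-example scores; the names here are illustrative, not matchspec's API):

```go
// Illustrative only — not matchspec's API. Assumes the pass rate is the
// mean of per-example scores, compared against a suite-level threshold.
func suitePasses(scores []float64, threshold float64) bool {
    if len(scores) == 0 {
        return false
    }
    sum := 0.0
    for _, s := range scores {
        sum += s
    }
    return sum/float64(len(scores)) >= threshold
}
```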
matchspec ships with five built-in grader types. You can also implement the Grader interface to write your own.
exact_match returns 1.0 if the model output exactly equals the expected output, and 0.0 otherwise. Case-sensitive by default.
```yaml
graders:
  - type: exact_match
    name: exact_match
    threshold: 0.90
    config:
      case_sensitive: true   # default: true
      trim_whitespace: true  # default: true
```
Use exact_match when model outputs should be deterministic and precisely correct — for classification labels, structured codes, or short factual answers. It is a high bar, and a pass rate of 0.90 with exact_match is a strong signal.
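As a sketch of the comparison this grader performs (the normalization order here, trim then case-fold, is an assumption rather than a guarantee about matchspec's internals):

```go
// Sketch of exact_match's comparison. Assumes whitespace is trimmed
// before the (optionally case-insensitive) equality check.
func exactMatch(expected, output string, caseSensitive, trimWhitespace bool) float64 {
    if trimWhitespace {
        expected = strings.TrimSpace(expected)
        output = strings.TrimSpace(output)
    }
    if !caseSensitive {
        expected = strings.ToLower(expected)
        output = strings.ToLower(output)
    }
    if expected == output {
        return 1.0
    }
    return 0.0
}
```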
contains returns 1.0 if the model output contains the expected string as a substring, and 0.0 otherwise.
```yaml
graders:
  - type: contains
    name: contains_answer
    threshold: 0.85
    config:
      case_sensitive: false
```
Useful when you care that the answer is present but the model is allowed to include surrounding explanation. For example, if the expected output is "Paris" and the model outputs "The capital of France is Paris.", contains would score this 1.0 while exact_match would score it 0.0.
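A sketch of the check with case_sensitive: false (illustrative; the built-in grader's internals may differ):

```go
// Sketch of the contains check. With case_sensitive: false, both sides
// are lowercased before the substring test.
func containsMatch(expected, output string, caseSensitive bool) float64 {
    if !caseSensitive {
        expected = strings.ToLower(expected)
        output = strings.ToLower(output)
    }
    if strings.Contains(output, expected) {
        return 1.0
    }
    return 0.0
}
```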
regex evaluates the model output against a regular expression. Returns 1.0 if the output matches the pattern, 0.0 otherwise.
```yaml
graders:
  - type: regex
    name: json_format
    threshold: 0.95
    config:
      pattern: '^\{.*\}$'
      flags: s  # s = dot matches newline; other flags: i (case insensitive), m (multiline)
```
Use regex to validate structural properties of outputs — that the model returned valid JSON, that a phone number is in the right format, that a response starts with a capital letter.
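A sketch of how the pattern and flags plausibly combine, assuming the YAML flags map onto Go's inline flag syntax (so flags: s becomes a (?s) prefix):

```go
// Sketch of regex grading using Go's regexp package. Assumes YAML flags
// are translated to inline flags, e.g. "s" -> "(?s)".
func regexMatch(pattern, flags, output string) (float64, error) {
    if flags != "" {
        pattern = "(?" + flags + ")" + pattern
    }
    re, err := regexp.Compile(pattern)
    if err != nil {
        return 0, err
    }
    if re.MatchString(output) {
        return 1.0, nil
    }
    return 0, nil
}
```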
You can reference the expected field in the pattern using {{expected}}:
```yaml
graders:
  - type: regex
    name: contains_citation
    config:
      pattern: '(?i){{expected}}'  # case-insensitive contains using the expected value as pattern
```
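The substitution itself is plausibly a plain string replacement (an assumption; note that if the expected value can contain regex metacharacters, you would escape it with regexp.QuoteMeta before substituting):

```go
// Sketch of {{expected}} expansion as a plain string replacement.
// Apply regexp.QuoteMeta(expected) first if the value should match literally.
func expandPattern(pattern, expected string) string {
    return strings.ReplaceAll(pattern, "{{expected}}", expected)
}
```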
semantic_similarity computes the cosine similarity between the embedding of the model output and the embedding of the expected output, returning a score between 0.0 and 1.0, where 1.0 means identical vectors.
```yaml
graders:
  - type: semantic_similarity
    name: semantic_similarity
    threshold: 0.82
    config:
      embedding_endpoint: "https://api.openai.com/v1/embeddings"
      model: "text-embedding-3-small"
      api_key_env: "OPENAI_API_KEY"  # reads from environment variable
      batch_size: 32                 # examples per embedding request (default: 32)
      timeout_seconds: 30            # per-request timeout (default: 30)
```
semantic_similarity is more forgiving than exact_match or contains — it captures whether the meaning is similar, not just whether the strings match. Use it for summarization, paraphrase, and open-ended generation tasks where multiple phrasings of the correct answer are acceptable.
Threshold guidance: similarity distributions vary by embedding model, so calibrate the threshold against a labeled sample of your own data rather than reusing a fixed value across models.
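For reference, the underlying similarity computation looks like this (a minimal sketch of the math only; the grader itself fetches embeddings from the configured endpoint, and since raw cosine similarity can be negative, presumably clamps the result to 0.0–1.0):

```go
// Cosine similarity of two embedding vectors: dot(a, b) / (|a| * |b|).
// Sketch only — fetching embeddings and clamping are the grader's job.
func cosineSimilarity(a, b []float64) float64 {
    if len(a) != len(b) || len(a) == 0 {
        return 0
    }
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
```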
llm_judge calls a language model to evaluate the output against a rubric you define. Returns the score reported by the judge model, normalized to 0.0–1.0.
```yaml
graders:
  - type: llm_judge
    name: helpfulness
    threshold: 0.75
    config:
      endpoint: "https://api.openai.com/v1/chat/completions"
      model: "gpt-4o"
      api_key_env: "OPENAI_API_KEY"
      prompt_template: |
        You are an evaluator. Score the following response on helpfulness.
        Question: {{input}}
        Expected answer: {{expected}}
        Model response: {{output}}
        Score from 0 to 10, where 10 is perfectly helpful. Reply with only the number.
      score_parser: integer_0_10  # built-in parser; or "float_0_1", or "custom"
      timeout_seconds: 60
```
The prompt template has access to three variables:
- {{input}} — the original input sent to the model under test
- {{expected}} — the expected output from the dataset
- {{output}} — the actual model output being graded

Built-in score parsers:

- integer_0_10 — parses an integer 0–10 and divides by 10
- integer_0_5 — parses an integer 0–5 and divides by 5
- float_0_1 — parses a float between 0 and 1

llm_judge is powerful but expensive and slow. Use it selectively on the subset of examples where cheaper graders are insufficient.
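As an illustration of what a parser like integer_0_10 has to do (a sketch, assuming the judge replies with just the number as instructed):

```go
// Sketch of the integer_0_10 parser: read an integer 0–10 from the
// judge's reply and normalize it to 0.0–1.0. Illustrative only.
func parseInteger0to10(reply string) (float64, error) {
    n, err := strconv.Atoi(strings.TrimSpace(reply))
    if err != nil {
        return 0, fmt.Errorf("judge reply %q is not an integer: %w", reply, err)
    }
    if n < 0 || n > 10 {
        return 0, fmt.Errorf("judge score %d out of range 0-10", n)
    }
    return float64(n) / 10.0, nil
}
```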
Run multiple graders on the same output and combine their scores with weights:
```yaml
graders:
  - type: semantic_similarity
    name: semantic_similarity
    weight: 0.7
    threshold: 0.80
  - type: exact_match
    name: exact_match
    weight: 0.3
    threshold: 0.60
```
When weights are specified, the overall score for each example is a weighted average of the individual grader scores. The per-grader thresholds still apply — an example can pass the overall threshold while failing one grader’s individual threshold.
If no weights are specified, all graders are treated as independent pass/fail checks and each must independently meet its threshold.
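The weighted combination for one example plausibly reduces to a weight-normalized average (a sketch, not matchspec's exact code; with the weights above, a semantic score of 0.9 and an exact-match score of 0.0 combine to 0.63):

```go
// Sketch of weighted score composition: a weight-normalized average of
// per-grader scores for a single example. Assumes len(scores) == len(weights).
func weightedScore(scores, weights []float64) float64 {
    var sum, total float64
    for i := range scores {
        sum += scores[i] * weights[i]
        total += weights[i]
    }
    if total == 0 {
        return 0
    }
    return sum / total
}
```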
Implement the Grader interface to write a grader in Go:
```go
// Grader scores a single model output.
type Grader interface {
    Name() string
    Score(ctx context.Context, input, expected, output string) (Score, error)
}

// Score is the result of grading one example.
type Score struct {
    Value    float64        // 0.0–1.0
    Passed   bool           // true if Value >= threshold
    Metadata map[string]any // optional: attach reasoning, debug info, etc.
}
```
A complete custom grader example:
```go
package graders

import (
    "context"
    "strings"

    "github.com/greynewell/matchspec"
)

// WordOverlapGrader scores by the fraction of expected words present in the output.
type WordOverlapGrader struct {
    threshold float64
}

func NewWordOverlapGrader(threshold float64) *WordOverlapGrader {
    return &WordOverlapGrader{threshold: threshold}
}

func (g *WordOverlapGrader) Name() string { return "word_overlap" }

func (g *WordOverlapGrader) Score(ctx context.Context, input, expected, output string) (matchspec.Score, error) {
    expectedWords := tokenize(expected)
    outputWords := tokenize(output)
    if len(expectedWords) == 0 {
        return matchspec.Score{Value: 1.0, Passed: true}, nil
    }
    outputSet := make(map[string]bool, len(outputWords))
    for _, w := range outputWords {
        outputSet[w] = true
    }
    matches := 0
    for _, w := range expectedWords {
        if outputSet[w] {
            matches++
        }
    }
    score := float64(matches) / float64(len(expectedWords))
    return matchspec.Score{
        Value:  score,
        Passed: score >= g.threshold,
        Metadata: map[string]any{
            "expected_words": len(expectedWords),
            "matched_words":  matches,
        },
    }, nil
}

// tokenize is a simple lowercase word tokenizer.
func tokenize(s string) []string {
    return strings.Fields(strings.ToLower(s))
}
```
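To sanity-check the grader outside a harness run, for example from a scratch main package (assuming the usual context, fmt, and log imports):

```go
func main() {
    g := NewWordOverlapGrader(0.70)
    s, err := g.Score(context.Background(), "", "the capital of France", "Paris is the capital of France")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("value=%.2f passed=%v\n", s.Value, s.Passed) // value=1.00 passed=true
}
```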
Register your custom grader for use in YAML harness configs:
```go
func init() {
    matchspec.RegisterGrader("word_overlap", func(config map[string]any) (matchspec.Grader, error) {
        threshold, _ := config["threshold"].(float64)
        if threshold == 0 {
            threshold = 0.70 // default when the config omits a threshold
        }
        return NewWordOverlapGrader(threshold), nil
    })
}
```
After registration, you can reference the grader type in YAML:
```yaml
graders:
  - type: word_overlap
    name: word_overlap
    threshold: 0.75
    config:
      threshold: 0.75  # read by the factory function registered above
```
For more on custom graders, including stateful graders and testing, see Custom Graders.
All grader configurations support these common fields:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | yes | Grader type identifier. |
| name | string | yes | Display name used in reports. Must be unique within a harness. |
| threshold | float | no | Per-grader pass threshold (0.0–1.0). Overrides the suite-level threshold for this grader. |
| weight | float | no | Weight for weighted composition. If any grader has a weight, all must have weights. |
| config | object | no | Grader-specific configuration. See the per-grader sections above. |