Running evals in CI is the core use case for matchspec. This page covers GitHub Actions and GitLab CI workflows, dataset caching, PR comment reporting, and badge generation.
This workflow runs matchspec run on every push to main and every pull request. It fails the build if any suite falls below threshold.
# .github/workflows/eval.yml
name: Eval Gate
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version: "1.22"
- name: Cache Go modules
uses: actions/cache@v4
with:
path: ~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: |
${{ runner.os }}-go-
- name: Install matchspec
run: go install github.com/greynewell/matchspec/cmd/matchspec@latest
- name: Run evals
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: matchspec run --output junit --output-dir ./test-results
- name: Publish test results
uses: actions/upload-artifact@v4
if: always()
with:
name: eval-results
path: ./test-results/
The JUnit output format (--output junit) lets GitHub Actions display pass/fail results as test annotations on the PR.
This more complete workflow posts a markdown summary of eval results as a PR comment when the suite fails:
# .github/workflows/eval-with-comment.yml
name: Eval Gate with PR Comment
on:
pull_request:
branches: [main]
permissions:
pull-requests: write
contents: read
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version: "1.22"
- name: Cache Go modules
uses: actions/cache@v4
with:
path: ~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
- name: Install matchspec
run: go install github.com/greynewell/matchspec/cmd/matchspec@latest
- name: Run evals
id: eval
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
matchspec run \
--output markdown \
--output-dir ./eval-results \
--no-write=false
echo "exit_code=$?" >> $GITHUB_OUTPUT
continue-on-error: true
- name: Generate markdown report
id: report
run: |
REPORT=$(matchspec report \
--format markdown \
$(ls ./eval-results/*.json | head -1))
echo "report<<EOF" >> $GITHUB_OUTPUT
echo "$REPORT" >> $GITHUB_OUTPUT
echo "EOF" >> $GITHUB_OUTPUT
- name: Comment on PR
uses: actions/github-script@v7
if: github.event_name == 'pull_request'
with:
script: |
const report = `${{ steps.report.outputs.report }}`;
const body = `## Eval Results\n\n${report}\n\n*Run by [matchspec](https://miststack.dev/matchspec/)*`;
// Find existing comment to update
const comments = await github.rest.issues.listComments({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
});
const existing = comments.data.find(c =>
c.body.includes('## Eval Results') &&
c.user.type === 'Bot'
);
if (existing) {
await github.rest.issues.updateComment({
owner: context.repo.owner,
repo: context.repo.repo,
comment_id: existing.id,
body,
});
} else {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body,
});
}
- name: Fail if evals failed
if: steps.eval.outputs.exit_code != '0'
run: exit 1
For large datasets, cache the downloaded data between workflow runs to avoid re-fetching on every CI run:
- name: Cache eval datasets
uses: actions/cache@v4
with:
path: ./evals/data
key: eval-datasets-${{ hashFiles('evals/**/*.yml') }}
restore-keys: |
eval-datasets-
If your datasets are embedded in the repository (checked in as YAML), caching is not needed — they’re available as part of the checkout. Caching is most useful when you seed datasets from external sources or use large JSONL files not committed to the repo.
Run evals nightly against the production model to catch regressions before they surface in PRs:
on:
schedule:
- cron: "0 2 * * *" # 2am UTC daily
workflow_dispatch: # allow manual trigger
Use tags to run a fast smoke check on every PR and a full eval only on merges to main:
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
smoke:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version: "1.22"
- run: go install github.com/greynewell/matchspec/cmd/matchspec@latest
- run: matchspec run --tags smoke
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
full-eval:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version: "1.22"
- run: go install github.com/greynewell/matchspec/cmd/matchspec@latest
- run: matchspec run --suite production-gate
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
# .gitlab-ci.yml
stages:
- eval
eval-gate:
stage: eval
image: golang:1.22-alpine
variables:
GOPATH: $CI_PROJECT_DIR/.go
cache:
key: go-modules
paths:
- .go/pkg/mod/
before_script:
- go install github.com/greynewell/matchspec/cmd/matchspec@latest
- export PATH="$PATH:$GOPATH/bin"
script:
- matchspec run --output junit --output-dir ./test-results
artifacts:
when: always
reports:
junit: test-results/*.xml
paths:
- test-results/
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
- if: '$CI_COMMIT_BRANCH == "main"'
Run matchspec as a Docker container in any CI environment:
# Dockerfile.eval
FROM golang:1.22-alpine AS builder
RUN go install github.com/greynewell/matchspec/cmd/matchspec@latest
FROM alpine:3.19
COPY --from=builder /root/go/bin/matchspec /usr/local/bin/matchspec
WORKDIR /eval
ENTRYPOINT ["matchspec"]
Use in CI:
docker build -f Dockerfile.eval -t matchspec:latest .
docker run --rm \
-v $(pwd):/eval \
-e OPENAI_API_KEY="$OPENAI_API_KEY" \
matchspec:latest run
matchspec generates a status badge SVG when you pass --badge:
matchspec run --badge ./public/eval-badge.svg
The badge shows the overall verdict (PASS/FAIL) and the lowest grader score. Host it from your project’s static assets and embed it in your README:

Model API keys must be available at eval run time. In GitHub Actions:
Settings → Secrets and variables → Actions → New repository secret${{ secrets.OPENAI_API_KEY }}matchspec runNever embed API keys in matchspec.yml or commit them to source control. The api_key_env field in harness configs reads the key from the environment at runtime.
text-embedding-3-small is 5x cheaper than text-embedding-3-large with minimal quality difference for most eval tasks.tags: [smoke] and run the smoke set on every PR, full set only on main.actions/cache step for ~/go/pkg/mod cuts install time from ~30s to ~5s on repeated runs.