tokentrace is a token-level observability library for AI inference systems. It gives you structured visibility into every inference call: what model ran, how many tokens it consumed, what it cost, how long it took, and — when combined with eval scores from matchspec — how good the output was. Metrics are aggregated continuously and exposed via an HTTP API and Prometheus endpoint. Alert rules fire when cost, latency, or quality moves outside acceptable bounds.
Standard application performance monitoring tools track three signals: availability (is the endpoint up?), latency (how long does it take?), and error rate (what fraction of requests return non-2xx?). These signals were designed for deterministic systems where an HTTP 200 means the function ran correctly.
Language models break this assumption. A model call can return HTTP 200 in 400 ms with a hallucinated citation, a misunderstood instruction, or a response that confidently answers the wrong question. None of that is visible in your existing dashboards. Your error rate is 0%. Your latency looks fine. Your users are getting bad answers.
The gap is structural, not a tooling shortcoming. Traditional APM has no concept of output quality because traditional APIs produce deterministic outputs — a /users/{id} endpoint either returns the right user record or it doesn’t, and you can verify that in a test. LLMs produce probabilistic outputs. What “correct” means depends on context, and correctness can degrade silently as models are updated, prompts drift, or input distributions shift.
tokentrace addresses this in three ways:
Structured token traces — Every inference call emits a span with token counts, cost, and latency as first-class fields, not buried in log strings. This makes cost and latency directly queryable.
Custom attributes — Spans accept arbitrary key/value metadata. Attach an eval score from matchspec, a session ID, a workflow name, or a content category. Any attribute can be aggregated into a metric or used in an alert rule.
Quality metrics — By attaching eval scores to spans, quality becomes a time-series metric like any other. You can track quality over time, correlate it with model version changes, and alert on drops.
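The three ideas above can be sketched with a minimal span shape. Everything here (the `Span` struct, its field names, `WithAttr`) is an illustrative assumption, not tokentrace's actual API: it only shows token counts, cost, and latency as first-class fields plus arbitrary key/value attributes.

```go
package main

import "fmt"

// Hypothetical span shape for illustration; the real tokentrace API may differ.
// Token counts, cost, and latency are first-class fields, not log strings.
type Span struct {
	Model        string
	InputTokens  int
	OutputTokens int
	CostUSD      float64
	LatencyMS    int
	Attrs        map[string]any // arbitrary key/value metadata
}

// WithAttr attaches custom metadata, e.g. an eval score or session ID.
func (s *Span) WithAttr(key string, val any) *Span {
	if s.Attrs == nil {
		s.Attrs = map[string]any{}
	}
	s.Attrs[key] = val
	return s
}

func main() {
	span := &Span{Model: "gpt-4o", InputTokens: 812, OutputTokens: 164, CostUSD: 0.0041, LatencyMS: 930}
	span.WithAttr("eval_score", 0.92).WithAttr("session_id", "abc-123")
	fmt.Println("eval_score:", span.Attrs["eval_score"])
}
```

Because the eval score travels on the span as a plain attribute, it can later be aggregated into a quality metric like any numeric field.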
A Span represents a single inference call. Every span records:
Model — the model identifier (e.g. gpt-4o, claude-3-5-sonnet-20241022)
Provider — the inference provider (openai, anthropic, bedrock)
Status — ok, error, or timeout; defaults to ok
Trace ID — assigned by tokentrace.Start(), for correlating spans in a multi-step workflow
Token counts, cost, and latency — recorded as first-class fields

tokentrace has three layers:
Span → Transport → Sink
A Span is created when you call trace.Record(). The Tracer passes it to the configured Transport, which serializes it and delivers it to a sink. The sink might be a local file, an HTTP endpoint, stdout, or a combination of all three via MultiTransport.
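The Span → Transport → Sink pipeline can be sketched as below. The interface and type names here are assumptions for illustration, not tokentrace's exact signatures; the point is that a sink is anything satisfying a small delivery interface, and MultiTransport fans each span out to several of them.

```go
package main

import "fmt"

// Illustrative span payload; the real struct carries more fields.
type Span struct {
	Model   string
	CostUSD float64
}

// Transport delivers a serialized span to a sink (file, HTTP, stdout, ...).
type Transport interface {
	Deliver(s Span) error
}

// StdoutTransport writes spans to standard output.
type StdoutTransport struct{}

func (StdoutTransport) Deliver(s Span) error {
	fmt.Printf("span model=%s cost=%.4f\n", s.Model, s.CostUSD)
	return nil
}

// MultiTransport fans one span out to every configured transport.
type MultiTransport []Transport

func (m MultiTransport) Deliver(s Span) error {
	for _, t := range m {
		if err := t.Deliver(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	mt := MultiTransport{StdoutTransport{}, StdoutTransport{}}
	_ = mt.Deliver(Span{Model: "gpt-4o", CostUSD: 0.0041})
}
```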
Separately, a metrics aggregator consumes spans in the background. It maintains running counters and histograms for all built-in metrics (cost, tokens, latency, quality score, error rate). The HTTP API and Prometheus endpoint read from this aggregator.
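A running-counter aggregator of this kind can be approximated in a few lines. This is a sketch under assumed names (`Aggregator`, `Observe`, `P`), not the library's implementation; it keeps a counter, a cost total, and a latency sample buffer, and answers nearest-rank percentile queries.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Illustrative in-process aggregator: running counters plus a latency
// sample buffer that percentile queries read from.
type Aggregator struct {
	mu        sync.Mutex
	spanCount int
	totalCost float64
	latencies []float64 // milliseconds
}

// Observe folds one span's cost and latency into the running state.
func (a *Aggregator) Observe(costUSD, latencyMS float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.spanCount++
	a.totalCost += costUSD
	a.latencies = append(a.latencies, latencyMS)
}

// P returns the q-th percentile latency (nearest-rank on a sorted copy).
func (a *Aggregator) P(q float64) float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(a.latencies) == 0 {
		return 0
	}
	s := append([]float64(nil), a.latencies...)
	sort.Float64s(s)
	idx := int(q * float64(len(s)))
	if idx >= len(s) {
		idx = len(s) - 1
	}
	return s[idx]
}

func main() {
	var agg Aggregator
	for _, l := range []float64{120, 250, 900, 310, 205} {
		agg.Observe(0.002, l)
	}
	fmt.Printf("spans=%d cost=%.3f p95=%.0fms\n", agg.spanCount, agg.totalCost, agg.P(0.95))
}
```

A production aggregator would use fixed-size histograms rather than an unbounded sample slice, but the read path for the HTTP API and Prometheus endpoint is the same shape: lock, read the running state, answer.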
Alert rules run on a timer against the metrics aggregator. When a rule fires, the configured delivery channel (HTTP webhook or stdout) receives an alert payload.
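The rule check itself reduces to a threshold comparison; the timer loop simply re-runs it against the aggregator's current values. The `Rule` shape below is an assumption for this sketch, not tokentrace's rule schema.

```go
package main

import "fmt"

// Illustrative alert rule: fires when a metric crosses a threshold.
// In the library this check runs on a timer against the aggregator;
// here it is shown as a pure function of the current metric value.
type Rule struct {
	Metric    string
	Threshold float64
	Above     bool // fire when value > Threshold; otherwise when value < Threshold
}

func (r Rule) Fires(value float64) bool {
	if r.Above {
		return value > r.Threshold
	}
	return value < r.Threshold
}

func main() {
	costRule := Rule{Metric: "cost_usd_per_hour", Threshold: 5.0, Above: true}
	qualityRule := Rule{Metric: "eval_score_p50", Threshold: 0.8, Above: false}
	fmt.Println(costRule.Fires(7.2))     // spend above budget: fires
	fmt.Println(qualityRule.Fires(0.91)) // quality still fine: does not fire
}
```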
Your Go code
     │
     ▼
tokentrace.Start()
     │
     ▼
Span{...} ──► Transport ──► File / HTTP / Stdout
     │
     ▼
Aggregator ──► Metrics API ──► Prometheus / Grafana
     │
     ▼
Alert engine ──► Webhook / Stdout
The aggregator runs in-process: there is no background daemon, no sidecar, and no separate collector to deploy. For high-throughput production systems, the HTTP transport batches spans and delivers them asynchronously; in that configuration, metrics aggregation and alert evaluation happen on the receiving end rather than in the application process.
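The size-triggered half of that batching behavior can be sketched as below. This is an assumption-laden toy (`Batcher`, string spans, no timer flush, no async delivery), shown only to make the buffering trigger concrete.

```go
package main

import "fmt"

// Illustrative batching: buffer spans and flush when the batch is full.
// A real HTTP transport would also flush on a timer and deliver
// asynchronously with retries; this sketch shows only the size trigger.
type Batcher struct {
	size  int
	buf   []string
	flush func(batch []string)
}

// Add buffers one span and flushes once the batch reaches its size.
func (b *Batcher) Add(span string) {
	b.buf = append(b.buf, span)
	if len(b.buf) >= b.size {
		b.flush(b.buf)
		b.buf = nil
	}
}

func main() {
	batches := 0
	b := &Batcher{size: 3, flush: func(batch []string) {
		batches++
		fmt.Printf("flushed batch of %d\n", len(batch))
	}}
	for i := 0; i < 7; i++ {
		b.Add(fmt.Sprintf("span-%d", i))
	}
	fmt.Println("batches:", batches, "pending:", len(b.buf))
}
```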
tokentrace is built on mist-go, the shared core library for the MIST stack. It uses mist-go's Transport interface, retry logic, batching, and HTTP delivery.

You do not need to interact with mist-go directly to use tokentrace. The zero-dependency guarantee holds: the tokentrace binary has no runtime dependencies beyond the Go standard library.
Cost visibility — Know exactly what your AI features cost per call, per user, and per model. Set budget alerts before you exceed a spending threshold.
Latency monitoring — Track p50, p95, and p99 latency for each model and workflow. Detect regressions when providers degrade or when input length trends upward.
Quality tracking — Attach matchspec eval scores to spans. Track quality as a time-series metric. Alert on drops before users file support tickets.
Model comparison — Run two models behind a feature flag, attribute spans to each variant, and compare cost, latency, and quality directly in Grafana.
Agent debugging — Instrument each step of a multi-step agent loop with a child span. Trace the full token cost of a complex workflow. Identify which step is the latency bottleneck.
Budget enforcement — Set a hard cost limit per workflow. tokentrace’s Budget helper accumulates span costs and returns an error when the limit is reached, stopping the agent loop before it overspends.