LLM inference costs add up fast. A production system handling 100k requests per day at $0.002 average per request is $200/day — $6,000/month. infermux gives you three levers to control this: routing strategy (send requests to the cheapest eligible provider), model routing (use expensive models only where they matter), and budgets (cap per-caller spend before it reaches the provider).
The cost_weighted strategy selects the provider expected to minimize cost for each request:
```yaml
routing:
  strategy: cost_weighted
```
Cost is estimated before the request is sent using the prompt token count and the pricing table. Completion tokens are unknown before the response, so the estimate uses only prompt tokens for selection. This is a reasonable approximation: prompt tokens account for most of the cost on long-context requests, and for short prompts the absolute dollar difference between providers is small.
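The selection logic above can be sketched in a few lines. This is a minimal illustration, not infermux's actual implementation; the `PRICING` table, its rates, and the function names are assumptions:

```python
# Hypothetical per-provider pricing table, USD per 1M prompt tokens.
# Rates here are placeholders; real rates come from the configured
# pricing table.
PRICING = {
    "openai": 2.50,
    "anthropic": 3.00,
    "ollama": 0.0,  # local model: zero-cost entry
}

def estimated_cost(provider: str, prompt_tokens: int) -> float:
    # Completion tokens are unknown before the response, so only
    # prompt tokens enter the pre-request estimate.
    return prompt_tokens / 1_000_000 * PRICING[provider]

def cheapest(providers: list[str], prompt_tokens: int) -> str:
    # Pick the provider with the lowest estimated cost for this request.
    return min(providers, key=lambda p: estimated_cost(p, prompt_tokens))

print(cheapest(["openai", "anthropic", "ollama"], 2_000))  # ollama
```

With a zero-cost local entry present, the estimate always prefers it, which is exactly the behavior described for Ollama below.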
With three providers configured — OpenAI, Anthropic, and Ollama (local) — the cost-weighted strategy will route to Ollama for any model that has a zero-cost entry in the pricing table, then to the cheapest hosted provider.
The most impactful optimization is using a cheap model for tasks that don’t need a powerful one. Use route groups to send different model tiers to different providers and strategies:
```yaml
routing:
  strategy: priority   # fallback for unmatched requests
  groups:
    # High-volume simple tasks: cost-optimized, cheap models
    - name: fast
      models:
        - gpt-4o-mini
        - claude-haiku-3-5
      strategy: cost_weighted
      providers: [openai, anthropic, groq]
    # Complex reasoning: priority routing, expensive models
    - name: reasoning
      models:
        - gpt-4o
        - o1
        - claude-opus-4-5
      strategy: priority
      providers: [openai, anthropic]
    # Embeddings: always route to cheapest available
    - name: embeddings
      models:
        - text-embedding-3-small
      strategy: cost_weighted
      providers: [openai]
```
With this config, simple chat requests (gpt-4o-mini) are routed cost-optimally across three providers. Complex reasoning requests always go to OpenAI first, falling back to Anthropic. Embeddings are handled separately and always use the cheapest available option.
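The group-matching behavior can be sketched as follows. The dictionary mirrors the YAML above, but the lookup logic and function name are assumptions about how matching works, not infermux internals:

```python
# Route-group matching sketch: the requested model selects a group;
# unmatched models fall back to the top-level strategy.
GROUPS = {
    "fast": {
        "models": {"gpt-4o-mini", "claude-haiku-3-5"},
        "strategy": "cost_weighted",
    },
    "reasoning": {
        "models": {"gpt-4o", "o1", "claude-opus-4-5"},
        "strategy": "priority",
    },
    "embeddings": {
        "models": {"text-embedding-3-small"},
        "strategy": "cost_weighted",
    },
}

def strategy_for(model: str, default: str = "priority") -> str:
    # First group whose model list contains the requested model wins.
    for group in GROUPS.values():
        if model in group["models"]:
            return group["strategy"]
    return default  # top-level fallback for unmatched requests

print(strategy_for("gpt-4o-mini"))  # cost_weighted
```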
Consider a summarization pipeline that processes 50,000 documents per day. Each document averages 2,000 input tokens and 200 output tokens.
Without cost optimization — all requests to OpenAI GPT-4o:
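A back-of-envelope estimate, assuming representative GPT-4o list pricing of roughly $2.50 per 1M input tokens and $10.00 per 1M output tokens (these rates change; check current pricing):

```python
# Baseline: 50,000 docs/day, 2,000 input + 200 output tokens each,
# all on GPT-4o at assumed list pricing.
docs_per_day = 50_000
input_cost = docs_per_day * 2_000 / 1e6 * 2.50    # $250/day
output_cost = docs_per_day * 200 / 1e6 * 10.00    # $100/day
print(input_cost + output_cost)  # 350.0 -> $350/day, ~$10,500/month
```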
With cost optimization — route summarization to GPT-4o-mini:
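The same workload, assuming GPT-4o-mini pricing of roughly $0.15 per 1M input tokens and $0.60 per 1M output tokens:

```python
# Same 50,000 docs/day workload routed to GPT-4o-mini at assumed
# list pricing.
input_cost = 50_000 * 2_000 / 1e6 * 0.15   # $15/day
output_cost = 50_000 * 200 / 1e6 * 0.60    # $6/day
print(input_cost + output_cost)  # 21.0 -> $21/day, ~$630/month
```

$21/day is about 6% of the $350/day GPT-4o baseline.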
That is a 94% cost reduction on this workload. The tradeoff is output quality, which you should verify with matchspec on a holdout set before switching.
With local Ollama, routing summarization to llama3.2 drops the per-token provider cost to zero (local models carry a zero-cost entry in the pricing table); the remaining costs are hardware and electricity, and the quality gap is larger, so the holdout verification step matters even more.
In practice a realistic optimization is to route 80% of traffic (simple queries, short contexts) to a cheap hosted or local model, and keep 20% (long contexts, complex reasoning, high-stakes outputs) on GPT-4o. If your average cost before optimization is $0.002/request, a routing strategy that achieves this 80/20 split gets you to roughly $0.0005/request, a 75% reduction.
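The 80/20 arithmetic checks out under one assumption: that the cheap tier costs about 1/16 of GPT-4o per request, a hypothetical ratio close to the 4o vs. 4o-mini price gap:

```python
# Blended-cost check for the 80/20 split.
expensive = 0.002          # $/request before optimization
cheap = expensive / 16     # assumed cheap-tier cost ratio (~4o vs 4o-mini)

avg = 0.8 * cheap + 0.2 * expensive
print(round(avg, 5))  # 0.0005 -> a 75% reduction from $0.002
```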
For multi-tenant systems where different callers should have different spend caps:
```yaml
budgets:
  state_file: "/var/lib/infermux/budgets.json"
  limits:   # per-caller caps
    - caller: "free-tier-user"
      monthly_usd: 1.00
      action: hard_stop
    - caller: "pro-user"
      monthly_usd: 20.00
      action: alert
    - caller: "enterprise-user"
      monthly_usd: 500.00
      action: alert
    - caller: "batch-pipeline"
      monthly_usd: 100.00
      action: model_override
      model_override: "gpt-4o-mini"   # downgrade instead of rejecting
```
The model_override action is particularly useful for batch pipelines: when the budget nears the limit, infermux transparently downgrades to a cheaper model rather than failing the request. The caller receives a response with X-Infermux-Model-Override: gpt-4o-mini to signal the substitution.
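The decision among the three actions can be sketched as below. This is a simplified model of the behavior described above, using a hard spend-vs-limit check; the function name and signature are illustrative, not infermux internals:

```python
# Sketch of budget enforcement: returns the (possibly downgraded)
# model plus any response headers to surface the substitution.
def apply_budget(model, spend_usd, limit_usd, action, override=None):
    if spend_usd < limit_usd:
        return model, {}
    if action == "model_override" and override:
        # Downgrade instead of rejecting; signal via response header.
        return override, {"X-Infermux-Model-Override": override}
    if action == "hard_stop":
        raise RuntimeError("budget_exceeded")  # surfaced as HTTP 429
    return model, {}  # "alert" only notifies; the request proceeds

print(apply_budget("gpt-4o", 101.0, 100.0, "model_override", "gpt-4o-mini"))
# ('gpt-4o-mini', {'X-Infermux-Model-Override': 'gpt-4o-mini'})
```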
Pass the caller identity in the request header:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "X-Infermux-Caller: free-tier-user" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [...]}'
```
When the free-tier-user’s $1.00 monthly budget is exhausted:
```http
HTTP/1.1 429 Too Many Requests
X-Infermux-Error: budget_exceeded

{
  "error": {
    "message": "monthly budget of $1.00 exceeded; current spend: $1.003",
    "type": "infermux_error",
    "code": "budget_exceeded"
  }
}
```
The management API gives you current-month cost data broken down multiple ways:
```bash
# Total by provider
curl http://localhost:8081/_infermux/costs

# By caller — see which callers are driving spend
curl http://localhost:8081/_infermux/costs/by-caller

# By model — identify expensive model usage
curl http://localhost:8081/_infermux/costs/by-model

# Hourly time series for the last 24 hours
curl "http://localhost:8081/_infermux/costs/timeseries?hours=24&bucket=1h"
```
Example /costs/by-caller output:
```json
{
  "period": "2026-03",
  "callers": [
    {
      "caller": "batch-pipeline",
      "total_usd": 84.20,
      "budget_usd": 100.00,
      "budget_remaining_usd": 15.80,
      "requests": 42100,
      "prompt_tokens": 84200000,
      "completion_tokens": 8420000
    },
    {
      "caller": "user-abc123",
      "total_usd": 0.83,
      "budget_usd": 1.00,
      "budget_remaining_usd": 0.17,
      "requests": 415
    }
  ]
}
```
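This output is easy to post-process. A hypothetical helper that flags callers above 80% of their monthly budget, using the response shape shown above (the threshold and variable names are illustrative):

```python
import json

# In practice this payload would come from
# GET /_infermux/costs/by-caller; inlined here for self-containment.
payload = json.loads("""{
  "period": "2026-03",
  "callers": [
    {"caller": "batch-pipeline", "total_usd": 84.20, "budget_usd": 100.00},
    {"caller": "user-abc123", "total_usd": 0.83, "budget_usd": 1.00}
  ]
}""")

# Flag any caller that has consumed >= 80% of its budget.
near_limit = [
    c["caller"]
    for c in payload["callers"]
    if c.get("budget_usd") and c["total_usd"] / c["budget_usd"] >= 0.8
]
print(near_limit)  # ['batch-pipeline', 'user-abc123']
```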
When running the full MIST stack, infermux emits cost events as mist-go metrics that tokentrace collects. This gives you cost data in your existing observability stack alongside latency and quality metrics:
```yaml
# tokentrace.yml
sources:
  - type: infermux
    address: "http://localhost:8080"

alerts:
  - name: daily-spend-spike
    metric: infermux.cost_usd
    condition: "rate_1h > 15"   # alert if hourly spend exceeds $15
    severity: warning
```
With this setup, you get spend data in Grafana dashboards, time-series cost queries via tokentrace’s HTTP API, and alerts when spend spikes unexpectedly.