HTTP API

infermux exposes two sets of endpoints: the inference API, which is OpenAI-compatible, and the management API, which provides health, status, and cost information.

Authentication

Inference endpoints (/v1/*): infermux can operate in two authentication modes.

Passthrough (default): the Authorization: Bearer <token> header from the caller is forwarded to the provider. If your application already manages API keys, this requires no additional configuration.
Static key: infermux validates incoming requests against a configured static key and uses its own provider credentials for outbound requests. This is the right model when you want to centralize API key management in infermux and not expose provider keys to callers.

auth:
  mode: static_key
  key: "${INFERMUX_API_KEY}"   # callers must send this as Bearer token

Management endpoints (/_infermux/*): Protected by INFERMUX_MANAGEMENT_TOKEN if set. Otherwise unauthenticated. Always bind the management listener to localhost or an internal network when running in production.

Inference API

POST /v1/chat/completions

Create a chat completion. This is the primary inference endpoint.

Request:

{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}

All standard OpenAI chat completion parameters are accepted and forwarded to the selected provider. Parameters that a provider does not support are silently dropped (for example, logprobs when routing to Anthropic).

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1741046400,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 3,
    "total_tokens": 31
  }
}

Response headers (infermux-specific):

Header	Description
`X-Infermux-Provider`	Name of the provider that served the request
`X-Infermux-Model`	Model name as understood by the provider
`X-Infermux-Strategy`	Routing strategy that selected the provider
`X-Infermux-Route-Group`	Route group that matched, if any
`X-Infermux-Latency-Ms`	Provider response time in milliseconds
`X-Infermux-Cost`	Estimated cost in USD
`X-Infermux-Prompt-Tokens`	Prompt token count
`X-Infermux-Completion-Tokens`	Completion token count
`X-Infermux-Model-Override`	Set when a model was substituted (e.g., budget downgrade)

Streaming:

Set "stream": true to receive a server-sent events (SSE) stream in the standard OpenAI format. infermux streams the response from the provider with minimal buffering. The cost and latency headers are included in the final 200 OK response headers.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Count to five"}], "stream": true}'

POST /v1/completions

Text completions (legacy). Forwarded to providers that support it.

{
  "model": "gpt-3.5-turbo-instruct",
  "prompt": "The capital of France is",
  "max_tokens": 10
}

POST /v1/embeddings

Generate embeddings. Routed to embedding-capable providers.

{
  "model": "text-embedding-3-small",
  "input": "The quick brown fox"
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0142, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}

Management API

All management endpoints are prefixed with /_infermux/.

GET /_infermux/health

Liveness check. Returns 200 if the server is running, regardless of provider health.

{"status": "ok"}

GET /_infermux/status

Overall status including provider health summary.

{
  "version": "0.4.0",
  "uptime_seconds": 86400,
  "providers_healthy": 2,
  "providers_total": 3,
  "requests_total": 481200,
  "requests_last_minute": 847,
  "errors_last_minute": 12
}

GET /_infermux/providers

List all providers with health, circuit state, and routing metadata.

[
  {
    "name": "openai",
    "type": "openai",
    "healthy": true,
    "circuit": "closed",
    "error_rate": 0.014,
    "p95_latency_ms": 412,
    "requests_last_minute": 624,
    "models": ["gpt-4o", "gpt-4o-mini"],
    "rate_limit_headroom_rpm": 2876
  }
]

POST /_infermux/providers/{name}/circuit/open

Force a circuit open (remove provider from routing).

POST /_infermux/providers/{name}/circuit/close

Force a circuit closed (restore provider to routing, bypasses probe).

POST /_infermux/providers/{name}/circuit/reset

Reset circuit state and clear error rate statistics.

GET /_infermux/metrics

Prometheus-compatible metrics exposition. Compatible with any Prometheus scraper.

# HELP infermux_requests_total Total inference requests by provider and model
# TYPE infermux_requests_total counter
infermux_requests_total{provider="openai",model="gpt-4o-mini",status="ok"} 47821
infermux_requests_total{provider="anthropic",model="claude-haiku-3-5",status="ok"} 12043
infermux_requests_total{provider="openai",model="gpt-4o",status="error"} 14

# HELP infermux_latency_seconds Inference request latency
# TYPE infermux_latency_seconds histogram
infermux_latency_seconds_bucket{provider="openai",model="gpt-4o-mini",le="0.5"} 38400
infermux_latency_seconds_bucket{provider="openai",model="gpt-4o-mini",le="1.0"} 46200
infermux_latency_seconds_bucket{provider="openai",model="gpt-4o-mini",le="5.0"} 47800
infermux_latency_seconds_bucket{provider="openai",model="gpt-4o-mini",le="+Inf"} 47821
infermux_latency_seconds_sum{provider="openai",model="gpt-4o-mini"} 22841.4
infermux_latency_seconds_count{provider="openai",model="gpt-4o-mini"} 47821

# HELP infermux_cost_usd_total Total estimated cost in USD
# TYPE infermux_cost_usd_total counter
infermux_cost_usd_total{provider="openai",model="gpt-4o-mini"} 7.23
infermux_cost_usd_total{provider="anthropic",model="claude-haiku-3-5"} 9.61

# HELP infermux_circuit_state Circuit breaker state (0=closed, 1=half-open, 2=open)
# TYPE infermux_circuit_state gauge
infermux_circuit_state{provider="openai"} 0
infermux_circuit_state{provider="anthropic"} 2

GET /_infermux/costs

Aggregated cost report. See Cost Tracking for the full schema and query parameters.

Error responses

When infermux cannot route a request, it returns an OpenAI-format error response with an HTTP 4xx or 5xx status:

{
  "error": {
    "message": "no healthy providers available for model gpt-4o",
    "type": "infermux_error",
    "code": "no_healthy_providers"
  }
}

Code	HTTP Status	Meaning
`no_healthy_providers`	503	All eligible providers have open circuits or failed health checks
`model_not_found`	404	No provider is configured to serve the requested model
`budget_exceeded`	429	Caller’s monthly budget has been exhausted
`rate_limit`	429	All eligible providers are at their rate limit
`upstream_error`	502	The selected provider returned an unexpected error
`upstream_timeout`	504	The selected provider did not respond within the configured timeout

← Previous CLI Reference

Next → Configuration