Open‑source, K8s‑deployable AI Gateway for local & remote LLMs with quotas, streaming, routing, and observability.
- Overview
- Features
- Get Started
- Backends Overview
- API Overview
- Configuration
- Observability
- Quotas & Limits
- Deploy to Kubernetes
- Security
- CLI
## Overview

pine-gate gives you a single, stable HTTP API in front of multiple LLMs (local and hosted). It handles authentication, rate limiting, usage counting, routing (including canaries), streaming responses, and telemetry, so applications can focus on product logic rather than provider differences.
## Features

- Multiple model backends: use local and hosted LLMs behind one API. Echo for quick tests, Ollama for local models, vLLM via OpenAI‑compatible endpoints, plus OpenAI, OpenRouter, and Anthropic.
- Smart request routing: direct traffic by model rules or roll out changes safely with weighted canary splits.
- Real‑time streaming: stream tokens end‑to‑end over SSE for responsive UIs and CLIs.
- Built‑in safeguards: per‑key rate limiting (in‑memory or Redis) and simple usage counters with an admin query endpoint.
- First‑class observability: Prometheus metrics labeled by route and backend, and OpenTelemetry traces exported to your collector (OTLP).
- Resilience controls: configurable retries and circuit breaking around backends to smooth over transient failures.
- Production‑ready on Kubernetes: minimal, non‑root image; secure defaults; Helm chart with ServiceMonitor and optional HPA.
- Easy local development: `make run`, `.env` support, and a tiny `pinectl` CLI to manage and test locally.
## Get Started

This path gets you from zero to a working gateway locally, then shows how to try a real backend.
- Run the gateway with the example config
```bash
CONFIG_FILE=./configs/config.example.yaml make run
```
Check health and send a test request using the built‑in echo backend:
```bash
curl -i http://localhost:8080/healthz

curl -s -H 'x-api-key: dev-key' -H 'Content-Type: application/json' \
  -X POST http://localhost:8080/v1/completions -d '{"model":"echo","prompt":"hello"}'
```
- Enable a real backend (example: OpenRouter)
```bash
PINE_GATE_BACKENDS_OPENROUTER_ENABLED=true \
PINE_GATE_BACKENDS_OPENROUTER_APIKEY=<YOUR_KEY> \
CONFIG_FILE=./configs/config.example.yaml make run
```
Then request a model via the `openrouter:` prefix:
```bash
curl -s -H 'x-api-key: dev-key' -H 'Content-Type: application/json' -X POST \
  http://localhost:8080/v1/completions \
  -d '{"model":"openrouter:mistralai/mistral-7b-instruct:free","prompt":"hello"}'
```
See Backends for other providers and examples.
- Optional: place settings in `.env`. pine-gate loads a `.env` file automatically, so you can put your environment variables there instead of prefixing commands:
```bash
# .env
PINE_GATE_BACKENDS_OLLAMA_ENABLED=true
PINE_GATE_BACKENDS_OLLAMA_HOST=http://localhost:11434
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4318
PINE_GATE_LIMITS_RATE_RPS=5
PINE_GATE_LIMITS_BURST=10
```
With `.env` present, start with:
```bash
make run
# or
./bin/pinectl serve --config ./configs/config.example.yaml
```
- Optional: Redis for shared rate limits and usage
```bash
docker run --rm -p 6379:6379 redis:7

PINE_GATE_REDIS_ENABLED=true PINE_GATE_REDIS_ADDR=localhost:6379 \
PINE_GATE_AUTH_ADMIN_KEY=admin CONFIG_FILE=./configs/config.example.yaml make run

curl -s -H 'x-admin-key: admin' 'http://localhost:8080/v1/usage?key=dev-key'
```
- Optional: Tracing to Jaeger (OTLP)
```bash
docker run --rm -p 16686:16686 -p 4318:4318 \
  -e COLLECTOR_OTLP_ENABLED=true jaegertracing/all-in-one:1.57

OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4318 CONFIG_FILE=./configs/config.example.yaml make run
```
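After a request or two, traces should show up in the Jaeger UI (served on port 16686 by the container above). A quick way to generate one:

```bash
# Send a request so the gateway emits a trace, then open the Jaeger UI
curl -s -H 'x-api-key: dev-key' -H 'Content-Type: application/json' \
  -X POST http://localhost:8080/v1/completions -d '{"model":"echo","prompt":"trace me"}'

# macOS `open`; use xdg-open on Linux
open http://localhost:16686
```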
## Backends Overview

Choose a backend by prefixing the model (e.g., `openai:gpt-4o-mini`, `ollama:llama3`). Enable providers via environment variables.
- Echo: built‑in for local testing (no network call)
- Ollama: local models via `ollama:<model>`
- vLLM: OpenAI‑compatible server via `vllm:<model>`
- OpenAI: hosted models via `openai:<model>`
- OpenRouter: marketplace via `openrouter:<model>`
- Anthropic: hosted models via `anthropic:<model>`

Read more: docs/backends.md
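For example, with Ollama enabled (as in the `.env` above) and a model such as `llama3` pulled locally, requests route through the `ollama:` prefix (the model name here is illustrative):

```bash
# Route a completion to a local Ollama model
curl -s -H 'x-api-key: dev-key' -H 'Content-Type: application/json' \
  -X POST http://localhost:8080/v1/completions \
  -d '{"model":"ollama:llama3","prompt":"hello"}'
```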
## API Overview

Two core endpoints power synchronous and streaming use cases.
- `POST /v1/completions` — JSON request `{ model, prompt }` → `{ model, output }`
- `GET /v1/stream` — SSE stream of tokens, with `model` and `prompt` as query params

Health and metrics are also available:

- `GET /healthz` — service health
- `GET /metrics` — Prometheus metrics

Read more: docs/api.md and docs/openapi.yaml
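For instance, the streaming endpoint can be exercised with plain `curl` (`-N` disables buffering so tokens print as they arrive); this sketch assumes the same `x-api-key` auth as the completions endpoint:

```bash
# Stream tokens from the echo backend over SSE
curl -N -H 'x-api-key: dev-key' \
  'http://localhost:8080/v1/stream?model=echo&prompt=hello'
```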
## Configuration

Configuration comes from environment variables, a YAML file, and built‑in defaults. Environment variables (including those loaded from `.env`) take precedence.
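For example, an env var set at launch overrides whatever the YAML file says, so you can tweak a value without editing the file (here, the rate‑limit variable shown earlier):

```bash
# Env var beats the YAML value for the same setting
PINE_GATE_LIMITS_RATE_RPS=2 CONFIG_FILE=./configs/config.example.yaml make run
```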
Read more: docs/configuration.md
## Observability

Prometheus metrics include request rate, latency, errors, and backend labels. OpenTelemetry spans trace requests and backend calls. Read more: docs/observability.md and docs/tracing.md
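To eyeball the metrics locally, scrape the endpoint directly; exact metric names vary by version, so grep broadly:

```bash
# List pine-gate's Prometheus metrics and filter for the interesting ones
curl -s http://localhost:8080/metrics | grep -iE 'request|latency|backend' | head -20
```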
## Quotas & Limits

Per‑key token buckets enforce rate limits. Use Redis to share limits and usage counters across replicas. Read more: docs/quotas-limits.md
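A quick way to watch the limiter kick in; this sketch assumes over-limit requests are rejected with a 4xx status (HTTP 429 is typical, but see docs/quotas-limits.md for the exact behavior):

```bash
# With RATE_RPS=5 and BURST=10 (as in the .env above), a burst of 20
# requests should show a mix of accepted and rejected status codes.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'x-api-key: dev-key' -H 'Content-Type: application/json' \
    -X POST http://localhost:8080/v1/completions -d '{"model":"echo","prompt":"hi"}'
done | sort | uniq -c
```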
## Deploy to Kubernetes

Use the Helm chart for production‑grade defaults and easy toggles. ServiceMonitor, HPA, and OTel are available via values. Quick install:
```bash
helm install pine-gate charts/pine-gate --set auth.apiKey=dev-key
kubectl port-forward deploy/pine-gate 8080:8080
```
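With the port-forward running, the same health check from Get Started works against the cluster:

```bash
# Smoke-test the in-cluster gateway through the port-forward
curl -i http://localhost:8080/healthz
```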
Read more: docs/deploy-k8s.md and charts/pine-gate/README.md
## Security

The container runs as non‑root with a read‑only filesystem and dropped capabilities; security contexts are set in the chart. Read more: docs/security.md
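One way to confirm the hardened settings on a running release (the field path assumes the chart's default single-container Deployment named `pine-gate`, as used by the port-forward above):

```bash
# Print the container securityContext the chart applied
kubectl get deploy pine-gate \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'
```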
## CLI

`pinectl` helps you run the gateway locally, print the effective config, send test requests, and open a tiny TUI dashboard.
Read more: docs/cli.md
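Only the `serve` subcommand appears in this README (it started the gateway in Get Started); the full command set, including config printing, test requests, and the TUI, is documented in docs/cli.md:

```bash
# Start the gateway from the CLI against a config file
./bin/pinectl serve --config ./configs/config.example.yaml
```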
