HelixLLM exposes three groups of endpoints: public API (OpenAI/Anthropic compatible), agent API, and internal management API. All endpoints are served over HTTPS on the configured port (default 8443).
When HELIX_AUTH_API_KEYS is configured, all /v1/* endpoints require a Bearer token:
Authorization: Bearer sk-your-api-key
Internal endpoints (/internal/*) do not require authentication.
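As a minimal sketch of the authentication rule above, the helper below attaches the Bearer token only to /v1/* paths. The name `build_headers` and the path `/internal/example` are hypothetical and chosen for illustration only; they are not part of HelixLLM.

```python
from typing import Optional


def build_headers(path: str, api_key: Optional[str] = None) -> dict:
    """Return HTTP headers for a request to `path`.

    Hypothetical helper for illustration; not part of HelixLLM.
    """
    headers = {"Content-Type": "application/json"}
    # /v1/* endpoints expect a Bearer token when HELIX_AUTH_API_KEYS is set.
    if path.startswith("/v1/") and api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    # /internal/* endpoints are sent without an Authorization header.
    return headers


public_headers = build_headers("/v1/chat/completions", "sk-your-api-key")
internal_headers = build_headers("/internal/example")  # hypothetical path
print(public_headers)
print(internal_headers)
```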
These endpoints match the OpenAI API specification. Any OpenAI SDK client works without modification: point its base URL at your HelixLLM instance.
Create a chat completion. Supports SSE streaming when stream: true.
Request:
{
"model": "Llama-3.1-70B-Instruct-Q4_K_M",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the ReAct pattern."}
],
"temperature": 0.7,
"max_tokens": 1024,
"stream": false
}
Response (non-streaming):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1712188800,
"model": "Llama-3.1-70B-Instruct-Q4_K_M",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The ReAct pattern combines reasoning and acting..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 150,
"total_tokens": 175
}
}
Streaming response: When stream: true, the server responds with Content-Type: text/event-stream. Each SSE event contains a delta:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The "},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"ReAct "},"finish_reason":null}]}
data: [DONE]
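The stream above can be reassembled on the client by concatenating each chunk's delta content until the [DONE] sentinel arrives. The following is a sketch that assumes the response has already been split into text lines; `iter_deltas` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# Sample lines copied from the streaming response shown above.
sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The "},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"ReAct "},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_deltas(sample))
print(repr(text))
```

In practice the OpenAI SDK performs this parsing itself when `stream=True` is passed; the sketch only shows what the wire format carries.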
Create a text completion (legacy format).
Request:
{
"model": "Llama-3.1-70B-Instruct-Q4_K_M",
"prompt": "Once upon a time",
"max_tokens": 100,
"temperature": 0.9
}
Response:
{
"id": "cmpl-abc123",
"object": "text_completion",
"created": 1712188800,
"model": "Llama-3.1-70B-Instruct-Q4_K_M",
"choices": [
{
"text": " in a land far away...",
"index": 0,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 50,
"total_tokens": 55
}
}
List all available models across all configured providers.
Response:
{
"object": "list",
"data": [
{
"id": "Llama-3.1-70B-Instruct-Q4_K_M",
"object": "model",
"created": 1712188800,
"owned_by": "local"
},
{
"id": "gpt-4o",
"object": "model",
"created": 1712188800,
"owned_by": "openai"
}
]
}
Get details for a specific model.
Response:
{
"id": "Llama-3.1-70B-Instruct-Q4_K_M",
"object": "model",
"created": 1712188800,
"owned_by": "local"
}
Generate embeddings for input text.
Request:
{
"model": "all-mpnet-base-v2",
"input": "The quick brown fox jumps over the lazy dog"
}
Response:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0123, -0.0456, 0.0789, ...]
}
],
"model": "all-mpnet-base-v2",
"usage": {
"prompt_tokens": 9,
"total_tokens": 9
}
}
These endpoints match the Anthropic Messages API specification. The anthropic-version header is respected.
Create a message using the Anthropic Messages format.
Request:
{
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "What is HelixLLM?"}
]
}
Headers:
Content-Type: application/json
anthropic-version: 2023-06-01
Response (non-streaming):
{
"id": "msg_abc123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "HelixLLM is an enterprise-grade distributed LLM system..."
}
],
"model": "claude-sonnet-4-20250514",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 12,
"output_tokens": 150
}
}
Streaming: Set "stream": true in the request body. The response uses SSE with Anthropic's event format:
event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","model":"claude-sonnet-4-20250514"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"HelixLLM "}}
event: message_stop
data: {"type":"message_stop"}
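In this format each event is an `event:` line followed by a `data:` line, and the message text is carried by `content_block_delta` events. The sketch below pairs the two lines and joins the text deltas; `parse_events` and `collect_text` are hypothetical helper names, and the input is assumed to be already split into lines.

```python
import json
from typing import Iterable, List, Tuple


def parse_events(lines: Iterable[str]) -> List[Tuple[str, dict]]:
    """Pair each `event:` line with the `data:` line that follows it."""
    events = []
    event_name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:") and event_name is not None:
            payload = json.loads(line[len("data:"):].strip())
            events.append((event_name, payload))
            event_name = None
    return events


def collect_text(events: List[Tuple[str, dict]]) -> str:
    """Join the text carried by content_block_delta events."""
    parts = []
    for name, data in events:
        delta = data.get("delta", {})
        if name == "content_block_delta" and delta.get("type") == "text_delta":
            parts.append(delta.get("text", ""))
    return "".join(parts)


# Sample lines copied from the streaming response shown above.
sample = [
    "event: message_start",
    'data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","model":"claude-sonnet-4-20250514"}}',
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"HelixLLM "}}',
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
events = parse_events(sample)
message_text = collect_text(events)
print(repr(message_text))
```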
Run the ReAct agent loop. The agent reasons about the query, optionally calls tools, and returns a final response. Supports multi-turn sessions via session_id.
Request:
{
"session_id": "sess_abc123",
"messages": [
{"role": "user", "content": "What time is it?"}
],
"model": "Llama-3.1-70B-Instruct-Q4_K_M"
}
- session_id (optional): When provided, conversation history is preserved across calls.
- messages (required): New messages for this turn.
- model (optional): Model hint for provider selection.
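Because history is kept server-side for a given session_id, a follow-up turn only needs to send its new messages. The sketch below builds two such request bodies; `agent_request` is a hypothetical helper and the second question is an invented example.

```python
from typing import Optional


def agent_request(
    content: str,
    session_id: Optional[str] = None,
    model: Optional[str] = None,
) -> dict:
    """Build a request body for one agent turn (hypothetical helper)."""
    # Only the new user message for this turn is included.
    body = {"messages": [{"role": "user", "content": content}]}
    if session_id is not None:
        body["session_id"] = session_id  # reuse to continue a conversation
    if model is not None:
        body["model"] = model  # optional hint for provider selection
    return body


first_turn = agent_request("What time is it?", session_id="sess_abc123")
follow_up = agent_request("And one hour from now?", session_id="sess_abc123")
print(first_turn)
print(follow_up)
```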
Response:
{
"session_id": "sess_abc123",
"response": {
"message": {
"role": "assistant",
"content": "The current time is 2026-04-04T14:30:00Z."
},
"usage": {
"prompt_tokens": 50,
"completion_tokens": 20,
"total_tokens": 70
}
}
}
List all registered tools available to the agent.
Response:
{
"tools": [
{
"name": "echo",
"description": "Echoes back the input message",
"parameters": {
"message": {"type": "string", "description": "Message to echo"}
}
},
{
"name": "time",
"description": "Returns the current UTC time",
"parameters": {}
},
{
"name": "knowledge_query",
"description": "Query the knowledge base for relevant documents",
"parameters": {
"query": {"type": "string", "description": "Search query"}
}
}
]
}
Internal endpoints for the RAG knowledge pipeline. These are under /internal/ and do not require API key authentication.
Ingest a document into the knowledge base. The document is chunked, embedded, and stored in the vector database.
Request:
{
"content": "HelixLLM is an enterprise-grade distributed LLM system built in Go...",
"collection": "docs",
"metadata": {
"source": "readme.md",
"title": "HelixLLM Overview"
}
}
Response:
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"collection": "docs",
"chunks": 3,
"status": "completed"
}
Query the knowledge base using semantic search.
Request:
{
"query": "How does the mode system work?",
"collection": "docs",
"top_k": 5
}
Response:
{
"results": [
{
"content": "HelixLLM compiles to a single binary that operates in one of six modes...",
"score": 0.92,
"metadata": {
"source": "architecture.md",
"chunk_index": 0
}
}
]
}
List all vector collections.
Response:
[
{
"name": "default",
"document_count": 42,
"chunk_count": 156
}
]
Get knowledge base statistics.
Response:
{
"total_documents": 42,
"total_chunks": 156,
"collections": 1,
"embedding_dimensions": 768
}
Internal endpoints for multi-host cluster management.
Get the current cluster status including host health and service placements.
Response:
{
"checked_at": "2026-04-04T14:30:00Z",
"healthy": true,
"hosts": [],
"deployments": []
}
Probe all configured hosts via SSH to detect capabilities (OS, CPU, RAM, GPU, container runtime).
Response:
{
"hosts": [
{
"hostname": "nezha.local",
"reachable": true,
"os": "linux",
"cpu_cores": 16,
"memory_mb": 65536,
"gpu": "NVIDIA RTX 4090",
"container_runtime": "podman"
}
]
}
Schedule and deploy services to the cluster.
Request:
{
"services": [
{
"name": "llama-cpp",
"image": "ghcr.io/ggml-org/llama.cpp:server-cuda",
"requires_gpu": true,
"memory_mb": 16384
}
]
}
Response:
{
"deployments": [
{
"service_name": "llama-cpp",
"host": "nezha.local",
"status": "running"
}
],
"placements": [
{
"service": "llama-cpp",
"host": "nezha.local",
"strategy": "gpu-affinity"
}
]
}
Re-evaluate and rebalance placement of existing deployments based on current host conditions.
Response:
{
"placements": [],
"deployments": [],
"errors": []
}
Aggregated health check across all subsystems.
Response (healthy):
{
"status": "healthy",
"checks": []
}
Response (unhealthy): Returns HTTP 503 with:
{
"status": "unhealthy",
"checks": [
{
"name": "database",
"status": "unhealthy",
"error": "connection refused"
}
]
}
All endpoints return errors in a consistent format:
{
"error": {
"message": "invalid request body: missing required field 'messages'",
"type": "invalid_request_error"
}
}
Common HTTP status codes:
| Code | Meaning |
|---|---|
| 400 | Bad request (invalid JSON, missing fields, validation error) |
| 401 | Unauthorized (missing or invalid API key) |
| 429 | Rate limited (too many requests) |
| 500 | Internal server error |
| 503 | Service unavailable (health check failed) |
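A client can turn the error envelope and status codes above into a typed exception. The following is a sketch only: `HelixAPIError` and `raise_for_error` are hypothetical names, and treating 429 and 503 as retryable is an assumed client-side policy, not documented server behavior.

```python
import json


class HelixAPIError(Exception):
    """Hypothetical exception wrapping the documented error envelope."""

    def __init__(self, status: int, error_type: str, message: str):
        super().__init__(f"{status} {error_type}: {message}")
        self.status = status
        self.error_type = error_type
        self.message = message

    @property
    def retryable(self) -> bool:
        # Assumption: rate limiting and unavailability are worth retrying.
        return self.status in (429, 503)


def raise_for_error(status: int, body: str) -> None:
    """Raise HelixAPIError for a 4xx/5xx response; return None otherwise."""
    if status < 400:
        return
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = {}  # body was not JSON; fall back to the raw text below
    error = parsed.get("error", {}) if isinstance(parsed, dict) else {}
    raise HelixAPIError(
        status,
        error.get("type", "unknown_error"),
        error.get("message", body),
    )


# Error body copied from the example above.
sample_body = json.dumps({
    "error": {
        "message": "invalid request body: missing required field 'messages'",
        "type": "invalid_request_error",
    }
})
try:
    raise_for_error(400, sample_body)
except HelixAPIError as exc:
    caught = exc
    print(exc.error_type, exc.retryable)
```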
Compression: The server supports Brotli (br) and gzip compression. Set the Accept-Encoding header:
Accept-Encoding: br, gzip
Brotli is preferred when both are accepted.
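Many HTTP client libraries decompress responses transparently, but when handling raw bytes the Content-Encoding response header tells the client how to decode the body. The sketch below assumes a hypothetical helper `decode_body`; gzip uses the standard library, while Brotli relies on the third-party `brotli` package, which is imported only if needed.

```python
import gzip
import json


def decode_body(raw: bytes, content_encoding: str = "") -> dict:
    """Decompress a response body per Content-Encoding and parse it as JSON."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "br":
        import brotli  # third-party package; assumed to be installed

        raw = brotli.decompress(raw)
    elif encoding not in ("", "identity"):
        raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")
    return json.loads(raw.decode("utf-8"))


# Simulate a gzip-compressed JSON response body locally.
compressed = gzip.compress(json.dumps({"status": "healthy", "checks": []}).encode("utf-8"))
decoded = decode_body(compressed, "gzip")
print(decoded)
```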
Serialization: The default format is JSON. When TOON is enabled (HELIX_FEATURE_TOON=true), you can request TOON format:
Accept: application/toon
# OpenAI Python SDK: base_url includes the /v1 prefix of your HelixLLM instance.
from openai import OpenAI

client = OpenAI(
    base_url="https://localhost:8443/v1",
    api_key="your-api-key",
)
response = client.chat.completions.create(
    model="Llama-3.1-70B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Print the assistant's reply from the first choice.
print(response.choices[0].message.content)

# Anthropic Python SDK: base_url is the server root, without the /v1 suffix.
from anthropic import Anthropic

client = Anthropic(
    base_url="https://localhost:8443",
    api_key="your-api-key",
)
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
# Print the text of the first content block.
print(response.content[0].text)