A sophisticated load balancing and failover plugin for OptiLLM that distributes requests across multiple LLM providers.
- 🔄 Load Balancing: Distribute requests across multiple providers using weighted, round-robin, or failover strategies
- 🏥 Health Monitoring: Automatic health checks with provider failover
- 🔌 Universal Compatibility: Works with any OptiLLM approach or plugin
- 🌍 Environment Variables: Secure configuration with environment variable support
- 📊 Performance Tracking: Monitor latency and errors per provider
- 🗺️ Model Mapping: Map model names to provider-specific deployments
# Install OptiLLM via pip
pip install optillm
# Verify installation
optillm --version
Create ~/.optillm/proxy_config.yaml:
providers:
- name: primary
base_url: https://api.openai.com/v1
api_key: ${OPENAI_API_KEY}
weight: 2
max_concurrent: 5 # Optional: limit this provider to 5 concurrent requests
model_map:
gpt-4: gpt-4-turbo-preview # Optional: map model names
- name: backup
base_url: https://api.openai.com/v1
api_key: ${OPENAI_API_KEY_BACKUP}
weight: 1
max_concurrent: 2 # Optional: limit this provider to 2 concurrent requests
routing:
strategy: weighted # Options: weighted, round_robin, failover
timeouts:
request: 30 # Maximum seconds to wait for a provider response
connect: 5 # Maximum seconds to wait for connection
queue:
max_concurrent: 100 # Maximum concurrent requests to prevent overload
timeout: 60 # Maximum seconds a request can wait in queue
# Option A: Use proxy as default for ALL requests (recommended)
optillm --approach proxy
# Option B: Start server normally (use model prefix or extra_body per request)
optillm
# With custom port
optillm --approach proxy --port 8000
# Start server with proxy as default approach
optillm --approach proxy
# Then make normal requests - proxy handles all routing automatically!
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Use "proxy-" prefix to activate the proxy plugin
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "proxy-gpt-4",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Use extra_body parameter
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello"}],
"extra_body": {
"optillm_approach": "proxy"
}
}'
Both methods will:
- Route to one of your configured providers
- Apply model mapping if configured
- Handle failover automatically
# Apply BON sampling, then route through proxy
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "bon&proxy-gpt-4",
"messages": [{"role": "user", "content": "Generate ideas"}]
}'
# Use proxy to wrap MOA approach
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Solve this problem"}],
"extra_body": {
"optillm_approach": "proxy",
"proxy_wrap": "moa"
}
}'
Each provider supports the following options:
providers:
- name: provider_name # Required: Unique identifier
base_url: https://api.url/v1 # Required: API endpoint
api_key: ${ENV_VAR} # Required: API key (supports env vars)
weight: 2 # Optional: Weight for weighted routing (default: 1)
fallback_only: false # Optional: Use only when primary providers fail
model_map: # Optional: Map model names
gpt-4: gpt-4-deployment
gpt-3.5-turbo: gpt-35-turbo
routing:
strategy: weighted # Options: weighted, round_robin, failover
# Health check configuration
health_check:
enabled: true # Enable/disable health checks
interval: 30 # Seconds between checks
timeout: 5 # Timeout for health check requests
Prevent request queue backup and handle slow/unresponsive backends:
timeouts:
request: 30 # Maximum seconds to wait for provider response (default: 30)
connect: 5 # Maximum seconds for initial connection (default: 5)
queue:
max_concurrent: 100 # Maximum concurrent requests (default: 100)
timeout: 60 # Maximum seconds in queue before rejection (default: 60)
How it works (see the sketch after this list):
- Request Timeout: Each request to a provider has a maximum time limit. If exceeded, the request is cancelled and the next provider is tried.
- Queue Management: Limits concurrent requests to prevent memory exhaustion. New requests wait up to queue.timeout seconds before being rejected.
- Automatic Failover: When a provider times out, it's marked unhealthy and the request automatically fails over to the next available provider.
- Protection: Prevents slow backends from causing queue buildup that can crash the proxy server.
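Conceptually, this behavior can be approximated with a global semaphore for the queue and a per-request timeout with failover. The sketch below is a minimal illustration, not the plugin's actual code: the provider dictionaries, the send_request coroutine, and the constants are hypothetical stand-ins for the queue and timeouts settings above.

```python
import asyncio

# Hypothetical limits mirroring the queue/timeouts config above
QUEUE_MAX_CONCURRENT = 100   # queue.max_concurrent
QUEUE_TIMEOUT = 60           # queue.timeout
REQUEST_TIMEOUT = 30         # timeouts.request

queue_slots = asyncio.Semaphore(QUEUE_MAX_CONCURRENT)

async def handle_request(providers, payload, send_request):
    # Wait for a queue slot; reject the request if the queue stays full too long
    try:
        await asyncio.wait_for(queue_slots.acquire(), timeout=QUEUE_TIMEOUT)
    except asyncio.TimeoutError:
        raise RuntimeError("queue full: request rejected")
    try:
        for provider in providers:
            if not provider.get("healthy", True):
                continue
            try:
                # Cancel the call if the provider takes too long, then fail over
                return await asyncio.wait_for(
                    send_request(provider, payload), timeout=REQUEST_TIMEOUT
                )
            except asyncio.TimeoutError:
                provider["healthy"] = False  # mark unhealthy, try the next provider
        raise RuntimeError("all providers failed or timed out")
    finally:
        queue_slots.release()
```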
Control the maximum number of concurrent requests each provider can handle:
providers:
- name: slow_server
base_url: http://192.168.1.100:8080/v1
api_key: dummy
max_concurrent: 1 # This server can only handle 1 request at a time
- name: fast_server
base_url: https://api.fast.com/v1
api_key: ${API_KEY}
max_concurrent: 10 # This server can handle 10 concurrent requests
- name: unlimited_server
base_url: https://api.unlimited.com/v1
api_key: ${API_KEY}
# No max_concurrent means no limit for this provider
Use Cases:
- Hardware-limited servers: Set max_concurrent: 1 for servers that can't handle parallel requests
- Rate limiting: Prevent overwhelming providers with too many concurrent requests
- Resource management: Balance load across providers with different capacities
- Cost control: Limit expensive providers while allowing more requests to cheaper ones
Behavior (see the sketch after this list):
- If a provider is at max capacity, the proxy tries the next available provider
- Requests wait briefly (0.5s) for a slot before moving to the next provider
- Works with all routing strategies (weighted, round_robin, failover)
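This behavior can be pictured as a per-provider counting semaphore with a short acquisition wait. A minimal sketch, assuming hypothetical provider dictionaries and a caller-supplied send_request coroutine (not the plugin's internals):

```python
import asyncio

async def acquire_provider_slot(provider, wait=0.5):
    """Try to reserve a slot on a provider; return False if it is at capacity."""
    limit = provider.get("max_concurrent")
    if limit is None:
        return True  # no max_concurrent -> no limit for this provider
    sem = provider.setdefault("_slots", asyncio.Semaphore(limit))
    try:
        # Wait briefly for a slot, then let the caller move on to the next provider
        await asyncio.wait_for(sem.acquire(), timeout=wait)
        return True
    except asyncio.TimeoutError:
        return False

async def route(providers, payload, send_request):
    for provider in providers:
        if await acquire_provider_slot(provider):
            try:
                return await send_request(provider, payload)
            finally:
                if "_slots" in provider:
                    provider["_slots"].release()
    raise RuntimeError("no provider had free capacity")
```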
The configuration supports flexible environment variable interpolation:
# Simple substitution
api_key: ${OPENAI_API_KEY}
# With default value
base_url: ${CUSTOM_ENDPOINT:-https://api.openai.com/v1}
# Nested variables
api_key: ${ENV_PREFIX}_API_KEY
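Interpolation like this can be implemented with a small regular-expression pass over the loaded configuration strings. The sketch below is illustrative only and assumes just the ${VAR} and ${VAR:-default} forms shown above:

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")

def interpolate(value: str) -> str:
    """Replace ${VAR} and ${VAR:-default} with environment values."""
    def repl(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return os.environ.get(name, default if default is not None else "")
    return _VAR.sub(repl, value)

# Resolves to the env var if set, otherwise the default after ":-"
print(interpolate("${CUSTOM_ENDPOINT:-https://api.openai.com/v1}"))
```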
Control provider selection priority using weights:
providers:
- name: premium
base_url: https://premium.api/v1
api_key: ${PREMIUM_KEY}
weight: 5 # Gets 5x more traffic
- name: standard
base_url: https://standard.api/v1
api_key: ${STANDARD_KEY}
weight: 1 # Baseline traffic
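Weighted routing of this kind is typically a weighted random draw over the healthy, non-fallback providers, so a provider with weight 5 receives roughly five times the traffic of a weight-1 provider. A minimal sketch using hypothetical provider dictionaries:

```python
import random

def pick_weighted(providers):
    """Choose a healthy provider with probability proportional to its weight."""
    healthy = [p for p in providers
               if p.get("healthy", True) and not p.get("fallback_only")]
    if not healthy:
        raise RuntimeError("no healthy providers available")
    weights = [p.get("weight", 1) for p in healthy]
    return random.choices(healthy, weights=weights, k=1)[0]

providers = [
    {"name": "premium", "weight": 5},
    {"name": "standard", "weight": 1},
]
print(pick_weighted(providers)["name"])  # "premium" about 5 times out of 6
```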
The proxy automatically maps model names to provider-specific deployments:
providers:
- name: azure
base_url: ${AZURE_ENDPOINT}
api_key: ${AZURE_KEY}
model_map:
# Request model -> Provider deployment name
gpt-4: gpt-4-deployment-001
gpt-4-turbo: gpt-4-turbo-latest
gpt-3.5-turbo: gpt-35-turbo-deployment
- name: openai
base_url: https://api.openai.com/v1
api_key: ${OPENAI_API_KEY}
# No model_map needed - uses model names as-is
With this configuration, requests for proxy-prefixed models resolve as follows (see the sketch after these examples):
- Request for "proxy-gpt-4" → Azure uses "gpt-4-deployment-001", OpenAI uses "gpt-4"
- Request for "proxy-gpt-3.5-turbo" → Azure uses "gpt-35-turbo-deployment", OpenAI uses "gpt-3.5-turbo"
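Conceptually, the mapping is a per-provider dictionary lookup applied after the proxy- prefix is stripped, with unknown models passed through unchanged. An illustrative sketch (the provider dictionaries are stand-ins, not the plugin's data model):

```python
def resolve_model(requested: str, provider: dict) -> str:
    """Strip the proxy- prefix and apply the provider's model_map, if any."""
    model = requested.removeprefix("proxy-")
    return provider.get("model_map", {}).get(model, model)

azure = {"name": "azure", "model_map": {"gpt-4": "gpt-4-deployment-001"}}
openai_provider = {"name": "openai"}  # no model_map -> names pass through

print(resolve_model("proxy-gpt-4", azure))            # gpt-4-deployment-001
print(resolve_model("proxy-gpt-4", openai_provider))  # gpt-4
```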
Set up primary and backup providers:
providers:
# Primary providers (normal traffic)
- name: primary_1
base_url: https://api1.com/v1
api_key: ${KEY_1}
weight: 3
- name: primary_2
base_url: https://api2.com/v1
api_key: ${KEY_2}
weight: 2
# Backup provider (only on failure)
- name: emergency_backup
base_url: https://backup.api/v1
api_key: ${BACKUP_KEY}
fallback_only: true # Only used when all primary providers fail
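The effect of fallback_only can be sketched as a two-pass ordering: primary providers come first (here sorted by weight), and fallback-only providers are tried only after every primary has failed or is unhealthy. Illustrative only, with hypothetical provider dictionaries:

```python
def candidate_order(providers):
    """Primaries first (by weight, descending), fallback-only providers last."""
    primaries = [p for p in providers if not p.get("fallback_only")]
    fallbacks = [p for p in providers if p.get("fallback_only")]
    primaries.sort(key=lambda p: p.get("weight", 1), reverse=True)
    return primaries + fallbacks

providers = [
    {"name": "primary_1", "weight": 3},
    {"name": "emergency_backup", "fallback_only": True},
    {"name": "primary_2", "weight": 2},
]
print([p["name"] for p in candidate_order(providers)])
# ['primary_1', 'primary_2', 'emergency_backup']
```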
Enable detailed logging for debugging:
monitoring:
log_level: DEBUG # Options: DEBUG, INFO, WARNING, ERROR
track_latency: true
track_errors: true
The proxy automatically monitors provider health (see the sketch after this list). Failed providers are:
- Marked as unhealthy after errors
- Excluded from routing
- Periodically rechecked for recovery
- Automatically restored when healthy
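A health checker like this is typically a small periodic task that probes each provider and flips a healthy flag. The sketch below shows the general shape; the use of httpx and the provider's /models endpoint as the probe are assumptions, and the plugin's actual implementation may differ.

```python
import asyncio
import httpx  # illustrative HTTP client choice

async def check_provider(provider, timeout=5):
    """Probe the provider's /models endpoint and update its healthy flag."""
    url = provider["base_url"].rstrip("/") + "/models"
    headers = {"Authorization": f"Bearer {provider.get('api_key', '')}"}
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            resp = await client.get(url, headers=headers)
        provider["healthy"] = resp.status_code == 200
    except httpx.HTTPError:
        provider["healthy"] = False

async def health_loop(providers, interval=30):
    """Recheck all providers every `interval` seconds so failed ones can recover."""
    while True:
        await asyncio.gather(*(check_provider(p) for p in providers))
        await asyncio.sleep(interval)
```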
When track_latency is enabled, the proxy logs (see the sketch below):
- Request latency per provider
- Success/failure rates
- Provider selection patterns
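Statistics like these can be collected with a timer around each request and a couple of per-provider counters. The structure below is an illustrative assumption, not the plugin's internal format:

```python
import time
from collections import defaultdict

stats = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_sum": 0.0})

def record(provider_name: str, start: float, ok: bool) -> None:
    """Accumulate latency and success/failure counts for one provider."""
    entry = stats[provider_name]
    entry["requests"] += 1
    entry["latency_sum"] += time.perf_counter() - start
    if not ok:
        entry["errors"] += 1

def summary(provider_name: str) -> dict:
    entry = stats[provider_name]
    avg = entry["latency_sum"] / entry["requests"] if entry["requests"] else 0.0
    return {"avg_latency_s": round(avg, 3),
            "error_rate": entry["errors"] / max(entry["requests"], 1)}
```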
- Check your API keys are correctly set
- Verify base URLs are accessible
- Review health check logs for specific errors
- Ensure at least one provider is configured
- Check provider-specific API limits
- Verify model names in model_map
- Test the provider's endpoint directly
- Review error logs for details
- Ensure using correct extra_body format
- Verify approach/plugin name is correct
- Check that target approach is installed
Enable debug logging to see detailed routing decisions:
export OPTILLM_LOG_LEVEL=DEBUG
python optillm.py
- Multiple API Keys: Use different API keys per provider for better rate limit distribution
- Weight Tuning: Adjust weights based on provider performance and cost
- Health Intervals: Balance between quick failure detection (short) and API overhead (long)
- Fallback Providers: Always configure at least one fallback provider
- Environment Security: Never commit API keys; always use environment variables
routing:
strategy: weighted # Better distribution than round_robin
health_check:
interval: 60 # Reduce health check frequency
timeout: 10 # Allow longer timeout for stability
routing:
strategy: failover # Always use the first (fastest) provider
health_check:
interval: 10 # Quick failure detection
timeout: 2 # Fast timeout
providers:
- name: cheap_provider
weight: 10 # Prefer cheaper provider
- name: expensive_provider
weight: 1 # Minimize usage
fallback_only: true # Or only use on failure
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="dummy" # Can be any string when using proxy
)
# Method 1: Server started with --approach proxy (recommended)
# Just make normal requests - proxy handles everything!
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello"}]
)
# Method 2: Use proxy with model prefix
response = client.chat.completions.create(
model="proxy-gpt-4", # Use "proxy-" prefix
messages=[{"role": "user", "content": "Hello"}]
)
# Method 3: Use extra_body
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello"}],
extra_body={
"optillm_approach": "proxy"
}
)
# Method 4: Proxy wrapping another approach
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello"}],
extra_body={
"optillm_approach": "proxy",
"proxy_wrap": "moa"
}
)
from langchain.llms import OpenAI
# If server started with --approach proxy (recommended)
llm = OpenAI(
openai_api_base="http://localhost:8000/v1",
model_name="gpt-4" # Proxy handles routing automatically
)
# Or use proxy with model prefix
llm = OpenAI(
openai_api_base="http://localhost:8000/v1",
model_name="proxy-gpt-4" # Use "proxy-" prefix
)
response = llm("What is the meaning of life?")
The proxy works with any OpenAI-compatible API:
- ✅ OpenAI
- ✅ Azure OpenAI
- ✅ Anthropic (via LiteLLM)
- ✅ Google AI (via LiteLLM)
- ✅ Cohere (via LiteLLM)
- ✅ Together AI
- ✅ Anyscale
- ✅ Local models (Ollama, LM Studio, llama.cpp)
- ✅ Any OpenAI-compatible endpoint
providers:
- name: openai
base_url: https://api.openai.com/v1
api_key: ${OPENAI_API_KEY}
weight: 3
- name: azure
base_url: ${AZURE_ENDPOINT}
api_key: ${AZURE_API_KEY}
weight: 2
model_map:
gpt-4: gpt-4-deployment
- name: together
base_url: https://api.together.xyz/v1
api_key: ${TOGETHER_API_KEY}
weight: 1
providers:
- name: local_primary
base_url: http://localhost:8080/v1
api_key: local
weight: 1
- name: local_backup
base_url: http://localhost:8081/v1
api_key: local
weight: 1
routing:
strategy: round_robin
health_check:
enabled: false # Disable for local dev
providers:
- name: prod_primary
base_url: https://api.openai.com/v1
api_key: ${PROD_OPENAI_KEY_1}
weight: 5
- name: prod_secondary
base_url: https://api.openai.com/v1
api_key: ${PROD_OPENAI_KEY_2}
weight: 3
- name: prod_fallback
base_url: ${FALLBACK_ENDPOINT}
api_key: ${FALLBACK_KEY}
weight: 1
fallback_only: true
routing:
strategy: weighted
health_check:
enabled: true
interval: 30
timeout: 5
monitoring:
log_level: WARNING
track_latency: true
track_errors: true
To add new routing strategies or features:
- Implement the new strategy in routing.py (a hypothetical sketch follows this list)
- Add the strategy to RouterFactory
- Update documentation
- Add tests
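As a rough illustration of the first step, a strategy usually reduces to one selection function over the provider list. The class shape and the registration call below are hypothetical; check routing.py and RouterFactory for the real interface before following this pattern.

```python
import random

class RandomStrategy:
    """Hypothetical strategy: pick any healthy provider uniformly at random."""

    name = "random"

    def select(self, providers):
        healthy = [p for p in providers if p.get("healthy", True)]
        if not healthy:
            raise RuntimeError("no healthy providers")
        return random.choice(healthy)

# Hypothetical registration call -- the actual RouterFactory API may differ:
# RouterFactory.register(RandomStrategy.name, RandomStrategy)
```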
Part of OptiLLM - see main project license.