3. **Automatic Defaults**: If you don't specify `reasoning_effort`, LiteLLM automatically sets `thinking_level="low"` for optimal performance.
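As a quick illustration of the default, the following sketch omits `reasoning_effort` entirely and lets LiteLLM fill in `thinking_level="low"` (assumes `GEMINI_API_KEY` is set in your environment):

```python
from litellm import completion

# No reasoning_effort is passed, so LiteLLM sends thinking_level="low" by default
response = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in one sentence."}],
)
print(response.choices[0].message.content)
```
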
## Cost Tracking: Prompt Caching & Context Window

LiteLLM provides comprehensive cost tracking for Gemini 3 Pro Preview, including support for prompt caching and tiered pricing based on context window size.

### Prompt Caching Cost Tracking

Gemini 3 supports prompt caching, which allows you to cache frequently used prompt prefixes to reduce costs. LiteLLM automatically tracks and calculates costs for:

- **Cache Hit Tokens**: Tokens that are read from cache (charged at a lower rate)
- **Cache Creation Tokens**: Tokens that are written to cache (one-time cost)
- **Text Tokens**: Regular prompt tokens that are processed normally

#### How It Works

LiteLLM extracts caching information from the `prompt_tokens_details` field in the usage object:

```python
{
    "usage": {
        "prompt_tokens": 50000,
        "completion_tokens": 1000,
        "total_tokens": 51000,
        "prompt_tokens_details": {
            "cached_tokens": 30000,         # Cache hit tokens
            "cache_creation_tokens": 5000,  # Tokens written to cache
            "text_tokens": 15000            # Regular processed tokens
        }
    }
}
```
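To see this breakdown (and the resulting cost) for your own calls, you can inspect the usage object on the response and pass the response to `completion_cost`. A minimal sketch, assuming `GEMINI_API_KEY` is set and a cache-eligible prompt:

```python
from litellm import completion, completion_cost

response = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token breakdown reported by the API (may be empty if no caching occurred)
print(response.usage.prompt_tokens_details)

# Cost calculated by LiteLLM, accounting for cached vs. regular prompt tokens
print(completion_cost(completion_response=response))
```
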
### Context Window Tiered Pricing

Gemini 3 Pro Preview supports up to 1M tokens of context, with tiered pricing that automatically applies when your prompt exceeds 200k tokens.

#### Automatic Tier Detection

LiteLLM automatically detects when your prompt exceeds the 200k token threshold and applies the appropriate tiered pricing:

```python
from litellm import completion, completion_cost

# Example: Small prompt (< 200k tokens)
response_small = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Uses base pricing: $0.000002/input token, $0.000012/output token
cost_small = completion_cost(completion_response=response_small)
```
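For intuition on what crossing the threshold means, here is a back-of-the-envelope sketch using the figures quoted above (base rates of $0.000002/input token and $0.000012/output token, with 2x input and 1.5x output multipliers once the prompt exceeds 200k tokens). The `estimated_cost` helper is illustrative only and not part of LiteLLM:

```python
# Illustrative arithmetic only -- rates are the figures quoted in this doc, not authoritative pricing
BASE_INPUT_COST = 0.000002   # $/input token below the 200k threshold
BASE_OUTPUT_COST = 0.000012  # $/output token below the 200k threshold
TIER_THRESHOLD = 200_000     # prompt tokens

def estimated_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough estimate applying the >200k tier (2x input, 1.5x output) to the whole request."""
    if prompt_tokens > TIER_THRESHOLD:
        return prompt_tokens * BASE_INPUT_COST * 2 + completion_tokens * BASE_OUTPUT_COST * 1.5
    return prompt_tokens * BASE_INPUT_COST + completion_tokens * BASE_OUTPUT_COST

print(estimated_cost(50_000, 1_000))   # 50_000 * 2e-6 + 1_000 * 1.2e-5  = $0.112
print(estimated_cost(300_000, 1_000))  # 300_000 * 4e-6 + 1_000 * 1.8e-5 = $1.218
```
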
### Cost Optimization Best Practices

1. **Use Prompt Caching**: For repeated prompt prefixes, enable caching to reduce costs by up to 90% for cached portions
2. **Monitor Context Size**: Be aware that prompts above 200k tokens use tiered pricing (2x for input, 1.5x for output)
3. **Cache Management**: Cache creation tokens are charged once when writing to cache; subsequent reads are much cheaper
4. **Track Usage**: Use LiteLLM's built-in cost tracking to monitor spending across different token types

### Integration with LiteLLM Proxy

When using LiteLLM Proxy, all cost tracking is automatically logged and available through:

- **Usage Logs**: Detailed token and cost breakdowns in proxy logs
- **Budget Management**: Set budgets and alerts based on actual usage
- **Analytics Dashboard**: View cost trends and breakdowns by token type

```yaml
# config.yaml
model_list:
  - model_name: gemini-3-pro-preview
    litellm_params:
      model: gemini/gemini-3-pro-preview
      api_key: os.environ/GEMINI_API_KEY

litellm_settings:
  # Enable detailed cost tracking
  success_callback: ["langfuse"]  # or your preferred logging service
```
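Once the proxy is running with this config, you can send requests through any OpenAI-compatible client and the proxy will log cost for each call. A minimal sketch, where the `base_url` and `sk-1234` key are placeholders for your local proxy deployment:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the LiteLLM proxy (placeholder URL and key)
client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

response = client.chat.completions.create(
    model="gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.usage)
```
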
## Using with Claude Code CLI
You can use `gemini-3-pro-preview` with **Claude Code CLI**, Anthropic's command-line interface. This allows you to use Gemini 3 Pro Preview with Claude Code's native syntax and workflows.

If you run into issues:

- Ensure `GEMINI_API_KEY` is set correctly
- Check LiteLLM proxy logs for detailed error messages

## Responses API Support
LiteLLM fully supports the OpenAI Responses API for Gemini 3 Pro Preview, including both streaming and non-streaming modes. The Responses API provides a structured way to handle multi-turn conversations with function calling, and LiteLLM automatically preserves thought signatures throughout the conversation.

### Example: Using Responses API with Gemini 3

<Tabs>
<TabItem value="sdk" label="Non-Streaming">

```python
from openai import OpenAI
import json

client = OpenAI()

# 1. Define a list of callable tools for the model
tools = [
    {
        "type": "function",
        "name": "get_horoscope",
        "description": "Get today's horoscope for an astrological sign.",
        "parameters": {
            "type": "object",
            "properties": {
                "sign": {
                    "type": "string",
                    "description": "An astrological sign like Taurus or Aquarius",
                },
            },
            "required": ["sign"],
        },
    },
]

def get_horoscope(sign):
    return f"{sign}: Next Tuesday you will befriend a baby otter."

# Create a running input list we will add to over time
input_list = [
    {"role": "user", "content": "What is my horoscope? I am an Aquarius."}
]

# 2. Prompt the model with tools defined
response = client.responses.create(
    model="gemini-3-pro-preview",
    tools=tools,
    input=input_list,
)

# Save function call outputs for subsequent requests