
Commit 5abe7d5

Added support for thought signature for gemini 3 in responses api (#16872)

* Added support for thought signature for gemini 3
* Update docs with all supported endpoints and cost tracking
1 parent 18fceff commit 5abe7d5

File tree

4 files changed: +557 -39 lines changed


docs/my-website/blog/gemini_3/index.md

Lines changed: 284 additions & 2 deletions
@@ -5,7 +5,7 @@ date: 2025-11-19T10:00:00
 authors:
   - name: Sameer Kankute
     title: SWE @ LiteLLM (LLM Translation)
-    url: https://www.linkedin.com/in/krish-d/
+    url: https://www.linkedin.com/in/sameer-kankute/
     image_url: https://media.licdn.com/dms/image/v2/D4D03AQHB_loQYd5gjg/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1719137160975?e=1765411200&v=beta&t=c8396f--_lH6Fb_pVvx_jGholPfcl0bvwmNynbNdnII
   - name: Krrish Dholakia
     title: CEO, LiteLLM
@@ -88,9 +88,11 @@ curl http://0.0.0.0:4000/v1/chat/completions \
LiteLLM provides **full end-to-end support** for Gemini 3 Pro Preview on:

- `/v1/chat/completions` - OpenAI-compatible chat completions endpoint
- `/v1/responses` - OpenAI Responses API endpoint (streaming and non-streaming)
- [`/v1/messages`](../../docs/anthropic_unified) - Anthropic-compatible messages endpoint
- `/v1/generateContent` - [Google Gemini API](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini#rest)-compatible endpoint (for code, see: `client.models.generate_content(...)`)

-Both endpoints support:
+All endpoints support:

- Streaming and non-streaming responses
- Function calling with thought signatures
- Multi-turn conversations
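
As a quick illustration of the endpoints listed above, here is a minimal sketch that sends the same prompt through `/v1/chat/completions` and `/v1/responses` using the OpenAI SDK pointed at a LiteLLM deployment. The base URL matches the proxy address used elsewhere in this post; the API key is a placeholder for your own virtual key.

```python
from openai import OpenAI

# Placeholder proxy URL and key - substitute your own LiteLLM deployment values
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234")

# /v1/chat/completions - OpenAI-compatible chat completions endpoint
chat = client.chat.completions.create(
    model="gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)

# /v1/responses - OpenAI Responses API endpoint
resp = client.responses.create(
    model="gemini-3-pro-preview",
    input="Say hello in one sentence.",
)
print(resp.output_text)
```
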
@@ -548,6 +550,129 @@ curl http://localhost:4000/v1/chat/completions \

3. **Automatic Defaults**: If you don't specify `reasoning_effort`, LiteLLM automatically sets `thinking_level="low"` for optimal performance.

## Cost Tracking: Prompt Caching & Context Window

LiteLLM provides comprehensive cost tracking for Gemini 3 Pro Preview, including support for prompt caching and tiered pricing based on context window size.

### Prompt Caching Cost Tracking

Gemini 3 supports prompt caching, which allows you to cache frequently used prompt prefixes to reduce costs. LiteLLM automatically tracks and calculates costs for:

- **Cache Hit Tokens**: Tokens that are read from cache (charged at a lower rate)
- **Cache Creation Tokens**: Tokens that are written to cache (one-time cost)
- **Text Tokens**: Regular prompt tokens that are processed normally

#### How It Works

LiteLLM extracts caching information from the `prompt_tokens_details` field in the usage object:

```python
{
    "usage": {
        "prompt_tokens": 50000,
        "completion_tokens": 1000,
        "total_tokens": 51000,
        "prompt_tokens_details": {
            "cached_tokens": 30000,          # Cache hit tokens
            "cache_creation_tokens": 5000,   # Tokens written to cache
            "text_tokens": 15000             # Regular processed tokens
        }
    }
}
```

### Context Window Tiered Pricing

Gemini 3 Pro Preview supports up to 1M tokens of context, with tiered pricing that automatically applies when your prompt exceeds 200k tokens.

#### Automatic Tier Detection

LiteLLM automatically detects when your prompt exceeds the 200k token threshold and applies the appropriate tiered pricing:

```python
from litellm import completion, completion_cost

# Example: Small prompt (< 200k tokens)
response_small = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Uses base pricing: $0.000002/input token, $0.000012/output token

# Example: Large prompt (> 200k tokens)
response_large = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "..." * 250000}]  # ~250k tokens
)
# Automatically uses tiered pricing: $0.000004/input token, $0.000018/output token
```

#### Cost Breakdown

The cost calculation includes the following components (a short worked sketch follows the list):

1. **Text Processing Cost**: Regular tokens processed at base or tiered rate
2. **Cache Read Cost**: Cached tokens read at discounted rate
3. **Cache Creation Cost**: One-time cost for writing tokens to cache (applies tiered rate if above 200k)
4. **Output Cost**: Generated tokens at base or tiered rate
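
To make the four components concrete, here is a small worked sketch using the usage object shown earlier. The base rates come from this post; the cache-read and cache-creation rates are placeholders rather than LiteLLM's actual pricing, and in practice `litellm.completion_cost()` does this calculation for you from the model cost map.

```python
# Hypothetical worked example of the four cost components.
# Base rates below the 200k threshold are taken from this post; the cache-read
# and cache-creation rates are placeholders, not LiteLLM's actual pricing.
BASE_INPUT_RATE = 0.000002                  # $/input token (< 200k context)
BASE_OUTPUT_RATE = 0.000012                 # $/output token (< 200k context)
CACHE_READ_RATE = BASE_INPUT_RATE * 0.25    # assumed cache-read discount
CACHE_WRITE_RATE = BASE_INPUT_RATE          # assumed cache-creation rate

usage = {
    "completion_tokens": 1000,
    "prompt_tokens_details": {
        "cached_tokens": 30000,
        "cache_creation_tokens": 5000,
        "text_tokens": 15000,
    },
}

details = usage["prompt_tokens_details"]
text_cost = details["text_tokens"] * BASE_INPUT_RATE                     # 1. text processing
cache_read_cost = details["cached_tokens"] * CACHE_READ_RATE             # 2. cache read
cache_write_cost = details["cache_creation_tokens"] * CACHE_WRITE_RATE   # 3. cache creation
output_cost = usage["completion_tokens"] * BASE_OUTPUT_RATE              # 4. output

total = text_cost + cache_read_cost + cache_write_cost + output_cost
print(f"Total: ${total:.6f}")  # in practice, use litellm.completion_cost() instead
```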

### Example: Viewing Cost Breakdown

You can view the detailed cost breakdown using LiteLLM's cost tracking:

```python
from litellm import completion, completion_cost

response = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Explain prompt caching"}],
    caching=True  # Enable prompt caching
)

# Get total cost
total_cost = completion_cost(completion_response=response)
print(f"Total cost: ${total_cost:.6f}")

# Access usage details
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")

# Access caching details
if usage.prompt_tokens_details:
    print(f"Cache hit tokens: {usage.prompt_tokens_details.cached_tokens}")
    print(f"Cache creation tokens: {usage.prompt_tokens_details.cache_creation_tokens}")
    print(f"Text tokens: {usage.prompt_tokens_details.text_tokens}")
```

### Cost Optimization Tips

1. **Use Prompt Caching**: For repeated prompt prefixes, enable caching to reduce costs by up to 90% for cached portions
2. **Monitor Context Size**: Be aware that prompts above 200k tokens use tiered pricing (2x for input, 1.5x for output); see the token-counting sketch below
3. **Cache Management**: Cache creation tokens are charged once when writing to cache, then subsequent reads are much cheaper
4. **Track Usage**: Use LiteLLM's built-in cost tracking to monitor spending across different token types
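
For tip 2, one way to check which pricing tier a request will fall into is to estimate the prompt size locally before sending it. A minimal sketch using `litellm.token_counter` (the count is an estimate, and the 200k threshold is the tier boundary described above):

```python
import litellm

TIER_THRESHOLD = 200_000  # tokens; tiered pricing applies above this

messages = [{"role": "user", "content": "..." * 250000}]

# Estimate prompt tokens locally before making the request
prompt_tokens = litellm.token_counter(
    model="gemini/gemini-3-pro-preview",
    messages=messages,
)

if prompt_tokens > TIER_THRESHOLD:
    print(f"~{prompt_tokens} prompt tokens: tiered (>200k) pricing will apply")
else:
    print(f"~{prompt_tokens} prompt tokens: base pricing will apply")
```
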
### Integration with LiteLLM Proxy

When using LiteLLM Proxy, all cost tracking is automatically logged and available through:

- **Usage Logs**: Detailed token and cost breakdowns in proxy logs
- **Budget Management**: Set budgets and alerts based on actual usage
- **Analytics Dashboard**: View cost trends and breakdowns by token type

```yaml
# config.yaml
model_list:
  - model_name: gemini-3-pro-preview
    litellm_params:
      model: gemini/gemini-3-pro-preview
      api_key: os.environ/GEMINI_API_KEY

litellm_settings:
  # Enable detailed cost tracking
  success_callback: ["langfuse"] # or your preferred logging service
```
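
If you also want per-request cost on the client side, one option is to read the proxy's response headers. This is a minimal sketch assuming your LiteLLM Proxy version returns the `x-litellm-response-cost` header; the base URL and key are placeholders for your own deployment.

```python
from openai import OpenAI

# Placeholder proxy URL and virtual key for the config above
client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

# Use the raw response wrapper so we can inspect proxy headers
raw = client.chat.completions.with_raw_response.create(
    model="gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Explain prompt caching"}],
)

# Header name assumes the LiteLLM cost-tracking response header is enabled
print("Response cost header:", raw.headers.get("x-litellm-response-cost"))

completion = raw.parse()  # the regular ChatCompletion object
print(completion.choices[0].message.content[:100])
```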

## Using with Claude Code CLI

You can use `gemini-3-pro-preview` with **Claude Code CLI** - Anthropic's command-line interface. This allows you to use Gemini 3 Pro Preview with Claude Code's native syntax and workflows.

@@ -628,6 +753,162 @@ $ claude --model gemini-3-pro-preview
- Ensure `GEMINI_API_KEY` is set correctly
- Check LiteLLM proxy logs for detailed error messages

## Responses API Support

LiteLLM fully supports the OpenAI Responses API for Gemini 3 Pro Preview, including both streaming and non-streaming modes. The Responses API provides a structured way to handle multi-turn conversations with function calling, and LiteLLM automatically preserves thought signatures throughout the conversation.

### Example: Using Responses API with Gemini 3

<Tabs>
<TabItem value="sdk" label="Non-Streaming">

```python
from openai import OpenAI
import json

client = OpenAI()

# 1. Define a list of callable tools for the model
tools = [
    {
        "type": "function",
        "name": "get_horoscope",
        "description": "Get today's horoscope for an astrological sign.",
        "parameters": {
            "type": "object",
            "properties": {
                "sign": {
                    "type": "string",
                    "description": "An astrological sign like Taurus or Aquarius",
                },
            },
            "required": ["sign"],
        },
    },
]

def get_horoscope(sign):
    return f"{sign}: Next Tuesday you will befriend a baby otter."

# Create a running input list we will add to over time
input_list = [
    {"role": "user", "content": "What is my horoscope? I am an Aquarius."}
]

# 2. Prompt the model with tools defined
response = client.responses.create(
    model="gemini-3-pro-preview",
    tools=tools,
    input=input_list,
)

# Save function call outputs for subsequent requests
input_list += response.output

for item in response.output:
    if item.type == "function_call":
        if item.name == "get_horoscope":
            # 3. Execute the function logic for get_horoscope
            horoscope = get_horoscope(json.loads(item.arguments)["sign"])

            # 4. Provide function call results to the model
            input_list.append({
                "type": "function_call_output",
                "call_id": item.call_id,
                "output": json.dumps({
                    "horoscope": horoscope
                })
            })

print("Final input:")
print(input_list)

response = client.responses.create(
    model="gemini-3-pro-preview",
    instructions="Respond only with a horoscope generated by a tool.",
    tools=tools,
    input=input_list,
)

# 5. The model should be able to give a response!
print("Final output:")
print(response.model_dump_json(indent=2))
print("\n" + response.output_text)
```

**Key Points:**
- ✅ Thought signatures are automatically preserved in function calls
- ✅ Works seamlessly with multi-turn conversations
- ✅ All Gemini 3-specific features are fully supported

</TabItem>
<TabItem value="streaming" label="Streaming">

```python
from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "name": "get_horoscope",
        "description": "Get today's horoscope for an astrological sign.",
        "parameters": {
            "type": "object",
            "properties": {
                "sign": {
                    "type": "string",
                    "description": "An astrological sign like Taurus or Aquarius",
                },
            },
            "required": ["sign"],
        },
    },
]

def get_horoscope(sign):
    return f"{sign}: Next Tuesday you will befriend a baby otter."

input_list = [
    {"role": "user", "content": "What is my horoscope? I am an Aquarius."}
]

# Streaming mode
response = client.responses.create(
    model="gemini-3-pro-preview",
    tools=tools,
    input=input_list,
    stream=True,
)

# Collect all chunks
chunks = []
for chunk in response:
    chunks.append(chunk)
    # Process streaming chunks as they arrive
    print(chunk)

# Thought signatures are automatically preserved in streaming mode
```

**Key Points:**
- ✅ Streaming mode fully supported
- ✅ Thought signatures preserved across streaming chunks
- ✅ Real-time processing of function calls and responses

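Continuing from the streaming example above (reusing `client`, `tools`, and `input_list`), here is a minimal sketch of dispatching on event types rather than printing raw chunks. The event type names (`response.output_text.delta`, `response.output_item.done`, `response.completed`) follow the OpenAI Responses streaming spec; adjust if your SDK version differs.

```python
def consume_stream(stream):
    """Dispatch on Responses API streaming event types instead of printing raw chunks."""
    collected_text = []
    for event in stream:
        if event.type == "response.output_text.delta":
            collected_text.append(event.delta)                 # incremental text
        elif event.type == "response.output_item.done":
            print("Completed output item:", event.item.type)   # e.g. "function_call"
        elif event.type == "response.completed":
            print("Final text:", "".join(collected_text))

# Reusing `client`, `tools`, and `input_list` from the streaming example above
consume_stream(
    client.responses.create(
        model="gemini-3-pro-preview",
        tools=tools,
        input=input_list,
        stream=True,
    )
)
```
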
</TabItem>
</Tabs>

### Responses API Benefits

- ✅ **Structured Output**: Responses API provides a clear structure for handling function calls and multi-turn conversations
- ✅ **Thought Signature Preservation**: LiteLLM automatically preserves thought signatures in both streaming and non-streaming modes
- ✅ **Seamless Integration**: Works with existing OpenAI SDK patterns
- ✅ **Full Feature Support**: All Gemini 3 features (thought signatures, function calling, reasoning) are fully supported

## Best Practices

#### 1. Always Include Thought Signatures in Conversation History
@@ -665,6 +946,7 @@ When switching from non-Gemini-3 to Gemini-3:
- ✅ No manual intervention needed
- ✅ Conversation history continues seamlessly

## Troubleshooting

#### Issue: Missing Thought Signatures

0 commit comments
