3. **Automatic Defaults**: If you don't specify `reasoning_effort`, LiteLLM automatically sets `thinking_level="low"` for optimal performance.
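As a quick illustration of the default, the following sketch omits `reasoning_effort` entirely and lets LiteLLM fill in `thinking_level="low"` (assumes `GEMINI_API_KEY` is set in your environment):

```python
from litellm import completion

# No reasoning_effort is passed, so LiteLLM sends thinking_level="low" by default
response = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in one sentence."}],
)
print(response.choices[0].message.content)
```
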
## Cost Tracking: Prompt Caching & Context Window

LiteLLM provides comprehensive cost tracking for Gemini 3 Pro Preview, including support for prompt caching and tiered pricing based on context window size.

### Prompt Caching Cost Tracking

Gemini 3 supports prompt caching, which allows you to cache frequently used prompt prefixes to reduce costs. LiteLLM automatically tracks and calculates costs for:

- **Cache Hit Tokens**: Tokens that are read from cache (charged at a lower rate)
- **Cache Creation Tokens**: Tokens that are written to cache (one-time cost)
- **Text Tokens**: Regular prompt tokens that are processed normally

#### How It Works

LiteLLM extracts caching information from the `prompt_tokens_details` field in the usage object:

```python
{
    "usage": {
        "prompt_tokens": 50000,
        "completion_tokens": 1000,
        "total_tokens": 51000,
        "prompt_tokens_details": {
            "cached_tokens": 30000,         # Cache hit tokens
            "cache_creation_tokens": 5000,  # Tokens written to cache
            "text_tokens": 15000            # Regular processed tokens
        }
    }
}
```
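To see this breakdown (and the resulting cost) for your own calls, you can inspect the usage object on the response and pass the response to `completion_cost`. A minimal sketch, assuming `GEMINI_API_KEY` is set and a cache-eligible prompt:

```python
from litellm import completion, completion_cost

response = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token breakdown reported by the API (may be empty if no caching occurred)
print(response.usage.prompt_tokens_details)

# Cost calculated by LiteLLM, accounting for cached vs. regular prompt tokens
print(completion_cost(completion_response=response))
```
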
### Context Window Tiered Pricing

Gemini 3 Pro Preview supports up to 1M tokens of context, with tiered pricing that automatically applies when your prompt exceeds 200k tokens.

#### Automatic Tier Detection

LiteLLM automatically detects when your prompt exceeds the 200k token threshold and applies the appropriate tiered pricing:

```python
from litellm import completion, completion_cost

# Example: Small prompt (< 200k tokens)
response_small = completion(
    model="gemini/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Uses base pricing: $0.000002/input token, $0.000012/output token
cost_small = completion_cost(completion_response=response_small)
```
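For intuition on what crossing the threshold means, here is a back-of-the-envelope sketch using the figures quoted above (base rates of $0.000002/input token and $0.000012/output token, with 2x input and 1.5x output multipliers once the prompt exceeds 200k tokens). The `estimated_cost` helper is illustrative only and not part of LiteLLM:

```python
# Illustrative arithmetic only -- rates are the figures quoted in this doc, not authoritative pricing
BASE_INPUT_COST = 0.000002   # $/input token below the 200k threshold
BASE_OUTPUT_COST = 0.000012  # $/output token below the 200k threshold
TIER_THRESHOLD = 200_000     # prompt tokens

def estimated_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough estimate applying the >200k tier (2x input, 1.5x output) to the whole request."""
    if prompt_tokens > TIER_THRESHOLD:
        return prompt_tokens * BASE_INPUT_COST * 2 + completion_tokens * BASE_OUTPUT_COST * 1.5
    return prompt_tokens * BASE_INPUT_COST + completion_tokens * BASE_OUTPUT_COST

print(estimated_cost(50_000, 1_000))   # 50_000 * 2e-6 + 1_000 * 1.2e-5  = $0.112
print(estimated_cost(300_000, 1_000))  # 300_000 * 4e-6 + 1_000 * 1.8e-5 = $1.218
```
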
### Cost Optimization Best Practices

1. **Use Prompt Caching**: For repeated prompt prefixes, enable caching to reduce costs by up to 90% for cached portions
2. **Monitor Context Size**: Be aware that prompts above 200k tokens use tiered pricing (2x for input, 1.5x for output)
3. **Cache Management**: Cache creation tokens are charged once when writing to cache; subsequent reads are much cheaper
4. **Track Usage**: Use LiteLLM's built-in cost tracking to monitor spending across different token types

### Integration with LiteLLM Proxy

When using LiteLLM Proxy, all cost tracking is automatically logged and available through:

- **Usage Logs**: Detailed token and cost breakdowns in proxy logs
- **Budget Management**: Set budgets and alerts based on actual usage
- **Analytics Dashboard**: View cost trends and breakdowns by token type

```yaml
# config.yaml
model_list:
  - model_name: gemini-3-pro-preview
    litellm_params:
      model: gemini/gemini-3-pro-preview
      api_key: os.environ/GEMINI_API_KEY

litellm_settings:
  # Enable detailed cost tracking
  success_callback: ["langfuse"]  # or your preferred logging service
```
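Once the proxy is running with this config, you can send requests through any OpenAI-compatible client and the proxy will log cost for each call. A minimal sketch, where the `base_url` and `sk-1234` key are placeholders for your local proxy deployment:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the LiteLLM proxy (placeholder URL and key)
client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

response = client.chat.completions.create(
    model="gemini-3-pro-preview",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.usage)
```
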
## Using with Claude Code CLI
You can use `gemini-3-pro-preview` with **Claude Code CLI**, Anthropic's command-line interface. This allows you to use Gemini 3 Pro Preview with Claude Code's native syntax and workflows.

If you run into issues:

- Ensure `GEMINI_API_KEY` is set correctly
- Check LiteLLM proxy logs for detailed error messages

## Responses API Support
LiteLLM fully supports the OpenAI Responses API for Gemini 3 Pro Preview, including both streaming and non-streaming modes. The Responses API provides a structured way to handle multi-turn conversations with function calling, and LiteLLM automatically preserves thought signatures throughout the conversation.

### Example: Using Responses API with Gemini 3

<Tabs>
<TabItem value="sdk" label="Non-Streaming">

```python
from openai import OpenAI
import json

client = OpenAI()

# 1. Define a list of callable tools for the model
tools = [
    {
        "type": "function",
        "name": "get_horoscope",
        "description": "Get today's horoscope for an astrological sign.",
        "parameters": {
            "type": "object",
            "properties": {
                "sign": {
                    "type": "string",
                    "description": "An astrological sign like Taurus or Aquarius",
                },
            },
            "required": ["sign"],
        },
    },
]

def get_horoscope(sign):
    return f"{sign}: Next Tuesday you will befriend a baby otter."

# Create a running input list we will add to over time
input_list = [
    {"role": "user", "content": "What is my horoscope? I am an Aquarius."}
]

# 2. Prompt the model with tools defined
response = client.responses.create(
    model="gemini-3-pro-preview",
    tools=tools,
    input=input_list,
)

# Save function call outputs for subsequent requests