Ultra-low-latency LLM gateway with microsecond caching, dynamic routing, budgets, analytics, and forecasting.
Updated Apr 2, 2026 · Go
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
This app leverages Semantic Caching to minimize inference latency and reduce API costs by reusing semantically similar prompt responses.
Semantic caching demo with real-time streaming and a cost & sizing calculator, powered by Azure Managed Redis and Azure OpenAI.
Rust Local Token Compression Proxy for coding agents, built solo for GenAI Genesis 2026. 🏆 1st Google Sustainability Hack
Evaluate how a semantic cache performs on your dataset by computing key KPIs over a threshold sweep and producing plots and CSVs.
Semantic caching for LLM responses using Redis Vector DB, LangChain, and HuggingFace embeddings. Parses PDFs, generates FAQs with Groq, and serves similarity-based answers without redundant LLM calls.
Semantic memory and caching for LLM agents with classifier-validated equivalence instead of naive cosine thresholds.
Production-grade Java 25 Virtual Thread inference gateway bridging NVIDIA Triton → Dynamo with Earliest Deadline First (EDF) priority queuing, adaptive batching, and async shadow validation.
A systems research platform for semantic KV-cache orchestration, topology-aware memory placement, distributed prefix reuse, and rack-scale inference memory simulation.
High-performance LLM observability and evaluation platform with automated instrumentation, stateful chat orchestration, semantic vector memory caching, and scheduled Temporal workers for cost anomaly detection.
LLM cost monitoring and optimization toolkit
LLMOps API Gateway in Go. Optimizes GenAI workloads with Qdrant semantic caching, Redis rate-limiting, and OpenTelemetry metrics.
📊 A FastAPI RAG pipeline exploring Redis/Valkey observability with BetterDB — semantic caching, rate limiting, and latency attribution with MCP-powered debugging.
A high-performance, open-source portfolio built with React 19, featuring an intelligent AI chat assistant with multi-LLM failover, semantic caching, and a 3D-integrated UI.
Multi-agent content pipeline with LangGraph, FastAPI, and Redis semantic caching
A lightweight, high-performance API gateway for large language models that can significantly reduce token costs through intelligent routing and semantic caching.
Simple RAG implementation with semantic caching using Redis and Langchain