Copy-paste these patches to immediately improve response time. Each can be applied independently.
Why: Query rewriting makes an extra LLM call on every turn, roughly doubling LLM latency.
Edit: /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation/llm_client.py
Find this (around lines 30-40):

```python
def query_rewrite(self, query: str, conversation_history: List[Dict[str, str]]) -> str:
    """Rewrite query based on conversation history"""
    logger.debug(f"📝 Query rewrite requested for: '{query}'")
    if not conversation_history:
        logger.debug("    No conversation history, returning original query")
        return query
```

Replace with:
```python
def query_rewrite(self, query: str, conversation_history: List[Dict[str, str]]) -> str:
    """Rewrite query based on conversation history"""
    # OPTIMIZATION: Skip query rewriting to save 2-5 seconds
    # Quality impact: Minimal - queries are usually specific enough already
    return query
```

Impact: 2-5 seconds faster ⚡
Why: Fewer retrieved sources mean less context to process and a shorter prompt for response generation.
Edit: /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation/wikitalk.py
Find this (around lines 74-76):

```python
# Retrieve relevant sources using appropriate search method
print("Searching Wikipedia...")
search_method = "embedding" if (self.use_embeddings and self.retriever.faiss_index) else "like"
sources = self.retriever.search(rewritten_query, top_k=5, method=search_method)
```

Replace with:
```python
# Retrieve relevant sources using appropriate search method
print("Searching Wikipedia...")
search_method = "embedding" if (self.use_embeddings and self.retriever.faiss_index) else "like"
sources = self.retriever.search(rewritten_query, top_k=3, method=search_method)
```

Impact: 1-2 seconds faster ⚡
Why: Fuzzy matching on many results is expensive; top FAISS results are usually good.
Edit: /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation/retriever.py
Find this (around lines 389-410):

```python
def search(self, query: str, top_k: int = 20, method: str = "embedding") -> List[Dict[str, Any]]:
    """Unified search method
    ...
    """
    start_time = time.time()

    if method == "embedding":
        results = self.embedding_search(query, RETRIEVAL_TOPK)
    else:
        results = self.bm25_search(query, RETRIEVAL_TOPK)

    # Rerank top results
    reranked_results = self.rerank_results(query, results, top_k)

    elapsed = time.time() - start_time
    logger.info(f"Search completed in {elapsed:.2f}s: '{query}' → {len(reranked_results)} results")
    return reranked_results
```

Replace with:
```python
def search(self, query: str, top_k: int = 20, method: str = "embedding") -> List[Dict[str, Any]]:
    """Unified search method
    ...
    """
    start_time = time.time()

    if method == "embedding":
        results = self.embedding_search(query, RETRIEVAL_TOPK)
    else:
        results = self.bm25_search(query, RETRIEVAL_TOPK)

    # OPTIMIZATION: Skip expensive reranking - FAISS results are already good;
    # just return the top_k directly
    final_results = results[:top_k]

    elapsed = time.time() - start_time
    logger.info(f"Search completed in {elapsed:.2f}s: '{query}' → {len(final_results)} results")
    return final_results
```

Impact: 0.5-1 second faster ⚡
Why: SQL LIKE search is much faster than generating embeddings.
Trade-off: Lower search quality, but often sufficient.
Edit: /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation/wikitalk.py
Find this (around lines 19-32):

```python
class WikiTalk:
    def __init__(self, use_embeddings=True):
        """Initialize WikiTalk

        Args:
            use_embeddings: If True, use semantic search; if False, use keyword search
        """
        # Initialize with embedding search by default (if available)
        self.retriever = HybridRetriever(use_bm25_only=not use_embeddings)
```

Change to:
```python
class WikiTalk:
    def __init__(self, use_embeddings=False):  # <-- Changed from True to False
        """Initialize WikiTalk

        Args:
            use_embeddings: If True, use semantic search; if False, use keyword search
        """
        # Initialize with keyword search for faster responses
        self.retriever = HybridRetriever(use_bm25_only=not use_embeddings)
```

Impact: 1-2 seconds faster ⚡
Why: Shorter conversation history = shorter prompts = faster LLM responses.
Edit: /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation/config.py
Find this (around line 62):

```python
# Retrieval configuration
RETRIEVAL_TOPK = 40
MEMORY_TURNS = 8
```

Replace with:
```python
# Retrieval configuration
RETRIEVAL_TOPK = 40
MEMORY_TURNS = 2  # Reduced from 8 for faster response generation
```

Impact: 0.5-1 second faster ⚡
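`MEMORY_TURNS` presumably bounds how many past turns go into the prompt; the trimming itself is just a slice over the history list. A minimal sketch of the pattern (the `trim_history` helper is hypothetical; the history format is assumed to match the `List[Dict[str, str]]` type used by `query_rewrite` above):

```python
from typing import Dict, List

MEMORY_TURNS = 2  # matches the reduced value above

def trim_history(history: List[Dict[str, str]],
                 turns: int = MEMORY_TURNS) -> List[Dict[str, str]]:
    # One "turn" = a user message plus the assistant reply, so keep the
    # last 2 * turns messages. Shorter history -> shorter prompt -> faster LLM.
    return history[-2 * turns:] if turns > 0 else []
```

Dropping from 8 turns to 2 cuts the history portion of every prompt by 75%, which is where the 0.5-1s saving comes from.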
Apply optimizations 1-3 together for the best balance:
- Disable query rewriting (saves 2-5s)
- Reduce sources to 3 (saves 1-2s)
- Skip reranking (saves 0.5-1s)
Total expected improvement: 3-8 seconds faster
Command to test:

```bash
cd /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation
source py314_venv/bin/activate
time python profile_wikitalk.py
```

All these changes are in non-critical paths. To revert:
- Disable query rewriting: add back the LLM call in `query_rewrite()`
- Sources to 3: change back to `top_k=5`
- Skip reranking: add back the `rerank_results()` call
- Keyword search: change back to `use_embeddings=True`
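Rather than re-editing each file to revert, the patches could be gated behind flags in `config.py` so a single edit flips everything back. A hypothetical sketch (none of these flag names exist in the codebase):

```python
# config.py (hypothetical additions - one switch per patch above)
SKIP_QUERY_REWRITE = True   # patch 1: return query unchanged
SEARCH_TOP_K = 3            # patch 2: was 5
SKIP_RERANK = True          # patch 3: slice instead of rerank_results()
USE_EMBEDDINGS = False      # patch 4: keyword (LIKE) search
MEMORY_TURNS = 2            # patch 5: was 8

def fast_mode(enabled: bool) -> dict:
    # Flip every optimization on or off in one place.
    return {
        "SKIP_QUERY_REWRITE": enabled,
        "SEARCH_TOP_K": 3 if enabled else 5,
        "SKIP_RERANK": enabled,
        "USE_EMBEDDINGS": not enabled,
        "MEMORY_TURNS": 2 if enabled else 8,
    }
```

The call sites would then read these flags instead of hard-coding the patched values, making an A/B comparison (and a full revert) a one-line change.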
❌ Don't modify FAISS/embedding generation itself - it's already optimized
❌ Don't change database schema - it's already indexed
❌ Don't reduce top_k below 3 - search quality degrades too much
❌ Don't disable the database - it's needed for metadata
The components above are all already fast (< 1 second each).
After applying patches, test with:

```bash
cd /Users/jasontitus/experiments/wikiedia-conversation/wikipedia-conversation
source py314_venv/bin/activate
python wikitalk.py
```

Ask a question and watch the `Processed in X.XXs` time.
Target: < 3 seconds per query
Ideal: < 2 seconds per query