⚡ Bolt: [performance improvement] Optimize solution-matching loops to O(N) map lookups and fix cache bypass #100
…ookups Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>
Pull request overview
This PR aims to improve performance and correctness in the material extraction + RAG indexing pipeline by (1) replacing nested exercise↔solution matching loops with precomputed lookup maps and (2) fixing a caching bypass in MaterialExtractor.extract_from_file so extracted results actually get stored and reused.
Changes:
- Refactors exercise→solution matching in MaterialExtractor.get_all_exercises and RAGIndexer.index_materials from nested loops to exercise_label -> solution dict lookups (preserving first-match semantics).
- Fixes MaterialExtractor.extract_from_file so successful parses are cached before returning.
- Adds a Bolt note documenting the “solution lookup optimization” learning.
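
The first change in the list above can be illustrated with a minimal sketch. The dictionary keys used here ("label", "exercise_label", "solution") and the helper names build_solution_map and pair_exercises are assumptions made for illustration; the actual data model in evolutia may differ.

```python
from typing import Dict, List


def build_solution_map(solutions: List[Dict]) -> Dict[str, Dict]:
    """Map exercise_label -> first solution carrying that label."""
    solution_map: Dict[str, Dict] = {}
    for solution in solutions:
        label = solution.get("exercise_label")  # hypothetical field name
        # Keep only the first occurrence, mirroring a nested loop that
        # stops at the first matching solution.
        if label is not None and label not in solution_map:
            solution_map[label] = solution
    return solution_map


def pair_exercises(exercises: List[Dict], solutions: List[Dict]) -> List[Dict]:
    """Attach a solution (or None) to each exercise with one lookup each."""
    solution_map = build_solution_map(solutions)
    return [
        {**exercise, "solution": solution_map.get(exercise.get("label"))}
        for exercise in exercises
    ]
```

Guarding the insertion with `label not in solution_map` is what preserves first-match semantics; a plain assignment would silently switch to last-match behaviour when labels repeat.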
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| evolutia/rag/rag_indexer.py | Uses a precomputed solutions map during indexing; also reformats code, but currently introduces an embeddings/chunks length mismatch risk when filtering chunks post-embedding. |
| evolutia/material_extractor.py | Ensures extraction results are cached (no premature return) and switches to O(N) solution lookup in get_all_exercises; cache validation logic remains inconsistent with stored timestamps/TTL. |
| .jules/bolt.md | Documents the solution-lookup optimization guideline and first-match-preserving dict population. |
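
The embeddings/chunks length mismatch flagged for rag_indexer.py can be avoided by filtering empty chunks before embedding rather than after. The following is a sketch under assumptions: embed_non_empty is a hypothetical helper, and embed_batch stands in for whatever _generate_embeddings_batch does in the real indexer.

```python
from typing import Callable, List, Sequence, Tuple


def embed_non_empty(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> Tuple[List[str], List[List[float]]]:
    """Drop empty chunks first, then embed, so both lists stay aligned."""
    valid_chunks = [chunk for chunk in chunks if chunk and chunk.strip()]
    if not valid_chunks:
        return [], []

    embeddings = embed_batch(valid_chunks)

    # Fail loudly if the embedder filtered or dropped items on its own,
    # instead of indexing chunks against the wrong vectors.
    if len(embeddings) != len(valid_chunks):
        raise ValueError(
            f"expected {len(valid_chunks)} embeddings, got {len(embeddings)}"
        )
    return valid_chunks, embeddings
```

Because the filter runs before the embedding call, index i of the returned chunks always corresponds to index i of the returned embeddings.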
Comments suppressed due to low confidence (1)
evolutia/material_extractor.py:331
_is_cache_valid ignores the per-file cached timestamp you store in _file_cache[file_path] and instead compares file_mtime only against _last_scan_timestamp. This can serve stale cached content in cases like mtime regressions (e.g., restoring an older file) and makes the stored timestamp effectively unused. Consider comparing file_path.stat().st_mtime against the cached entry’s timestamp (and TTL if kept) instead of the global last-scan value.
_ = self._file_cache[file_path]
file_mtime = file_path.stat().st_mtime
# Use the most recent scan timestamp to verify
if file_mtime > self._last_scan_timestamp:
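
One way to act on this suggestion is sketched below. FileCache and its entry layout ("data", "mtime", "stored_at") are hypothetical and are not the structure used by MaterialExtractor; the point is only that validity is decided per entry, from the file's own recorded mtime plus a TTL.

```python
import time
from pathlib import Path
from typing import Any, Dict, Optional


class FileCache:
    """Per-file cache validated against each entry's own mtime and age."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self._ttl = ttl_seconds
        self._entries: Dict[Path, Dict[str, Any]] = {}

    def store(self, file_path: Path, data: Any) -> None:
        self._entries[file_path] = {
            "data": data,
            "mtime": file_path.stat().st_mtime,  # mtime seen at store time
            "stored_at": time.time(),
        }

    def get(self, file_path: Path) -> Optional[Any]:
        entry = self._entries.get(file_path)
        if entry is None:
            return None
        if time.time() - entry["stored_at"] > self._ttl:
            return None  # entry is older than the TTL
        try:
            current_mtime = file_path.stat().st_mtime
        except OSError:
            return None  # file vanished or is unreadable
        # Use != rather than > so that mtime regressions (for example,
        # restoring an older copy of the file) also invalidate the entry.
        if current_mtime != entry["mtime"]:
            return None
        return entry["data"]
```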
# Generate embeddings
embeddings = self._generate_embeddings_batch(chunks)

# Sync chunks with embeddings (in case empty ones were filtered in _generate_embeddings_batch)
# Although here we prefer to filter beforehand to maintain consistency

# Prepare metadata
chunk_metadata = {"type": "reading", **metadata}

# Generate embeddings
embeddings = self._generate_embeddings_batch(chunks)

# Sync chunks with embeddings
valid_indices = [i for i, chunk in enumerate(chunks) if chunk and chunk.strip()]
chunks = [chunks[i] for i in valid_indices]

if not chunks:
    logger.warning(
        f"Lectura {metadata.get('title', 'unknown')} no tiene contenido válido para indexar"
    )
    return []

# Cache TTL in seconds (5 minutes)
self._cache_ttl = 300

def index_materials(self, materials: List[Dict], analyzer) -> Dict[str, int]:
    """
    Index a list of materials.

    Args:
💡 What:
Refactored the solution-to-exercise mapping logic in both MaterialExtractor.get_all_exercises and RAGIndexer.index_materials to use an $O(N)$ pre-computed hash map lookup rather than an $O(N \times M)$ nested loop. Also fixed a bug in MaterialExtractor.extract_from_file where caching logic was being accidentally bypassed due to a premature return.

🎯 Why:
When parsing multiple Markdown material files containing exercises and solutions, the application used an $O(N \times M)$ algorithm to match each exercise to its solution, scanning the full solutions list once per exercise. As the document base grows, this nested loop becomes a silent but heavy performance bottleneck. By precomputing a dictionary mapping exercise_label -> solution, each match becomes a single dictionary lookup. Furthermore, we fixed a logic error that bypassed caching, so future scans can reuse cached results instead of re-reading files from disk.

📊 Impact:
Reduces exercise/solution matching time from $O(N \times M)$ to linear $O(N + M)$ for $N$ exercises and $M$ solutions, saving CPU overhead during generation and indexing. Restores cache persistence, avoiding repeated sequential reads of unchanged files.
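
The cache bypass described above can be illustrated with a small sketch. CachingExtractor and its injected parse callable are hypothetical stand-ins, not the real MaterialExtractor; only the method name extract_from_file and the _file_cache attribute are taken from the PR discussion.

```python
from pathlib import Path
from typing import Any, Callable, Dict


class CachingExtractor:
    """Toy extractor showing the store-before-return pattern."""

    def __init__(self, parse: Callable[[Path], Dict[str, Any]]) -> None:
        self._parse = parse  # hypothetical parser injected for illustration
        self._file_cache: Dict[Path, Dict[str, Any]] = {}

    def extract_from_file(self, file_path: Path) -> Dict[str, Any]:
        cached = self._file_cache.get(file_path)
        if cached is not None:
            return cached

        result = self._parse(file_path)
        # Store the result before returning. A premature
        # `return self._parse(file_path)` at this point would leave the
        # cache permanently empty, which is the kind of bypass described.
        self._file_cache[file_path] = result
        return result
```

With this shape, a second call for the same path is served from the dictionary and the parser runs only once per file.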
🔬 Measurement:
Unit tests for indexing and Markdown parsing verify that extraction still runs and pairs exercises with their solutions exactly as before.
(Formatted using standard Ruff and Black formatters)
PR created automatically by Jules for task 12688290283806051153 started by @glacy