⚡ Bolt: [performance improvement] Optimize solution-matching loops to O(N) map lookups and fix cache bypass #100

Open
glacy wants to merge 1 commit into main from bolt/optimize-solution-matching-12688290283806051153

@glacy (Owner) commented May 12, 2026

💡 What:
Refactored the solution-to-exercise mapping logic in both MaterialExtractor.get_all_exercises and RAGIndexer.index_materials to use an $O(N)$ pre-computed hash map lookup rather than an $O(N \times M)$ nested loop. Also fixed a bug in MaterialExtractor.extract_from_file where caching logic was being accidentally bypassed due to a premature return.

🎯 Why:
When parsing multiple Markdown material files containing exercises and solutions, the application matched each exercise to its solution with an $O(N \times M)$ algorithm, rescanning the full solutions list for every exercise. As the document base grows, this nested loop becomes a silent but heavy performance bottleneck. Precomputing a dictionary mapping exercise_label -> solution replaces each inner scan with a constant-time lookup. We also fixed a logic error that bypassed caching, so later scans can reuse cached results instead of repeating disk I/O.
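The change described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the dictionary keys `label` and `exercise_label` and all three function names are assumptions made for the example. `setdefault` is used so that the first solution seen for a label wins, which mirrors a nested loop that stops at its first match.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[dict, Optional[dict]]


def build_solution_map(solutions: List[dict]) -> Dict[Optional[str], dict]:
    """Map exercise_label -> first solution carrying that label (hypothetical schema)."""
    solution_map: Dict[Optional[str], dict] = {}
    for solution in solutions:
        # setdefault only inserts when the key is absent, so earlier solutions win.
        solution_map.setdefault(solution.get("exercise_label"), solution)
    return solution_map


def match_nested(exercises: List[dict], solutions: List[dict]) -> List[Pair]:
    """Nested-loop matching, O(N x M); shown only for comparison."""
    pairs: List[Pair] = []
    for exercise in exercises:
        found = None
        for solution in solutions:
            if solution.get("exercise_label") == exercise.get("label"):
                found = solution
                break  # first match wins
        pairs.append((exercise, found))
    return pairs


def match_with_map(exercises: List[dict], solutions: List[dict]) -> List[Pair]:
    """One pass to build the map, one pass to look up each exercise."""
    solution_map = build_solution_map(solutions)
    return [(exercise, solution_map.get(exercise.get("label"))) for exercise in exercises]
```

Both functions should return the same pairs for the same input; only the map version avoids rescanning the solutions list for every exercise.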

📊 Impact:
Reduces exercise/solution matching time from $O(N \times M)$ to linear, $O(N + M)$: one pass to build the map and one pass to look up each exercise. This saves CPU time during generation and indexing. Restoring the cache logic also avoids repeated file reads on subsequent scans.

🔬 Measurement:
The existing unit tests for indexing and Markdown parsing pass, confirming that standard extraction still works and that exercises are paired with the same solutions as before.

(Formatted using standard Ruff and Black formatters)


PR created automatically by Jules for task 12688290283806051153 started by @glacy

…ookups

Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 12, 2026 18:37
@google-labs-jules (Contributor)

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI left a comment

Pull request overview

This PR aims to improve performance and correctness in the material extraction + RAG indexing pipeline by (1) replacing nested exercise↔solution matching loops with precomputed lookup maps and (2) fixing a caching bypass in MaterialExtractor.extract_from_file so extracted results actually get stored and reused.

Changes:

  • Refactors exercise→solution matching in MaterialExtractor.get_all_exercises and RAGIndexer.index_materials from nested loops to exercise_label -> solution dict lookups (preserving first-match semantics).
  • Fixes MaterialExtractor.extract_from_file so successful parses are cached before returning.
  • Adds a Bolt note documenting the “solution lookup optimization” learning.
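The caching fix in the second bullet can be pictured with a small sketch. Everything here is hypothetical: `CachingExtractor` is a stand-in for the real class, the parser is passed in as a plain callable, and cache invalidation is left out on purpose. The point is only the ordering: the result is stored before the function returns, so a second call does not parse again.

```python
from pathlib import Path
from typing import Callable, Dict


class CachingExtractor:
    """Hypothetical stand-in for MaterialExtractor; not the project's actual class."""

    def __init__(self, parse: Callable[[Path], dict]) -> None:
        self._parse = parse
        self._file_cache: Dict[Path, dict] = {}
        self.parse_calls = 0  # counts real parses, for illustration only

    def extract_from_file(self, file_path: Path) -> dict:
        cached = self._file_cache.get(file_path)
        if cached is not None:
            return cached

        self.parse_calls += 1
        result = self._parse(file_path)
        # Store before returning. Returning `result` above this line would
        # leave the cache empty and force a re-parse on every call.
        self._file_cache[file_path] = result
        return result
```

With this ordering, calling `extract_from_file` twice on the same path triggers a single parse.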

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| evolutia/rag/rag_indexer.py | Uses a precomputed solutions map during indexing; also reformats code, but currently introduces an embeddings/chunks length mismatch risk when filtering chunks post-embedding. |
| evolutia/material_extractor.py | Ensures extraction results are cached (no premature return) and switches to O(N) solution lookup in get_all_exercises; cache validation logic remains inconsistent with stored timestamps/TTL. |
| .jules/bolt.md | Documents the solution-lookup optimization guideline and first-match-preserving dict population. |
Comments suppressed due to low confidence (1)

evolutia/material_extractor.py:331

  • _is_cache_valid ignores the per-file cached timestamp you store in _file_cache[file_path] and instead compares file_mtime only against _last_scan_timestamp. This can serve stale cached content in cases like mtime regressions (e.g., restoring an older file) and makes the stored timestamp effectively unused. Consider comparing file_path.stat().st_mtime against the cached entry’s timestamp (and TTL if kept) instead of the global last-scan value.
            _ = self._file_cache[file_path]
            file_mtime = file_path.stat().st_mtime

            # Use the most recent scan timestamp for validation
            if file_mtime > self._last_scan_timestamp:
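One way to read the suggestion above is sketched below. It is an assumption-laden illustration rather than the project's `_is_cache_valid`: the cache layout (a tuple of the file's mtime at caching time and the wall-clock time of caching), the function name, and the `now` parameter are all invented for the example. Comparing for inequality, not just "newer than", also catches mtime regressions such as a restored older file.

```python
import time
from typing import Dict, Optional, Tuple


def is_cache_valid(
    cache: Dict[str, Tuple[float, float]],
    key: str,
    current_mtime: float,
    ttl_seconds: float = 300.0,
    now: Optional[float] = None,
) -> bool:
    """Check one entry; cache maps key -> (mtime_when_cached, cached_at)."""
    entry = cache.get(key)
    if entry is None:
        return False

    cached_mtime, cached_at = entry
    if now is None:
        now = time.time()

    # Any mtime change, forward or backward, invalidates the entry.
    if current_mtime != cached_mtime:
        return False

    # The entry must also be younger than the TTL.
    return (now - cached_at) <= ttl_seconds
```

In real use, `current_mtime` would come from `file_path.stat().st_mtime`; it is passed in here so the check stays free of file I/O.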

Comment on lines +288 to +292
# Generate embeddings
embeddings = self._generate_embeddings_batch(chunks)

# Sync chunks with embeddings (in case empty ones were filtered out in _generate_embeddings_batch)
# Although here we prefer to filter beforehand to keep things consistent
Comment on lines +344 to +360

# Prepare metadata
chunk_metadata = {"type": "reading", **metadata}

# Generate embeddings
embeddings = self._generate_embeddings_batch(chunks)

# Sync chunks with embeddings
valid_indices = [i for i, chunk in enumerate(chunks) if chunk and chunk.strip()]
chunks = [chunks[i] for i in valid_indices]

if not chunks:
    # Log message: "Reading <title> has no valid content to index"
    logger.warning(
        f"Lectura {metadata.get('title', 'unknown')} no tiene contenido válido para indexar"
    )
    return []
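The length-mismatch risk noted in the file summary comes from filtering `chunks` after the embeddings have already been generated. A possible alternative, sketched here under assumptions, is to filter first and embed only what survives. The helper name `embed_non_empty` and the `embed_batch` callable are hypothetical; the real `_generate_embeddings_batch` may behave differently.

```python
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]


def embed_non_empty(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[Embedding]],
) -> Tuple[List[str], List[Embedding]]:
    """Drop empty chunks first, then embed, so both lists stay index-aligned."""
    valid_chunks = [chunk for chunk in chunks if chunk and chunk.strip()]
    if not valid_chunks:
        return [], []

    embeddings = embed_batch(valid_chunks)
    # Fail loudly if the embedder dropped or added items on its own.
    if len(embeddings) != len(valid_chunks):
        raise ValueError(
            f"expected {len(valid_chunks)} embeddings, got {len(embeddings)}"
        )
    return valid_chunks, embeddings
```

Because filtering happens before the embedding call, `valid_chunks[i]` and `embeddings[i]` always refer to the same text.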

Comment on lines 49 to 50
# Cache TTL in seconds (5 minutes)
self._cache_ttl = 300
Comment on lines +387 to +391
def index_materials(self, materials: List[Dict], analyzer) -> Dict[str, int]:
    """
    Indexes a list of materials.

    Args: