⚡ Bolt: [performance improvement] Optimize solution-matching loops to O(N) map lookups and fix cache bypass by glacy · Pull Request #100 · glacy/evolutIA

glacy · 2026-05-12T18:37:50Z

💡 What:
Refactored the solution-to-exercise mapping logic in both MaterialExtractor.get_all_exercises and RAGIndexer.index_materials to use an $O(N)$ pre-computed hash map lookup rather than an $O(N \times M)$ nested loop. Also fixed a bug in MaterialExtractor.extract_from_file where caching logic was being accidentally bypassed due to a premature return.

🎯 Why:
When parsing multiple Markdown material files containing exercises and solutions, the application matched each exercise to its solution with an $O(N \times M)$ algorithm, rescanning the full solutions list for every exercise. As the document base grows, this nested loop becomes a silent but significant performance bottleneck. Precomputing a dictionary mapping exercise_label -> solution replaces each scan with a constant-time lookup. We also fixed a logic error that bypassed caching, so later scans can reuse cached results instead of re-reading files from disk.
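A minimal sketch of the two matching strategies, for illustration only. The dict shapes, the `label` key on exercises, and the function names `match_nested` and `match_with_map` are assumptions and are not taken from the repository; only `exercise_label` appears in this PR. Strictly, the map version costs one pass over the solutions plus one lookup per exercise.

```python
from typing import Dict, List, Optional, Tuple

Exercise = Dict[str, str]
Solution = Dict[str, str]


def match_nested(
    exercises: List[Exercise], solutions: List[Solution]
) -> List[Tuple[Exercise, Optional[Solution]]]:
    """O(N*M): scan the whole solutions list for every exercise."""
    pairs = []
    for exercise in exercises:
        found = None
        for solution in solutions:
            if solution["exercise_label"] == exercise["label"]:
                found = solution
                break  # the first matching solution wins
        pairs.append((exercise, found))
    return pairs


def match_with_map(
    exercises: List[Exercise], solutions: List[Solution]
) -> List[Tuple[Exercise, Optional[Solution]]]:
    """Build the lookup map once, then do one dict lookup per exercise."""
    by_label: Dict[str, Solution] = {}
    for solution in solutions:
        # setdefault keeps the first solution seen for a label,
        # which preserves the first-match behavior of the nested loop.
        by_label.setdefault(solution["exercise_label"], solution)
    return [(exercise, by_label.get(exercise["label"])) for exercise in exercises]
```

Using `setdefault` rather than plain assignment matters when two solutions share a label: plain assignment would keep the last one and silently change which solution is paired.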

📊 Impact:
Reduces exercise/solution matching time from $O(N \times M)$ to linear $O(N)$, saving CPU time during generation and indexing. Restores cache persistence, which avoids repeated file reads on subsequent scans.

🔬 Measurement:
Unit tests for indexing and Markdown parsing verify that standard extraction still runs and pairs exercises with their solutions exactly as before.

(Formatted using standard Ruff and Black formatters)

PR created automatically by Jules for task 12688290283806051153 started by @glacy


google-labs-jules · 2026-05-12T18:37:51Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

Copilot

Pull request overview

This PR aims to improve performance and correctness in the material extraction + RAG indexing pipeline by (1) replacing nested exercise↔solution matching loops with precomputed lookup maps and (2) fixing a caching bypass in MaterialExtractor.extract_from_file so extracted results actually get stored and reused.

Changes:

- Refactors exercise→solution matching in MaterialExtractor.get_all_exercises and RAGIndexer.index_materials from nested loops to exercise_label -> solution dict lookups (preserving first-match semantics).
- Fixes MaterialExtractor.extract_from_file so successful parses are cached before returning.
- Adds a Bolt note documenting the “solution lookup optimization” learning.
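The cache-bypass fix listed above can be sketched as follows. This is an illustrative reduction, not the actual `MaterialExtractor` code: the class `ExtractorSketch`, the injected `parse` callable, and the `parse_calls` counter are hypothetical, and the real cache entry reportedly also stores a timestamp, which is omitted here.

```python
from pathlib import Path
from typing import Callable, Dict


class ExtractorSketch:
    """Hypothetical stand-in used only to show the premature-return bug."""

    def __init__(self, parse: Callable[[Path], dict]) -> None:
        self._parse = parse
        self._file_cache: Dict[Path, dict] = {}
        self.parse_calls = 0  # counts how often the expensive parse runs

    def extract_buggy(self, file_path: Path) -> dict:
        if file_path in self._file_cache:
            return self._file_cache[file_path]
        self.parse_calls += 1
        result = self._parse(file_path)
        return result  # premature return: the store below never executes
        self._file_cache[file_path] = result

    def extract_fixed(self, file_path: Path) -> dict:
        if file_path in self._file_cache:
            return self._file_cache[file_path]
        self.parse_calls += 1
        result = self._parse(file_path)
        self._file_cache[file_path] = result  # store first, then return
        return result
```

With the buggy variant every call re-parses the file, because nothing is ever written to the cache; with the fixed variant the second call for the same path is served from memory.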

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| evolutia/rag/rag_indexer.py | Uses a precomputed solutions map during indexing; also reformats code, but currently introduces an embeddings/chunks length mismatch risk when filtering chunks post-embedding. |
| evolutia/material_extractor.py | Ensures extraction results are cached (no premature return) and switches to O(N) solution lookup in `get_all_exercises`; cache validation logic remains inconsistent with stored timestamps/TTL. |
| .jules/bolt.md | Documents the solution-lookup optimization guideline and first-match-preserving dict population. |

Comments suppressed due to low confidence (1)

evolutia/material_extractor.py:331

_is_cache_valid ignores the per-file cached timestamp you store in _file_cache[file_path] and instead compares file_mtime only against _last_scan_timestamp. This can serve stale cached content in cases like mtime regressions (e.g., restoring an older file) and makes the stored timestamp effectively unused. Consider comparing file_path.stat().st_mtime against the cached entry’s timestamp (and TTL if kept) instead of the global last-scan value.

            _ = self._file_cache[file_path]
            file_mtime = file_path.stat().st_mtime

            # Use the most recent scan timestamp for the check
            if file_mtime > self._last_scan_timestamp:
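One way the per-entry check suggested in this comment could look, as a sketch under stated assumptions. The function name `is_cache_entry_valid` and its parameters (`cached_mtime`, `cached_at`, `current_mtime`) are hypothetical; in the real code `current_mtime` would presumably come from `file_path.stat().st_mtime` and the other two values from the entry in `_file_cache`.

```python
import time
from typing import Optional


def is_cache_entry_valid(
    cached_mtime: float,
    cached_at: float,
    current_mtime: float,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """Validate one cache entry against its own stored timestamps.

    cached_mtime: the file's mtime recorded when the entry was stored.
    cached_at: the wall-clock time when the entry was stored.
    current_mtime: the file's mtime right now.
    """
    if now is None:
        now = time.time()
    # Any change in mtime invalidates the entry, including a regression
    # such as restoring an older copy of the file.
    if current_mtime != cached_mtime:
        return False
    # The entry also expires once it is older than the TTL.
    return (now - cached_at) <= ttl_seconds
```

Comparing for inequality rather than "newer than" is a design choice here: it covers the restored-older-file case mentioned in the comment, at the cost of invalidating on any mtime change.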


+        # Generate embeddings
+        embeddings = self._generate_embeddings_batch(chunks)
+
+        # Keep chunks in sync with embeddings (in case empty ones were filtered out in _generate_embeddings_batch)
+        # Although here we prefer to filter beforehand to keep things consistent


+
+        # Prepare metadata
+        chunk_metadata = {"type": "reading", **metadata}
+
+        # Generate embeddings
+        embeddings = self._generate_embeddings_batch(chunks)
+
+        # Keep chunks in sync with embeddings
+        valid_indices = [i for i, chunk in enumerate(chunks) if chunk and chunk.strip()]
+        chunks = [chunks[i] for i in valid_indices]
+
+        if not chunks:
+            logger.warning(
+                f"Lectura {metadata.get('title', 'unknown')} no tiene contenido válido para indexar"
+            )
+            return []
+
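The hunk above generates embeddings first and filters `chunks` afterwards, which is the length-mismatch risk flagged in the review table. A possible ordering that avoids it is sketched below. The helper `embed_non_empty` and the `embed_batch` callable are hypothetical stand-ins (the latter for `_generate_embeddings_batch`, whose real signature and return type are not shown in this PR).

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def embed_non_empty(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
) -> Tuple[List[str], List[Vector]]:
    """Filter empty chunks before embedding so both lists stay aligned."""
    valid_chunks = [chunk for chunk in chunks if chunk and chunk.strip()]
    if not valid_chunks:
        return [], []
    embeddings = embed_batch(valid_chunks)
    # Fail loudly instead of indexing misaligned chunk/embedding pairs.
    if len(embeddings) != len(valid_chunks):
        raise ValueError(
            f"Expected {len(valid_chunks)} embeddings, got {len(embeddings)}"
        )
    return valid_chunks, embeddings
```

Because the filter runs before the embedding call, index `i` in the returned chunk list always corresponds to index `i` in the returned embedding list.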


        # Cache TTL in seconds (5 minutes)
        self._cache_ttl = 300


+    def index_materials(self, materials: List[Dict], analyzer) -> Dict[str, int]:
+        """
+        Indexes a list of materials.
+
+        Args:


refactor: optimize solution matching from O(N*M) to O(N) using dict lookups

9de2f16

Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 12, 2026 18:37

Copilot started reviewing on behalf of glacy May 12, 2026 18:38
glacy wants to merge 1 commit into main from bolt/optimize-solution-matching-12688290283806051153
