⚡ Bolt: [performance improvement] Optimize exercise-solution matching by glacy · Pull Request #90 · glacy/evolutIA

glacy · 2026-04-29T19:18:07Z

💡 What:
Refactored nested for loops used to match exercises to their corresponding solutions into hash map dictionary lookups in both evolutia/material_extractor.py and evolutia/rag/rag_indexer.py.

🎯 Why:
The previous implementation iterated through all solutions for every exercise within a material to find a matching label. This resulted in an $O(N \times M)$ time complexity (where $N$ is exercises and $M$ is solutions), creating a significant performance bottleneck during bulk processing of large materials. By pre-computing a dictionary, the lookup is optimized to $O(N)$.

📊 Impact:
Reduces the time complexity of exercise-solution matching from $O(N \times M)$ to $O(N + M)$. In local benchmarks with large materials (1000 exercises and 1000 solutions), execution time dropped from ~3.97s to ~0.16s, a ~95% speed improvement.

🔬 Measurement:
Run python -m pytest tests/ -v to verify that existing extractions and indexations pass identically. The logic strictly preserves the original break (first-match) semantics by using if label not in solutions_dict:.

PR created automatically by Jules for task 6193627798938315585 started by @glacy

Refactored nested loops used for matching exercises to their corresponding solutions in `MaterialExtractor.get_all_exercises` and `RAGIndexer.index_materials`. The previous O(N*M) complexity is reduced to O(N) by pre-computing a dictionary of solutions mapped by exercise labels. This preserves first-match semantics. Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>

google-labs-jules · 2026-04-29T19:18:08Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

Copilot

Pull request overview

Optimizes exercise-to-solution matching in extraction and RAG indexing by replacing nested scans with precomputed lookup dictionaries to reduce runtime on large materials.

Changes:

Refactor exercise→solution matching to use a solutions_dict in MaterialExtractor.get_all_exercises and RAGIndexer.index_materials.
Add a local benchmark script (perf_test_loops.py) and a Jules “Bolt” learning note documenting the optimization pattern.
Update packaging metadata to include an additional test file in evolutia.egg-info/SOURCES.txt.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
perf_test_loops.py	Adds a quick benchmark script for measuring extraction performance.
evolutia/rag/rag_indexer.py	Uses precomputed `solutions_dict` for O(1) solution lookup during indexing (plus formatting cleanup).
evolutia/material_extractor.py	Uses precomputed `solutions_dict` for O(1) solution lookup when aggregating exercises (plus formatting cleanup).
evolutia.egg-info/SOURCES.txt	Updates source manifest to include `tests/test_markdown_parser.py`.
.jules/bolt.md	Documents the O(N·M) → O(N) lookup refactor pattern.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-29T19:21:37Z

            return {
-                'file_path': file_path,
-                'frontmatter': frontmatter,
-                'exercises': exercises,
-                'solutions': solutions,
-                'content_body': content_body  # Exponer contenido para indexación de lecturas
+                "file_path": file_path,
+                "frontmatter": frontmatter,
+                "exercises": exercises,
+                "solutions": solutions,
+                "content_body": content_body,  # Exponer contenido para indexación de lecturas
            }

            # Guardar en caché
            if use_cache:
                self._file_cache[file_path] = {
-                    'data': result,
-                    'timestamp': file_path.stat().st_mtime
+                    "data": result,
+                    "timestamp": file_path.stat().st_mtime,
                }
-                self._last_scan_timestamp = max(self._last_scan_timestamp, file_path.stat().st_mtime)
+                self._last_scan_timestamp = max(
+                    self._last_scan_timestamp, file_path.stat().st_mtime
+                )

            return result


extract_from_file returns the result dict immediately, so the cache-write block below is unreachable and references result (which is never defined). This breaks the file cache and makes the use_cache flag ineffective. Assign the dict to a result variable, update the cache before returning, and remove the unreachable return/dead code path.

Copilot · 2026-04-29T19:21:37Z

+# Create large mock materials
+materials = []
+for i in range(10):
+    exercises = [{'label': f'ex_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
+    solutions = [{'exercise_label': f'ex_{j}', 'label': f'sol_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
+    materials.append({
+        'file_path': 'test.md',
+        'frontmatter': {},
+        'exercises': exercises,
+        'solutions': solutions
+    })
+
+extractor = MaterialExtractor('.')
+start = time.time()
+for _ in range(10):
+    extractor.get_all_exercises(materials)
+print(f"O(N*M) time: {time.time() - start:.4f}s")


This benchmark script executes work at import time (no if __name__ == "__main__": guard), which is risky if the file is ever imported by tooling. Also the output label says O(N*M) even though get_all_exercises is now optimized. Consider guarding the entrypoint and renaming/moving the script under a dedicated scripts/ or benchmarks/ location to avoid accidental execution.

Suggested change

# Create large mock materials

materials = []

for i in range(10):

exercises = [{'label': f'ex_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]

solutions = [{'exercise_label': f'ex_{j}', 'label': f'sol_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]

materials.append({

'file_path': 'test.md',

'frontmatter': {},

'exercises': exercises,

'solutions': solutions

})

extractor = MaterialExtractor('.')

start = time.time()

for _ in range(10):

extractor.get_all_exercises(materials)

print(f"O(N*M) time: {time.time() - start:.4f}s")

def main():

# Create large mock materials

materials = []

for i in range(10):

exercises = [{'label': f'ex_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]

solutions = [{'exercise_label': f'ex_{j}', 'label': f'sol_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]

materials.append({

'file_path': 'test.md',

'frontmatter': {},

'exercises': exercises,

'solutions': solutions

})

extractor = MaterialExtractor('.')

start = time.time()

for _ in range(10):

extractor.get_all_exercises(materials)

print(f"Benchmark time: {time.time() - start:.4f}s")

if __name__ == "__main__":

main()

Copilot AI review requested due to automatic review settings April 29, 2026 19:18

Copilot started reviewing on behalf of glacy April 29, 2026 19:18 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Bolt: [performance improvement] Optimize exercise-solution matching#90

⚡ Bolt: [performance improvement] Optimize exercise-solution matching#90
glacy wants to merge 1 commit into
mainfrom
perf-optimize-match-loops-6193627798938315585

glacy commented Apr 29, 2026

Uh oh!

google-labs-jules Bot commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 29, 2026

Uh oh!

Copilot AI Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

glacy commented Apr 29, 2026

Uh oh!

google-labs-jules Bot commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants