Skip to content

⚡ Bolt: [performance improvement] Optimize exercise-solution matching#90

Open
glacy wants to merge 1 commit into
mainfrom
perf-optimize-match-loops-6193627798938315585
Open

⚡ Bolt: [performance improvement] Optimize exercise-solution matching#90
glacy wants to merge 1 commit into
mainfrom
perf-optimize-match-loops-6193627798938315585

Conversation

@glacy
Copy link
Copy Markdown
Owner

@glacy glacy commented Apr 29, 2026

💡 What:
Refactored nested for loops used to match exercises to their corresponding solutions into hash map dictionary lookups in both evolutia/material_extractor.py and evolutia/rag/rag_indexer.py.

🎯 Why:
The previous implementation iterated through all solutions for every exercise within a material to find a matching label. This resulted in an $O(N \times M)$ time complexity (where $N$ is exercises and $M$ is solutions), creating a significant performance bottleneck during bulk processing of large materials. By pre-computing a dictionary, the lookup is optimized to $O(N)$.

📊 Impact:
Reduces the time complexity of exercise-solution matching from $O(N \times M)$ to $O(N + M)$. In local benchmarks with large materials (1000 exercises and 1000 solutions), execution time dropped from ~3.97s to ~0.16s, a ~95% speed improvement.

🔬 Measurement:
Run python -m pytest tests/ -v to verify that existing extractions and indexations pass identically. The logic strictly preserves the original break (first-match) semantics by using if label not in solutions_dict:.


PR created automatically by Jules for task 6193627798938315585 started by @glacy

Refactored nested loops used for matching exercises to their corresponding
solutions in `MaterialExtractor.get_all_exercises` and `RAGIndexer.index_materials`.
The previous O(N*M) complexity is reduced to O(N) by pre-computing a dictionary
of solutions mapped by exercise labels. This preserves first-match semantics.

Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 29, 2026 19:18
@google-labs-jules
Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimizes exercise-to-solution matching in extraction and RAG indexing by replacing nested scans with precomputed lookup dictionaries to reduce runtime on large materials.

Changes:

  • Refactor exercise→solution matching to use a solutions_dict in MaterialExtractor.get_all_exercises and RAGIndexer.index_materials.
  • Add a local benchmark script (perf_test_loops.py) and a Jules “Bolt” learning note documenting the optimization pattern.
  • Update packaging metadata to include an additional test file in evolutia.egg-info/SOURCES.txt.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
perf_test_loops.py Adds a quick benchmark script for measuring extraction performance.
evolutia/rag/rag_indexer.py Uses precomputed solutions_dict for O(1) solution lookup during indexing (plus formatting cleanup).
evolutia/material_extractor.py Uses precomputed solutions_dict for O(1) solution lookup when aggregating exercises (plus formatting cleanup).
evolutia.egg-info/SOURCES.txt Updates source manifest to include tests/test_markdown_parser.py.
.jules/bolt.md Documents the O(N·M) → O(N) lookup refactor pattern.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 112 to 130
return {
'file_path': file_path,
'frontmatter': frontmatter,
'exercises': exercises,
'solutions': solutions,
'content_body': content_body # Exponer contenido para indexación de lecturas
"file_path": file_path,
"frontmatter": frontmatter,
"exercises": exercises,
"solutions": solutions,
"content_body": content_body, # Exponer contenido para indexación de lecturas
}

# Guardar en caché
if use_cache:
self._file_cache[file_path] = {
'data': result,
'timestamp': file_path.stat().st_mtime
"data": result,
"timestamp": file_path.stat().st_mtime,
}
self._last_scan_timestamp = max(self._last_scan_timestamp, file_path.stat().st_mtime)
self._last_scan_timestamp = max(
self._last_scan_timestamp, file_path.stat().st_mtime
)

return result
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_from_file returns the result dict immediately, so the cache-write block below is unreachable and references result (which is never defined). This breaks the file cache and makes the use_cache flag ineffective. Assign the dict to a result variable, update the cache before returning, and remove the unreachable return/dead code path.

Copilot uses AI. Check for mistakes.
Comment thread perf_test_loops.py
Comment on lines +4 to +20
# Create large mock materials
materials = []
for i in range(10):
exercises = [{'label': f'ex_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
solutions = [{'exercise_label': f'ex_{j}', 'label': f'sol_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
materials.append({
'file_path': 'test.md',
'frontmatter': {},
'exercises': exercises,
'solutions': solutions
})

extractor = MaterialExtractor('.')
start = time.time()
for _ in range(10):
extractor.get_all_exercises(materials)
print(f"O(N*M) time: {time.time() - start:.4f}s")
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark script executes work at import time (no if __name__ == "__main__": guard), which is risky if the file is ever imported by tooling. Also the output label says O(N*M) even though get_all_exercises is now optimized. Consider guarding the entrypoint and renaming/moving the script under a dedicated scripts/ or benchmarks/ location to avoid accidental execution.

Suggested change
# Create large mock materials
materials = []
for i in range(10):
exercises = [{'label': f'ex_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
solutions = [{'exercise_label': f'ex_{j}', 'label': f'sol_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
materials.append({
'file_path': 'test.md',
'frontmatter': {},
'exercises': exercises,
'solutions': solutions
})
extractor = MaterialExtractor('.')
start = time.time()
for _ in range(10):
extractor.get_all_exercises(materials)
print(f"O(N*M) time: {time.time() - start:.4f}s")
def main():
# Create large mock materials
materials = []
for i in range(10):
exercises = [{'label': f'ex_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
solutions = [{'exercise_label': f'ex_{j}', 'label': f'sol_{j}', 'content': 'test', 'resolved_content': 'test'} for j in range(1000)]
materials.append({
'file_path': 'test.md',
'frontmatter': {},
'exercises': exercises,
'solutions': solutions
})
extractor = MaterialExtractor('.')
start = time.time()
for _ in range(10):
extractor.get_all_exercises(materials)
print(f"Benchmark time: {time.time() - start:.4f}s")
if __name__ == "__main__":
main()

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants