⚡ Bolt: [performance improvement] optimize material_extractor loops#86
⚡ Bolt: [performance improvement] optimize material_extractor loops#86glacy wants to merge 1 commit into
Conversation
…dant string lowercasing - Replaced O(N*M) nested loop in `get_all_exercises` with an O(N) lookup dictionary `solutions_by_label`. - Hoisted `topic.lower()` to `topic_lower` in `extract_by_topic` to avoid redundant string operations in large filesystem loops. - Replaced unused `cache_entry` variable with `_` in `_is_cache_valid` for linter compliance. - Applied black/ruff formatting. Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
There was a problem hiding this comment.
Pull request overview
This PR focuses on performance-oriented refactors in MaterialExtractor, primarily reducing nested-loop overhead during exercise→solution matching and avoiding repeated string lowercasing during topic-based extraction, plus a small linter-driven cleanup.
Changes:
- Replaced nested-loop solution matching in
get_all_exercises()with an O(1) lookup map (solutions_by_label) while aiming to preserve first-match semantics. - Hoisted
topic.lower()totopic_lowerinextract_by_topic()to avoid repeated lowercasing inside directory iteration loops. - Adjusted formatting/quoting and removed an unused local assignment (changed to
_) related to cache validation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| evolutia/material_extractor.py | Performance refactors for matching and topic filtering; minor cache/lint-related edits. |
| .jules/bolt.md | Adds a “Bolt” note documenting the O(N*M)→O(N) lookup refactor guidance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "data": result, | ||
| "timestamp": file_path.stat().st_mtime, | ||
| } | ||
| self._last_scan_timestamp = max(self._last_scan_timestamp, file_path.stat().st_mtime) | ||
| self._last_scan_timestamp = max( | ||
| self._last_scan_timestamp, file_path.stat().st_mtime |
There was a problem hiding this comment.
file_path.stat().st_mtime is computed twice when caching a successful extraction. Consider storing the mtime in a local variable and reusing it for both the cache entry and _last_scan_timestamp update to avoid duplicate syscalls and keep the value consistent.
| # Verificar si el archivo fue modificado | ||
| try: | ||
| cache_entry = self._file_cache[file_path] | ||
| _ = self._file_cache[file_path] |
There was a problem hiding this comment.
This lookup result is assigned to _ but never used, and file_path not in self._file_cache is already checked above, so the assignment is redundant. Removing it would simplify the control flow without changing behavior.
| _ = self._file_cache[file_path] |
| if use_cache: | ||
| self._file_cache[file_path] = { | ||
| 'data': error_result, | ||
| 'timestamp': time.time() # Usar tiempo actual para archivos que no existen | ||
| "data": error_result, | ||
| "timestamp": time.time(), # Usar tiempo actual para archivos que no existen | ||
| } |
There was a problem hiding this comment.
The error-path cache entry stores timestamp: time.time(), but _is_cache_valid() never consults the per-file cached timestamp (and _last_scan_timestamp is only updated on successful parses). If the first call for a file fails, _last_scan_timestamp can remain 0, making file_mtime > _last_scan_timestamp true and causing the cached error result to never be reused—contradicting the intent to avoid repeated failed retries/log spam. Consider validating the cache against the per-file cached timestamp (and/or TTL), and updating _last_scan_timestamp appropriately on error when the file exists.
| **Action:** Always pre-compile regexes (`re.compile`) into module-level or class-level constants if they are used repeatedly, especially in tight loops or recursive functions. | ||
| ## 2026-04-25 - O(N*M) to O(N) lookup refactoring | ||
| **Learning:** When replacing nested loops with dictionary lookups for performance in finding matching elements (e.g., matching solutions to exercises), building an O(N) lookup dictionary first is very effective. | ||
| **Action:** Use dictionary comprehensions or manual population (like `solutions_by_label = {sol['exercise_label']: sol for sol in material['solutions']}`) to eliminate inner loops when mapping keys to values. |
There was a problem hiding this comment.
This guidance suggests using a dict comprehension for the lookup table, but dict comprehensions are "last value wins" for duplicate keys. In cases where multiple solutions share the same exercise_label, that would change semantics compared to the original nested-loop+break (first match wins). Consider updating the note to recommend manual population / setdefault when first-match behavior must be preserved, and only suggest a comprehension when duplicates are impossible or last-match is desired.
| **Action:** Use dictionary comprehensions or manual population (like `solutions_by_label = {sol['exercise_label']: sol for sol in material['solutions']}`) to eliminate inner loops when mapping keys to values. | |
| **Action:** Preserve the original duplicate-key semantics when building the lookup: if the nested-loop version used `break` on the first match, populate manually with `setdefault` (for example, `solutions_by_label = {}; for sol in material['solutions']: solutions_by_label.setdefault(sol['exercise_label'], sol)`) so the first match wins. Only use a dict comprehension like `solutions_by_label = {sol['exercise_label']: sol for sol in material['solutions']}` when duplicate labels are impossible or when last-match-wins behavior is desired. |
💡 What:
get_all_exerciseswith ansolutions_by_label).topic.lower()into a local variabletopic_loweroutside of the nested directory iteration loops inextract_by_topic.cache_entry) that was failing the linter.🎯 Why:
get_all_exercisesimplementation used an inner loop with abreakto find matching solutions. As the number of exercises and solutions grows, this nested loop becomes an unnecessary performance bottleneck. Building an O(1) access lookup table completely eliminates the inner loop overhead.extract_by_topic, callingtopic.lower()repeatedly across every material file checked for every iteration oftareas_dirandexamenes_dircauses unnecessary Python object allocation and string manipulations during large dataset traversals.📊 Impact:
get_all_exercisesfromextract_by_topicby a factor directly proportional to the number of materials scanned.🔬 Measurement:
The optimization was verified by running the
tests/suite using pytest, which confirms there were no logic regressions (the dictionary logic correctly preserves the original first-match behavior of thebreakstatement). Formatted withblackandruff.PR created automatically by Jules for task 17224167210372567553 started by @glacy