⚡ Bolt: [performance improvement] optimize material_extractor loops#86

Open
glacy wants to merge 1 commit into main from bolt-perf-material-extractor-17224167210372567553

Conversation

@glacy
Owner

@glacy glacy commented Apr 25, 2026

💡 What:

  1. Replaced the $O(N \times M)$ nested loop used for matching solutions to exercises in get_all_exercises with an $O(N + M)$ dictionary lookup (solutions_by_label).
  2. Hoisted topic.lower() into a local variable topic_lower outside of the nested directory iteration loops in extract_by_topic.
  3. Fixed a minor unused variable assignment (cache_entry) that was failing the linter.

🎯 Why:

  1. The previous get_all_exercises implementation scanned the solutions list in an inner loop for every exercise, stopping with a break at the first match. That cost grows with the product of the number of exercises and the number of solutions. Building the lookup table once gives average O(1) access per exercise and removes the inner loop.
  2. In extract_by_topic, topic.lower() was re-evaluated for every material file visited while iterating over tareas_dir and examenes_dir. The result cannot change within a single call, so each repeated evaluation allocates a throwaway string during large dataset traversals.
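
The hoisting described in point 2 can be sketched as follows. This is a minimal illustration, not the code in material_extractor.py: the helper name `filter_by_topic`, the sample paths, and the rule of matching the topic against the file name are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, List


def filter_by_topic(paths: Iterable[Path], topic: str) -> List[Path]:
    """Hypothetical helper: keep paths whose file name mentions the topic."""
    # Lowercase the topic once, before the loop, instead of once per file.
    topic_lower = topic.lower()
    return [p for p in paths if topic_lower in p.name.lower()]


# Hypothetical sample paths, used only to exercise the helper.
sample = [
    Path("tareas/Tarea1_Vectores.md"),
    Path("examenes/examen_matrices.md"),
]
matches = filter_by_topic(sample, "VECTORES")
```

Only the loop-invariant value is hoisted; the per-file lowercasing of the file name still has to happen inside the loop because it differs for each file.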

📊 Impact:

  • Reduces the time complexity of exercise-to-solution matching in get_all_exercises from $O(N \times M)$ to $O(N + M)$, where $N$ is the number of exercises and $M$ the number of solutions.
  • Replaces one topic.lower() call per scanned material in extract_by_topic with a single call per invocation.

🔬 Measurement:

The change was verified by running the tests/ suite with pytest, which showed no logic regressions (the dictionary logic preserves the original first-match behavior of the break statement). The code was formatted with black and ruff.
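
One way to check the first-match claim directly is to run both strategies over the same data and compare the results. The sketch below is illustrative only: the function names `match_nested` and `match_lookup`, the exercise key `"label"`, and the sample records are assumptions; only the `exercise_label` key and the name `solutions_by_label` come from this PR.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]
Pair = Tuple[str, Optional[Record]]


def match_nested(exercises: List[Record], solutions: List[Record]) -> List[Pair]:
    """Original shape: inner loop with break, O(N * M)."""
    pairs: List[Pair] = []
    for ex in exercises:
        found = None
        for sol in solutions:
            if sol["exercise_label"] == ex["label"]:
                found = sol
                break  # first match wins
        pairs.append((ex["label"], found))
    return pairs


def match_lookup(exercises: List[Record], solutions: List[Record]) -> List[Pair]:
    """Refactored shape: build the table once, then look up, O(N + M)."""
    solutions_by_label: Dict[str, Record] = {}
    for sol in solutions:
        label = sol["exercise_label"]
        if label not in solutions_by_label:  # keep the first occurrence only
            solutions_by_label[label] = sol
    return [(ex["label"], solutions_by_label.get(ex["label"])) for ex in exercises]


# Hypothetical data with a duplicated label and an unmatched exercise.
exercises = [{"label": "ex-1"}, {"label": "ex-2"}, {"label": "ex-3"}]
solutions = [
    {"exercise_label": "ex-2", "text": "first"},
    {"exercise_label": "ex-2", "text": "second"},
    {"exercise_label": "ex-1", "text": "only"},
]
nested_result = match_nested(exercises, solutions)
lookup_result = match_lookup(exercises, solutions)
```

A comparison like this covers the two cases where the strategies could diverge: duplicate labels and exercises with no solution.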


PR created automatically by Jules for task 17224167210372567553 started by @glacy

…dant string lowercasing

- Replaced O(N*M) nested loop in `get_all_exercises` with an O(N) lookup dictionary `solutions_by_label`.
- Hoisted `topic.lower()` to `topic_lower` in `extract_by_topic` to avoid redundant string operations in large filesystem loops.
- Replaced unused `cache_entry` variable with `_` in `_is_cache_valid` for linter compliance.
- Applied black/ruff formatting.

Co-authored-by: glacy <1131951+glacy@users.noreply.github.com>
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI review requested due to automatic review settings April 25, 2026 18:06
Copilot AI left a comment

Pull request overview

This PR focuses on performance-oriented refactors in MaterialExtractor, primarily reducing nested-loop overhead during exercise→solution matching and avoiding repeated string lowercasing during topic-based extraction, plus a small linter-driven cleanup.

Changes:

  • Replaced nested-loop solution matching in get_all_exercises() with an O(1) lookup map (solutions_by_label) while aiming to preserve first-match semantics.
  • Hoisted topic.lower() to topic_lower in extract_by_topic() to avoid repeated lowercasing inside directory iteration loops.
  • Adjusted formatting/quoting and removed an unused local assignment (changed to _) related to cache validation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| evolutia/material_extractor.py | Performance refactors for matching and topic filtering; minor cache/lint-related edits. |
| .jules/bolt.md | Adds a “Bolt” note documenting the O(N*M)→O(N) lookup refactor guidance. |

Comment on lines +123 to +127
"data": result,
"timestamp": file_path.stat().st_mtime,
}
self._last_scan_timestamp = max(self._last_scan_timestamp, file_path.stat().st_mtime)
self._last_scan_timestamp = max(
self._last_scan_timestamp, file_path.stat().st_mtime
Copilot AI Apr 25, 2026

file_path.stat().st_mtime is computed twice when caching a successful extraction. Consider storing the mtime in a local variable and reusing it for both the cache entry and _last_scan_timestamp update to avoid duplicate syscalls and keep the value consistent.
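
A minimal sketch of that suggestion, assuming a simplified cache holder: the class `TinyCache` and its `store` method are hypothetical stand-ins for the relevant part of MaterialExtractor, and only the attribute names `_file_cache` and `_last_scan_timestamp` and the `"data"`/`"timestamp"` keys are taken from the diff.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict


class TinyCache:
    """Hypothetical stand-in for the caching part of MaterialExtractor."""

    def __init__(self) -> None:
        self._file_cache: Dict[Path, Dict[str, Any]] = {}
        self._last_scan_timestamp = 0.0

    def store(self, file_path: Path, result: Dict[str, Any]) -> None:
        # One stat() call; the same value feeds both the cache entry
        # and the scan timestamp, so the two cannot disagree.
        mtime = file_path.stat().st_mtime
        self._file_cache[file_path] = {"data": result, "timestamp": mtime}
        self._last_scan_timestamp = max(self._last_scan_timestamp, mtime)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "material.md"
    path.write_text("# demo")
    cache = TinyCache()
    cache.store(path, {"exercises": []})
    entry_timestamp = cache._file_cache[path]["timestamp"]
```

Besides saving a syscall, reading the mtime once avoids the case where the file is modified between the two stat() calls and the two stored values diverge.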

Copilot uses AI. Check for mistakes.
# Verificar si el archivo fue modificado
try:
cache_entry = self._file_cache[file_path]
_ = self._file_cache[file_path]
Copilot AI Apr 25, 2026

This lookup result is assigned to _ but never used, and file_path not in self._file_cache is already checked above, so the assignment is redundant. Removing it would simplify the control flow without changing behavior.

Suggested change
_ = self._file_cache[file_path]

Comment on lines 141 to 145
if use_cache:
self._file_cache[file_path] = {
'data': error_result,
'timestamp': time.time() # Usar tiempo actual para archivos que no existen
"data": error_result,
"timestamp": time.time(), # Usar tiempo actual para archivos que no existen
}
Copilot AI Apr 25, 2026

The error-path cache entry stores timestamp: time.time(), but _is_cache_valid() never consults the per-file cached timestamp (and _last_scan_timestamp is only updated on successful parses). If the first call for a file fails, _last_scan_timestamp can remain 0, making file_mtime > _last_scan_timestamp true and causing the cached error result to never be reused—contradicting the intent to avoid repeated failed retries/log spam. Consider validating the cache against the per-file cached timestamp (and/or TTL), and updating _last_scan_timestamp appropriately on error when the file exists.
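
One possible shape for the per-file validation suggested here is sketched below. It is not the project's `_is_cache_valid()`: the standalone function `is_cache_valid`, the `error_ttl` parameter, and the policy of reusing a cached error for a missing file only within that TTL are assumptions chosen for illustration.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Any, Dict


def is_cache_valid(
    file_cache: Dict[Path, Dict[str, Any]],
    file_path: Path,
    error_ttl: float = 60.0,  # hypothetical TTL for entries of missing files
) -> bool:
    """Validate an entry against its own timestamp, not a global one."""
    entry = file_cache.get(file_path)
    if entry is None:
        return False
    try:
        current_mtime = file_path.stat().st_mtime
    except OSError:
        # File is missing or unreadable: reuse the cached result briefly.
        return (time.time() - entry["timestamp"]) < error_ttl
    # Entry is valid while the file has not changed since it was cached.
    return current_mtime <= entry["timestamp"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "material.md"
    path.write_text("# demo")
    os.utime(path, (1000.0, 1000.0))  # pin the mtime for a deterministic demo
    cache = {path: {"data": {"error": "parse failed"}, "timestamp": 2000.0}}
    valid_when_unchanged = is_cache_valid(cache, path)

    os.utime(path, (3000.0, 3000.0))  # simulate a later edit
    valid_after_edit = is_cache_valid(cache, path)

    missing = Path(tmp) / "missing.md"
    cache[missing] = {"data": {"error": "not found"}, "timestamp": time.time()}
    valid_recent_missing = is_cache_valid(cache, missing)
```

With this shape a cached error result is reused until the file changes, independent of whether any other file has been parsed successfully.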

Comment thread .jules/bolt.md
**Action:** Always pre-compile regexes (`re.compile`) into module-level or class-level constants if they are used repeatedly, especially in tight loops or recursive functions.
## 2026-04-25 - O(N*M) to O(N) lookup refactoring
**Learning:** When replacing nested loops with dictionary lookups for performance in finding matching elements (e.g., matching solutions to exercises), building an O(N) lookup dictionary first is very effective.
**Action:** Use dictionary comprehensions or manual population (like `solutions_by_label = {sol['exercise_label']: sol for sol in material['solutions']}`) to eliminate inner loops when mapping keys to values.
Copilot AI Apr 25, 2026

This guidance suggests using a dict comprehension for the lookup table, but dict comprehensions are "last value wins" for duplicate keys. In cases where multiple solutions share the same exercise_label, that would change semantics compared to the original nested-loop+break (first match wins). Consider updating the note to recommend manual population / setdefault when first-match behavior must be preserved, and only suggest a comprehension when duplicates are impossible or last-match is desired.

Suggested change
**Action:** Use dictionary comprehensions or manual population (like `solutions_by_label = {sol['exercise_label']: sol for sol in material['solutions']}`) to eliminate inner loops when mapping keys to values.
**Action:** Preserve the original duplicate-key semantics when building the lookup: if the nested-loop version used `break` on the first match, populate manually with `setdefault` (for example, `solutions_by_label = {}; for sol in material['solutions']: solutions_by_label.setdefault(sol['exercise_label'], sol)`) so the first match wins. Only use a dict comprehension like `solutions_by_label = {sol['exercise_label']: sol for sol in material['solutions']}` when duplicate labels are impossible or when last-match-wins behavior is desired.
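
The difference between the two population styles can be seen on a small input with a duplicated label. The sample records below are hypothetical; the `exercise_label` key and the `setdefault` approach are the ones named in the suggestion.

```python
# Hypothetical solutions list in which two entries share one label.
solutions = [
    {"exercise_label": "ex-1", "text": "first"},
    {"exercise_label": "ex-1", "text": "second"},
]

# Dict comprehension: a later duplicate overwrites the earlier one.
last_wins = {sol["exercise_label"]: sol for sol in solutions}

# setdefault: the key is only set when absent, so the first entry is kept,
# matching a nested loop that breaks on the first match.
first_wins = {}
for sol in solutions:
    first_wins.setdefault(sol["exercise_label"], sol)
```

When labels are guaranteed unique the two tables are identical, so the choice only matters for inputs that can contain duplicates.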
