learn tool hangs on large codebases with many folders/subfolders #1
Description
Problem
The learn MCP tool gets stuck or appears frozen when indexing large codebases with many folders and subfolders.
Root Causes
- No `.gitignore` support: `discover_files()` uses `Path.rglob("*")` with only a hardcoded `SKIP_DIRS` set. Nested `node_modules`, build artifacts, and generated files all get traversed and indexed, causing 10x+ slowdown on real projects.
- No checkpointing: Metadata is saved only after ALL embeddings complete. An interruption at chunk 99K of 100K loses all progress and requires a full restart.
- Destructive full-index: `index_codebase()` deletes the entire ChromaDB collection before re-creating it, preventing any form of resumption.
- Weak progress reporting: File discovery phase emits zero progress events. Embedding progress lacks ETA, making the tool appear frozen.
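To make the last point concrete: an ETA can be derived from the throughput observed so far. The following is only a minimal sketch, not the tool's actual reporting code. `ProgressTracker`, its `advance()` method, and the shape of the returned event are hypothetical names chosen for illustration.

```python
import time


class ProgressTracker:
    """Hypothetical helper: tracks completed items and estimates time remaining."""

    def __init__(self, total, clock=time.monotonic):
        self.total = total
        self.done = 0
        self._clock = clock  # injectable so the estimate can be tested
        self._start = clock()

    def advance(self, n=1):
        """Record n finished items and return a progress event with an ETA."""
        self.done += n
        elapsed = self._clock() - self._start
        # Average rate since the start; None means "not enough data yet".
        rate = self.done / elapsed if elapsed > 0 else 0.0
        eta = (self.total - self.done) / rate if rate > 0 else None
        return {"done": self.done, "total": self.total, "eta_seconds": eta}
```

An average rate is the simplest choice. A moving average would react faster if embedding speed changes partway through a run.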
Proposed Fix
- Rewrite `discover_files()` to use `os.walk()` with `pathspec` for `.gitignore` support and directory pruning
- Add `max_files` safety limit (addresses SECURITY_REVIEW HIGH-003: Unbounded Resource Consumption)
- Switch from `delete_collection()` + `create_collection()` to `get_or_create_collection()` + `upsert()`
- Add resumable checkpointing with atomic writes every 1000 chunks
- Improve progress reporting with discovery events and ETA
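The directory pruning in the first item relies on `os.walk()` letting the caller edit `dirnames` in place, which `Path.rglob("*")` cannot do. Below is a rough sketch of that idea, with several assumptions: `fnmatch` from the standard library stands in for `pathspec` (it does not implement full `.gitignore` semantics such as negation or anchoring), the `SKIP_DIRS` contents and the `ignore_patterns` parameter are made up for illustration, and raising an error when `max_files` is exceeded is one possible policy, not necessarily the intended one.

```python
import fnmatch
import os
from pathlib import Path

# Illustrative values only; the real SKIP_DIRS set is not shown in this issue.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def discover_files(root, ignore_patterns=(), max_files=10_000):
    """Walk `root`, pruning ignored directories before descending into them."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)

        def ignored(name):
            # Match against both the bare name and the path relative to root.
            rel = (rel_dir / name).as_posix()
            return any(
                fnmatch.fnmatch(name, pat) or fnmatch.fnmatch(rel, pat)
                for pat in ignore_patterns
            )

        # Slice assignment mutates the list os.walk() iterates over,
        # so pruned directories are never entered.
        dirnames[:] = sorted(
            d for d in dirnames if d not in SKIP_DIRS and not ignored(d)
        )
        for name in sorted(filenames):
            if ignored(name):
                continue
            if len(found) >= max_files:
                # Assumed policy: fail loudly instead of indexing without bound.
                raise RuntimeError(f"more than {max_files} files under {root}")
            found.append(Path(dirpath) / name)
    return found
```

With `pathspec`, the `ignored()` helper would instead consult a spec compiled from the repository's `.gitignore` lines. The pruning step itself stays the same.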
Impact
- No changes to MCP tool interface (`learn` params stay the same)
- Backward compatible (all new params have defaults)
- Also addresses security findings HIGH-003 and LOW-009