learn tool hangs on large codebases with many folders/subfolders #1

@jaggernaut007

Problem

The learn MCP tool gets stuck or appears frozen when indexing large codebases with many folders and subfolders.

Root Causes

  1. No .gitignore support: discover_files() uses Path.rglob("*") with only a hardcoded SKIP_DIRS set. Nested node_modules, build artifacts, and generated files all get traversed and indexed, causing 10x+ slowdown on real projects.

  2. No checkpointing: Metadata is saved only after ALL embeddings complete. An interruption at chunk 99K of 100K loses all progress and requires a full restart.

  3. Destructive full index: index_codebase() deletes the entire ChromaDB collection before re-creating it, preventing any form of resumption.

  4. Weak progress reporting: The file discovery phase emits zero progress events. Embedding progress lacks an ETA, making the tool appear frozen.
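
To illustrate root cause 1: Path.rglob("*") gives the caller no hook to skip a subtree, whereas os.walk() lets the caller edit the directory list in place so ignored directories are never entered. Below is a minimal sketch of that pruning, not the project's actual code. The name discover_files_pruned, the IGNORE_PATTERNS list, and the max_files default are assumptions for illustration, and fnmatch on bare names is a stdlib stand-in for real .gitignore matching (which pathspec would provide).

```python
import fnmatch
import os
from pathlib import Path

# Hypothetical stand-in patterns. A real fix would load rules from .gitignore
# (e.g. via pathspec) instead of matching bare names with fnmatch.
IGNORE_PATTERNS = ["node_modules", "build", ".git", "*.pyc"]


def is_ignored(name, patterns=IGNORE_PATTERNS):
    """Return True if a file or directory name matches any ignore pattern."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def discover_files_pruned(root, max_files=10_000, patterns=IGNORE_PATTERNS):
    """Walk root top-down, pruning ignored directories before descending."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] edits the list os.walk iterates next,
        # so removed directories are never traversed at all.
        dirnames[:] = sorted(d for d in dirnames if not is_ignored(d, patterns))
        for name in sorted(filenames):
            if is_ignored(name, patterns):
                continue
            found.append(Path(dirpath) / name)
            # Safety limit: stop as soon as the cap is reached.
            if len(found) >= max_files:
                return found
    return found
```

With this shape, a nested node_modules directory costs one name comparison instead of a full traversal of its contents.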

Proposed Fix

  • Rewrite discover_files() to use os.walk() with pathspec for .gitignore support and directory pruning
  • Add max_files safety limit (addresses SECURITY_REVIEW HIGH-003: Unbounded Resource Consumption)
  • Switch from delete_collection()+create_collection() to get_or_create_collection()+upsert()
  • Add resumable checkpointing with atomic writes every 1000 chunks
  • Improve progress reporting with discovery events and ETA
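
The checkpointing bullet above can be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: the function names, the JSON checkpoint layout ({"done": N}), and the embed_and_upsert callback are all hypothetical. The atomic write uses the common pattern of writing to a temporary file in the same directory and then calling os.replace(), so a crash mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
from pathlib import Path

CHECKPOINT_EVERY = 1000  # interval taken from the proposal above


def save_checkpoint(path, state):
    """Write state as JSON atomically: temp file first, then rename over path."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"done": 0}


def index_chunks(chunks, checkpoint_path, embed_and_upsert, every=CHECKPOINT_EVERY):
    """Process chunks from the last checkpoint onward; return the start index."""
    start = load_checkpoint(checkpoint_path)["done"]
    for i in range(start, len(chunks)):
        embed_and_upsert(i, chunks[i])  # hypothetical callback, assumed idempotent
        if (i + 1) % every == 0 or i + 1 == len(chunks):
            save_checkpoint(checkpoint_path, {"done": i + 1})
    return start
```

After an interruption, chunks processed since the last checkpoint are embedded again on resume. That is only safe if the write is idempotent, which is one reason the proposal pairs checkpointing with upsert() rather than delete-and-recreate.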

Impact

  • No changes to MCP tool interface (learn params stay the same)
  • Backward compatible (all new params have defaults)
  • Also addresses security findings HIGH-003 and LOW-009
