fix: normalize file-level node IDs to stem-based format (fixes #1033)#1065
fix: normalize file-level node IDs to stem-based format (fixes #1033)#1065adityachaudhary99 wants to merge 4 commits into
Conversation
- Add _file_node_id(path) helper that returns _make_id(_file_stem(path)) - Use _file_node_id for all file-level node IDs instead of _make_id(str(path)) - Update all import resolution targets to reference _file_node_id format - Update extract() legacy remap to handle both old formats - Update tests to use _file_node_id This ensures AST and semantic subagent nodes for the same file use identical node IDs (parent_dir_stem), fixing the split-node bug where one physical file appeared as two disconnected nodes (safishamsi#1033).
- ensure_named_node now always uses stem-qualified IDs - Same fix for superclass/inheritance resolution in walk() - Same fix for C#, Swift, C++, Java base type fallbacks - Removes bare-name fallback that caused cross-file collisions Previously, _make_id(name) (bare, no stem) was used as fallback when _make_id(stem, name) was not in the per-file seen_ids set, causing identically-named entities in different files to produce colliding IDs. This caused the second entity's node to overwrite the first in the NetworkX graph, losing one entity entirely (safishamsi#952).
7b75c77 to
b471a6e
Compare
AST file nodes were ID'd from the full relative path plus extension
(match_script_pipeline_step_py) while semantic subagents follow the
{parent_dir}_{stem} spec (script_pipeline_step), so every file split
into two disconnected ghost nodes.
Fix at the single remap chokepoint in extract(): file node IDs and all
edge endpoints already funnel through the #502 relative-path remap, so
changing that remap to emit _file_node_id (one parent dir, no extension)
converts the node and every referencing edge together - Python, TS, Lua,
C and bash import edges all stay connected. symbol_resolution pre-computes
the canonical form directly (bypassing the remap) so it is synced too.
Per-site conversion (as attempted in #1038/#1065) orphans edges because
it moves the node without the edge targets; the chokepoint approach
avoids that entirely.
Backward compat for existing graphs: graphify extract --force, as the
skill.md spec already documents.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for tackling this @adityachaudhary99. I went with a different approach for #1033 and am going to close this one - want to explain why so it's useful feedback. This PR converts the per-language extractor sites individually ( Two concrete gaps in this PR:
The fix I shipped ( Appreciate the effort here - the diagnosis was right, it was the surface area of the fix that was the tricky part. |
Problem
AST extractor file-level node IDs used make_id(str(path)) (e.g., script_pipeline_step_py with extension). The semantic subagent uses {parent_dir}{filename_without_ext} (e.g., script_pipeline_step). This mismatch caused the same file to appear as two unconnected nodes in the knowledge graph.
Fix
Scope
Affected extractors: JS/TS, Python, Go, Rust, Zig, Fortran, Julia, Pascal, Bash, Astro, Svelte, Objective-C, SQL, C/C++/C#, Java, Kotlin, Scala, PHP, Groovy, Markdown, plus import resolution and Pascal unit maps.