Skip to content

fix: normalize file-level node IDs to stem-based format (fixes #1033)#1065

Closed
adityachaudhary99 wants to merge 4 commits into
safishamsi:v8from
adityachaudhary99:fix/node-id-mismatch
Closed

fix: normalize file-level node IDs to stem-based format (fixes #1033)#1065
adityachaudhary99 wants to merge 4 commits into
safishamsi:v8from
adityachaudhary99:fix/node-id-mismatch

Conversation

@adityachaudhary99
Copy link
Copy Markdown
Contributor

Problem

AST extractor file-level node IDs used make_id(str(path)) (e.g., script_pipeline_step_py with extension). The semantic subagent uses {parent_dir}{filename_without_ext} (e.g., script_pipeline_step). This mismatch caused the same file to appear as two unconnected nodes in the knowledge graph.

Fix

  • Added _file_node_id(path) helper -> _make_id(_file_stem(path)) matching subagent format
  • Replaced all ~52 _make_id(str(path)) file-level ID calls across extractors, JS resolution, import targets, and extract() remap
  • Changed extract() legacy ID remap fallback from rel = path to continue — files outside root are silently skipped instead of using stale absolute paths
  • Added missing imports for extract_svelte and _resolve_js_module_path in test file

Scope

Affected extractors: JS/TS, Python, Go, Rust, Zig, Fortran, Julia, Pascal, Bash, Astro, Svelte, Objective-C, SQL, C/C++/C#, Java, Kotlin, Scala, PHP, Groovy, Markdown, plus import resolution and Pascal unit maps.

- Add _file_node_id(path) helper that returns _make_id(_file_stem(path))
- Use _file_node_id for all file-level node IDs instead of _make_id(str(path))
- Update all import resolution targets to reference _file_node_id format
- Update extract() legacy remap to handle both old formats
- Update tests to use _file_node_id

This ensures AST and semantic subagent nodes for the same file use
identical node IDs (parent_dir_stem), fixing the split-node bug where
one physical file appeared as two disconnected nodes (safishamsi#1033).
- ensure_named_node now always uses stem-qualified IDs
- Same fix for superclass/inheritance resolution in walk()
- Same fix for C#, Swift, C++, Java base type fallbacks
- Removes bare-name fallback that caused cross-file collisions

Previously, _make_id(name) (bare, no stem) was used as fallback when
_make_id(stem, name) was not in the per-file seen_ids set, causing
identically-named entities in different files to produce colliding IDs.
This caused the second entity's node to overwrite the first in the
NetworkX graph, losing one entity entirely (safishamsi#952).
safishamsi added a commit that referenced this pull request May 31, 2026
AST file nodes were ID'd from the full relative path plus extension
(match_script_pipeline_step_py) while semantic subagents follow the
{parent_dir}_{stem} spec (script_pipeline_step), so every file split
into two disconnected ghost nodes.

Fix at the single remap chokepoint in extract(): file node IDs and all
edge endpoints already funnel through the #502 relative-path remap, so
changing that remap to emit _file_node_id (one parent dir, no extension)
converts the node and every referencing edge together - Python, TS, Lua,
C and bash import edges all stay connected. symbol_resolution pre-computes
the canonical form directly (bypassing the remap) so it is synced too.

Per-site conversion (as attempted in #1038/#1065) orphans edges because
it moves the node without the edge targets; the chokepoint approach
avoids that entirely.

Backward compat for existing graphs: graphify extract --force, as the
skill.md spec already documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@safishamsi
Copy link
Copy Markdown
Owner

Thanks for tackling this @adityachaudhary99. I went with a different approach for #1033 and am going to close this one - want to explain why so it's useful feedback.

This PR converts the per-language extractor sites individually (_make_id(str(path)) -> stem-based). The problem is that file IDs aren't only node IDs - they're also import/dependency edge targets in many resolvers (Python relative imports, the JS/TS resolver, C includes, Lua require, bash source). Converting the node format without converting every edge-target construction site in lockstep just relocates the orphan from "AST vs semantic" to "file node vs its own import edges."

Two concrete gaps in this PR:

  • _resolve_lua_import_target (extract.py:2025/2029) still returns the old-format ID, so Lua require edges would point at a phantom after the file node moved.
  • symbol_resolution.py:_file_node_id_for_path is untouched, so bash source edges orphan.

The fix I shipped (c898dc6) instead changes the single remap chokepoint in extract() - every file node ID and every edge endpoint already funnels through the #502 relative-path remap, so converting that one pass moves the node and all referencing edges together, atomically, across all languages. No per-site churn, no orphan risk.

Appreciate the effort here - the diagnosis was right, it was the surface area of the fix that was the tricky part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants