fix: normalize file-level node IDs to stem-based format (fixes #1033) by adityachaudhary99 · Pull Request #1065 · safishamsi/graphify

adityachaudhary99 · 2026-05-28T16:48:08Z

Problem

AST extractor file-level node IDs used make_id(str(path)) (e.g., script_pipeline_step_py with extension). The semantic subagent uses {parent_dir}{filename_without_ext} (e.g., script_pipeline_step). This mismatch caused the same file to appear as two unconnected nodes in the knowledge graph.

Fix

Added _file_node_id(path) helper -> _make_id(_file_stem(path)) matching subagent format
Replaced all ~52 _make_id(str(path)) file-level ID calls across extractors, JS resolution, import targets, and extract() remap
Changed extract() legacy ID remap fallback from rel = path to continue — files outside root are silently skipped instead of using stale absolute paths
Added missing imports for extract_svelte and _resolve_js_module_path in test file

Scope

Affected extractors: JS/TS, Python, Go, Rust, Zig, Fortran, Julia, Pascal, Bash, Astro, Svelte, Objective-C, SQL, C/C++/C#, Java, Kotlin, Scala, PHP, Groovy, Markdown, plus import resolution and Pascal unit maps.

- Add _file_node_id(path) helper that returns _make_id(_file_stem(path)) - Use _file_node_id for all file-level node IDs instead of _make_id(str(path)) - Update all import resolution targets to reference _file_node_id format - Update extract() legacy remap to handle both old formats - Update tests to use _file_node_id This ensures AST and semantic subagent nodes for the same file use identical node IDs (parent_dir_stem), fixing the split-node bug where one physical file appeared as two disconnected nodes (safishamsi#1033).

- ensure_named_node now always uses stem-qualified IDs - Same fix for superclass/inheritance resolution in walk() - Same fix for C#, Swift, C++, Java base type fallbacks - Removes bare-name fallback that caused cross-file collisions Previously, _make_id(name) (bare, no stem) was used as fallback when _make_id(stem, name) was not in the per-file seen_ids set, causing identically-named entities in different files to produce colliding IDs. This caused the second entity's node to overwrite the first in the NetworkX graph, losing one entity entirely (safishamsi#952).

AST file nodes were ID'd from the full relative path plus extension (match_script_pipeline_step_py) while semantic subagents follow the {parent_dir}_{stem} spec (script_pipeline_step), so every file split into two disconnected ghost nodes. Fix at the single remap chokepoint in extract(): file node IDs and all edge endpoints already funnel through the #502 relative-path remap, so changing that remap to emit _file_node_id (one parent dir, no extension) converts the node and every referencing edge together - Python, TS, Lua, C and bash import edges all stay connected. symbol_resolution pre-computes the canonical form directly (bypassing the remap) so it is synced too. Per-site conversion (as attempted in #1038/#1065) orphans edges because it moves the node without the edge targets; the chokepoint approach avoids that entirely. Backward compat for existing graphs: graphify extract --force, as the skill.md spec already documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

safishamsi · 2026-05-31T11:24:38Z

Thanks for tackling this @adityachaudhary99. I went with a different approach for #1033 and am going to close this one - want to explain why so it's useful feedback.

This PR converts the per-language extractor sites individually (_make_id(str(path)) -> stem-based). The problem is that file IDs aren't only node IDs - they're also import/dependency edge targets in many resolvers (Python relative imports, the JS/TS resolver, C includes, Lua require, bash source). Converting the node format without converting every edge-target construction site in lockstep just relocates the orphan from "AST vs semantic" to "file node vs its own import edges."

Two concrete gaps in this PR:

_resolve_lua_import_target (extract.py:2025/2029) still returns the old-format ID, so Lua require edges would point at a phantom after the file node moved.
symbol_resolution.py:_file_node_id_for_path is untouched, so bash source edges orphan.

The fix I shipped (c898dc6) instead changes the single remap chokepoint in extract() - every file node ID and every edge endpoint already funnels through the #502 relative-path remap, so converting that one pass moves the node and all referencing edges together, atomically, across all languages. No per-site churn, no orphan risk.

Appreciate the effort here - the diagnosis was right, it was the surface area of the fix that was the tricky part.

adityachaudhary99 added 4 commits May 28, 2026 22:47

chore: remove local-only artifacts from branch

1df5e97

fix: guard extract remap against paths outside root

b471a6e

adityachaudhary99 force-pushed the fix/node-id-mismatch branch from 7b75c77 to b471a6e Compare May 28, 2026 17:23

safishamsi closed this May 31, 2026

safishamsi mentioned this pull request May 31, 2026

fix: normalize file-level node IDs and inline rationale metadata #1038

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: normalize file-level node IDs to stem-based format (fixes #1033)#1065

fix: normalize file-level node IDs to stem-based format (fixes #1033)#1065
adityachaudhary99 wants to merge 4 commits into
safishamsi:v8from
adityachaudhary99:fix/node-id-mismatch

adityachaudhary99 commented May 28, 2026

Uh oh!

safishamsi commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

adityachaudhary99 commented May 28, 2026

Problem

Fix

Scope

Uh oh!

safishamsi commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants