fix: normalize file-level node IDs and inline rationale metadata by kedarvartak · Pull Request #1038 · safishamsi/graphify

kedarvartak · 2026-05-27T04:15:00Z

Description

Problem

Two related bugs caused split/orphan nodes in graph.json when files lived under subdirectories:

ID format mismatch
The AST extractor generated file-level node IDs from the full relative path including extension (_make_id(str(path))), producing IDs like script_pipeline_step_py.

Semantic subagents follow the skill.md spec and generate IDs in {parent_dir}_{stem} format (without extension), e.g. script_pipeline_step.

Since Step 3C deduplication is keyed on id, the nodes never merged, causing duplicated disconnected graph nodes with separate edge sets.
Rationale orphan nodes
The AST extractor emitted docstrings and # NOTE: comments as standalone file_type="rationale" nodes connected via rationale_for edges.

These created single-edge islands similar to the semantic-side rationale issue fixed in 0.8.16.

Fix

`extract.py`

Replace all file_nid = _make_id(str(path)) usages (27 call sites) with _make_id(_file_stem(path)).
Reuse the existing _file_stem() helper to normalize IDs into {parent_dir}_{stem} format.
Rewrite _add_rationale() to store rationale/docstring text directly on the parent node (file, class, or function) as a "rationale" attribute instead of creating separate rationale nodes + rationale_for edges.

`build.py`

Add _strip_ext_suffix() with regex coverage for known language suffixes (_py, _ts, _js, etc.).
Apply normalization during build_from_json() canonicalization to all node IDs and edge endpoints.
Existing cached graph.json files now normalize on load without requiring --force re-extraction.

`tests/test_rationale.py`

Update tests to assert against the new inline "rationale" attribute instead of standalone rationale nodes and rationale_for edges.

Verification

Before

_make_id(str(Path("script/pipeline_step.py")))
# -> "script_pipeline_step_py"

… and compatibility with cached graphs

safishamsi · 2026-05-27T10:17:17Z

Good idea, but the implementation has some gaps that need fixing before merge:

Missed call sites - _strip_ext_suffix is not applied at 4 places in extract.py that also generate node IDs from filenames:

Svelte component IDs (~line 2647)
Astro component IDs (~line 2785)
Symbol resolution path (~line 4946)
ID remap block (~line 8538)

Skipping these means the normalization is inconsistent — some nodes get stripped suffixes, others don't, and cross-file edges that depend on ID matching will silently break.

Over-broad suffix stripping - _strip_ext_suffix as written will corrupt IDs like format_c, test_py, parse_json (variable/function names that happen to end in _<ext>). The stripping needs to be restricted to IDs that are known to come from file paths, not applied globally.

Suggested approach: instead of a global strip, normalize only at the point where a file path is converted to an ID (i.e., inside _file_stem or a new _canonical_file_id helper), and leave symbol IDs untouched. Happy to review a revised version!

…xtensions and avoid ID collisions

kedarvartak · 2026-05-27T10:51:28Z

@safishamsi updated as per your comment

…/canonical-file-node-ids-1033

kedarvartak · 2026-05-28T04:10:56Z

hi @safishamsi awaiting your review

safishamsi · 2026-05-28T13:46:50Z

The ID normalization fix is correct and addresses a real bug (#1033) — but the PR has two problems:

1. The sweep is incomplete

Four callsites still use _make_id(str(path)) instead of _make_id(_file_stem(path)):

_resolve_js_import_target (line ~765 in extract.py)
extract_svelte dynamic-import resolution (~lines 2700, 2714)
extract_astro dynamic-import resolution (~line 2766)
_apply_symbol_resolution_facts (~line 4986)

These will still produce mismatched IDs and dangling edges after the rest of the fix lands. Finish the sweep.

2. The rationale refactor is a separate breaking change — split it out

Storing rationale as an inline attribute instead of a file_type="rationale" node changes the schema in a way that breaks several downstream consumers that are NOT updated in this PR: VALID_FILE_TYPES in validate.py, _FILE_TYPE_SYNONYMS in build.py, file_type == "rationale" filters in extract.py and report.py. The LLM semantic backends also still emit rationale nodes — so you'd end up with inconsistent behaviour (AST = attribute, LLM = node).

Please open a separate PR for the rationale refactor once the full schema impact is mapped. Ship the ID normalization sweep on its own — that's the high-value fix.

… and update related extraction logic

kedarvartak · 2026-05-28T17:29:56Z

@safishamsi hi, sorry for the reiterations, all callsites are now migrated to file stem. I'm currently working on the rationale refactor. pls check

safishamsi · 2026-05-28T17:37:45Z

Hey, thanks for this. I’m a little caught up today, will look into it over the weekend! Thanks

…

On Thu, 28 May 2026 at 18:30, Kedar Vartak ***@***.***> wrote: *kedarvartak* left a comment (safishamsi/graphify#1038) <#1038 (comment)> @safishamsi <https://github.com/safishamsi> hi, sorry for the reiterations, all callsites are now migrated to file stem. I'm currently working on the rationale refactor. pls check — Reply to this email directly, view it on GitHub <#1038?email_source=notifications&email_token=BTSTP62UOHCF3VDBO3RNNET45BZSXA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJWGY3DGMRXGU32M4TFMFZW63VHNVSW45DJN5XKKZLWMVXHJLDGN5XXIZLSL5RWY2LDNM#issuecomment-4566632757>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BTSTP65ZEMGMLW6WSLEUOWT45BZSXAVCNFSM6AAAAACZOS3L6SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DKNRWGYZTENZVG4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

AST file nodes were ID'd from the full relative path plus extension (match_script_pipeline_step_py) while semantic subagents follow the {parent_dir}_{stem} spec (script_pipeline_step), so every file split into two disconnected ghost nodes. Fix at the single remap chokepoint in extract(): file node IDs and all edge endpoints already funnel through the #502 relative-path remap, so changing that remap to emit _file_node_id (one parent dir, no extension) converts the node and every referencing edge together - Python, TS, Lua, C and bash import edges all stay connected. symbol_resolution pre-computes the canonical form directly (bypassing the remap) so it is synced too. Per-site conversion (as attempted in #1038/#1065) orphans edges because it moves the node without the edge targets; the chokepoint approach avoids that entirely. Backward compat for existing graphs: graphify extract --force, as the skill.md spec already documents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

safishamsi · 2026-05-31T11:24:53Z

Thanks for this @kedarvartak - this PR did more than #1065 (the backward-compat normalizer and the bash/symbol_resolution sync were the right instincts). I ended up shipping a different fix for #1033 in c898dc6 and am closing this one; explaining the reasoning since the effort was real.

On the ID fix: like #1065, this converts the per-extractor sites individually. File IDs are also import/dependency edge targets across many resolvers, so per-site conversion risks moving a node without its edges. This PR still misses _resolve_lua_import_target (extract.py:2025/2029), so Lua require edges would orphan. The approach I took changes the single remap chokepoint in extract() instead - every file node and edge endpoint already funnels through the #502 relative-path remap, so one change converts node + all edges atomically across every language (verified Python/TS/Lua/C/bash all stay connected).

On the build.py extension-suffix normalizer: I left this out deliberately. Stripping _py/_ts/_js suffixes from existing graphs is fragile - it can clip legitimate symbol IDs (format_c, test_py, a function literally named parse_js). Your _looks_like_file_node guard mitigated it, but the project's documented migration path is already graphify extract --force, which is unambiguous. I'd rather not carry a heuristic that can mis-fire on symbol IDs.

On the rationale inlining (sub-issue #2): file_type == "rationale" nodes with rationale_for edges are intentional and have dedicated tests; the cross-file resolver and god-node ranking explicitly special-case them. Changing them to attributes is a separate design call, not part of the ID-mismatch bug. Worth its own issue if there's a case for it.

Genuinely appreciate the thoroughness here - closing only because the chokepoint approach is the smaller, safer surface area for the core fix.

fix: update file node ID generation to use _file_stem for consistency…

639bb64

… and compatibility with cached graphs

kedarvartak mentioned this pull request May 27, 2026

AST and semantic extractors generate mismatched node IDs for the same file, splitting it into two disconnected nodes #1033

Closed

fix: normalize file node IDs by using _file_stem to handle compound e…

34f97da

…xtensions and avoid ID collisions

Merge branch 'v8' of https://github.com/kedarvartak/graphify into fix…

427e11f

…/canonical-file-node-ids-1033

feat: implement _file_stem function for consistent file ID generation…

e3cc683

… and update related extraction logic

kedarvartak added 2 commits May 28, 2026 23:10

fix: limit legacy extension stripping to file nodes

7d85b09

test: keep rationale schema unchanged in id fix

4b621e8

safishamsi closed this May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: normalize file-level node IDs and inline rationale metadata#1038

fix: normalize file-level node IDs and inline rationale metadata#1038
kedarvartak wants to merge 6 commits into
safishamsi:v8from
kedarvartak:fix/canonical-file-node-ids-1033

kedarvartak commented May 27, 2026

Uh oh!

safishamsi commented May 27, 2026

Uh oh!

kedarvartak commented May 27, 2026

Uh oh!

kedarvartak commented May 28, 2026

Uh oh!

safishamsi commented May 28, 2026

Uh oh!

kedarvartak commented May 28, 2026

Uh oh!

safishamsi commented May 28, 2026 via email

Uh oh!

safishamsi commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kedarvartak commented May 27, 2026

Description

Problem

Fix

extract.py

build.py

tests/test_rationale.py

Verification

Before

Uh oh!

safishamsi commented May 27, 2026

Uh oh!

kedarvartak commented May 27, 2026

Uh oh!

kedarvartak commented May 28, 2026

Uh oh!

safishamsi commented May 28, 2026

Uh oh!

kedarvartak commented May 28, 2026

Uh oh!

safishamsi commented May 28, 2026 via email

Uh oh!

safishamsi commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`extract.py`

`build.py`

`tests/test_rationale.py`