fix: normalize file-level node IDs and inline rationale metadata #1038
kedarvartak wants to merge 6 commits into
… and compatibility with cached graphs
|
Good idea, but the implementation has some gaps that need fixing before merge:

Missed call sites -

Skipping these means the normalization is inconsistent: some nodes get stripped suffixes, others don't, and cross-file edges that depend on ID matching will silently break.

Over-broad suffix stripping -

Suggested approach: instead of a global strip, normalize only at the point where a file path is converted to an ID (i.e., inside |
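To illustrate the over-broad stripping concern, here is a minimal sketch. The helper names (`slug`, `strip_ext_suffix_globally`, `file_id_from_path`), the suffix list, and the slug rule are all assumptions for illustration, not graphify's actual code.

```python
import re
from pathlib import PurePosixPath

# Hypothetical suffix list; the real set of extensions is an assumption.
_EXT_SUFFIX = re.compile(r"_(py|ts|js)$")


def slug(text: str) -> str:
    """Assumed ID rule: lowercase, runs of non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def strip_ext_suffix_globally(node_id: str) -> str:
    """Global strip: cannot tell a file ID from a symbol ID."""
    return _EXT_SUFFIX.sub("", node_id)


def file_id_from_path(path: str) -> str:
    """Normalize only where a path becomes an ID: one parent dir plus stem."""
    p = PurePosixPath(path)
    parent = p.parent.name
    return slug(f"{parent}_{p.stem}" if parent else p.stem)


# A file ID is fixed as intended by the global strip...
file_id = strip_ext_suffix_globally("script_pipeline_step_py")
# ...but a symbol whose name merely ends in "_py" is mangled too.
symbol_id = strip_ext_suffix_globally("utils_compile_py")
# Path-based normalization never touches symbol IDs at all.
path_id = file_id_from_path("script/pipeline_step.py")
```

Because the path-based helper only runs on real file paths, it has no way to rewrite a function or class ID by accident.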
…xtensions and avoid ID collisions
|
@safishamsi updated as per your comment |
…/canonical-file-node-ids-1033
|
hi @safishamsi awaiting your review |
|
The ID normalization fix is correct and addresses a real bug (#1033), but the PR has two problems:

1. The sweep is incomplete

Four callsites still use

These will still produce mismatched IDs and dangling edges after the rest of the fix lands. Finish the sweep.

2. The rationale refactor is a separate breaking change: split it out

Storing rationale as an inline attribute instead of a

Please open a separate PR for the rationale refactor once the full schema impact is mapped. Ship the ID normalization sweep on its own; that's the high-value fix. |
… and update related extraction logic
|
@safishamsi hi, sorry for the reiterations, all callsites are now migrated to file stem. I'm currently working on the rationale refactor. pls check |
|
Hey, thanks for this. I’m a little caught up today, will look into it over
the weekend!
Thanks
On Thu, 28 May 2026 at 18:30, Kedar Vartak wrote:
kedarvartak left a comment (safishamsi/graphify#1038):
@safishamsi hi, sorry for the reiterations, all callsites are now migrated to file stem. I'm currently working on the rationale refactor. pls check
|
AST file nodes were ID'd from the full relative path plus extension
(match_script_pipeline_step_py) while semantic subagents follow the
{parent_dir}_{stem} spec (script_pipeline_step), so every file split
into two disconnected ghost nodes.
Fix at the single remap chokepoint in extract(): file node IDs and all
edge endpoints already funnel through the #502 relative-path remap, so
changing that remap to emit _file_node_id (one parent dir, no extension)
converts the node and every referencing edge together - Python, TS, Lua,
C and bash import edges all stay connected. symbol_resolution pre-computes
the canonical form directly (bypassing the remap) so it is synced too.
Per-site conversion (as attempted in #1038/#1065) orphans edges because
it moves the node without the edge targets; the chokepoint approach
avoids that entirely.
Backward compat for existing graphs: graphify extract --force, as the
skill.md spec already documents.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
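
The chokepoint idea in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extract(): `make_id`, `file_node_id`, and `remap_file_ids` are hypothetical stand-ins, the slug rule is assumed, and the `source`/`target` edge keys are assumed as well.

```python
import re
from pathlib import PurePosixPath


def make_id(text: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def file_node_id(rel_path: str) -> str:
    """Canonical file ID: one parent directory plus stem, no extension."""
    p = PurePosixPath(rel_path)
    parent = p.parent.name
    return make_id(f"{parent}_{p.stem}" if parent else p.stem)


def remap_file_ids(nodes: list[dict], edges: list[dict], files: list[str]):
    """Rewrite file node IDs and every edge endpoint through one mapping."""
    # Old (full path + extension) ID -> canonical ID, built once.
    remap = {make_id(f): file_node_id(f) for f in files}
    new_nodes = [{**n, "id": remap.get(n["id"], n["id"])} for n in nodes]
    new_edges = [
        {
            **e,
            "source": remap.get(e["source"], e["source"]),
            "target": remap.get(e["target"], e["target"]),
        }
        for e in edges
    ]
    return new_nodes, new_edges


files = ["match/script/pipeline_step.py", "match/script/util.py"]
nodes = [{"id": make_id(f)} for f in files]
edges = [{"source": nodes[0]["id"], "target": nodes[1]["id"], "relation": "imports"}]
new_nodes, new_edges = remap_file_ids(nodes, edges, files)
```

Since nodes and edges pass through the same `remap` dictionary, an edge endpoint cannot be left pointing at an ID that no longer exists, which is the failure mode the message attributes to per-site conversion.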
|
Thanks for this @kedarvartak - this PR did more than #1065 (the backward-compat normalizer and the bash/symbol_resolution sync were the right instincts). I ended up shipping a different fix for #1033 in

On the ID fix: like #1065, this converts the per-extractor sites individually. File IDs are also import/dependency edge targets across many resolvers, so per-site conversion risks moving a node without its edges. This PR still misses

On the

On the rationale inlining (sub-issue #2):

Genuinely appreciate the thoroughness here - closing only because the chokepoint approach is the smaller, safer surface area for the core fix. |
Description

Problem

Two related bugs caused split/orphan nodes in graph.json when files lived under subdirectories:

ID format mismatch

The AST extractor generated file-level node IDs from the full relative path including extension (_make_id(str(path))), producing IDs like script_pipeline_step_py.
Semantic subagents follow the skill.md spec and generate IDs in {parent_dir}_{stem} format (without extension), e.g. script_pipeline_step.
Since Step 3C deduplication is keyed on id, the nodes never merged, causing duplicated disconnected graph nodes with separate edge sets.

Rationale orphan nodes

The AST extractor emitted docstrings and # NOTE: comments as standalone file_type="rationale" nodes connected via rationale_for edges.
These created single-edge islands similar to the semantic-side rationale issue fixed in 0.8.16.

Fix
extract.py
- Replaced file_nid = _make_id(str(path)) usages (27 call sites) with _make_id(_file_stem(path)).
- Added a _file_stem() helper to normalize IDs into {parent_dir}_{stem} format.
- Changed _add_rationale() to store rationale/docstring text directly on the parent node (file, class, or function) as a "rationale" attribute instead of creating separate rationale nodes + rationale_for edges.

build.py
- Added _strip_ext_suffix() with regex coverage for known language suffixes (_py, _ts, _js, etc.).
- build_from_json() applies canonicalization to all node IDs and edge endpoints.
- graph.json files now normalize on load without requiring --force re-extraction.

tests/test_rationale.py
- Tests expect a "rationale" attribute instead of standalone rationale nodes and rationale_for edges.

Verification
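
One possible way to check a graph for the split-node symptom described under Problem is the heuristic below. It is a sketch only: `find_split_file_nodes` is a hypothetical helper, the suffix list is assumed, and the graph is assumed to be a dict with a `nodes` list whose entries carry an `id` key. Short IDs can produce false positives, so treat the result as a list of candidates to inspect.

```python
import re

# Assumed suffix list for illustration; extend to match the real extensions.
_EXT_SUFFIX = re.compile(r"_(py|ts|js)$")


def find_split_file_nodes(graph: dict) -> list[tuple[str, str]]:
    """Return (suffixed_id, canonical_id) pairs that look like one file split in two."""
    ids = {node["id"] for node in graph.get("nodes", [])}
    pairs = []
    for nid in sorted(ids):
        stripped = _EXT_SUFFIX.sub("", nid)
        if stripped == nid:
            continue  # no extension-like suffix, not an old-style file ID
        for other in sorted(ids):
            if other == nid:
                continue
            # Exact match, or the old ID carries extra leading path segments.
            if stripped == other or stripped.endswith("_" + other):
                pairs.append((nid, other))
    return pairs


graph = {
    "nodes": [
        {"id": "script_pipeline_step_py"},
        {"id": "script_pipeline_step"},
        {"id": "helper"},
    ]
}
split_pairs = find_split_file_nodes(graph)
```

An empty result after the fix, where the same input previously returned pairs, would be consistent with the IDs having merged.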
Before