Skip to content

fix: normalize file-level node IDs and inline rationale metadata#1038

Closed
kedarvartak wants to merge 6 commits into
safishamsi:v8from
kedarvartak:fix/canonical-file-node-ids-1033
Closed

fix: normalize file-level node IDs and inline rationale metadata#1038
kedarvartak wants to merge 6 commits into
safishamsi:v8from
kedarvartak:fix/canonical-file-node-ids-1033

Conversation

@kedarvartak
Copy link
Copy Markdown

Description

Problem

Two related bugs caused split/orphan nodes in graph.json when files lived under subdirectories:

  1. ID format mismatch
    The AST extractor generated file-level node IDs from the full relative path including extension (_make_id(str(path))), producing IDs like script_pipeline_step_py.

    Semantic subagents follow the skill.md spec and generate IDs in {parent_dir}_{stem} format (without extension), e.g. script_pipeline_step.

    Since Step 3C deduplication is keyed on id, the nodes never merged, causing duplicated disconnected graph nodes with separate edge sets.

  2. Rationale orphan nodes
    The AST extractor emitted docstrings and # NOTE: comments as standalone file_type="rationale" nodes connected via rationale_for edges.

    These created single-edge islands similar to the semantic-side rationale issue fixed in 0.8.16.


Fix

extract.py

  • Replace all file_nid = _make_id(str(path)) usages (27 call sites) with _make_id(_file_stem(path)).
  • Reuse the existing _file_stem() helper to normalize IDs into {parent_dir}_{stem} format.
  • Rewrite _add_rationale() to store rationale/docstring text directly on the parent node (file, class, or function) as a "rationale" attribute instead of creating separate rationale nodes + rationale_for edges.

build.py

  • Add _strip_ext_suffix() with regex coverage for known language suffixes (_py, _ts, _js, etc.).
  • Apply normalization during build_from_json() canonicalization to all node IDs and edge endpoints.
  • Existing cached graph.json files now normalize on load without requiring --force re-extraction.

tests/test_rationale.py

  • Update tests to assert against the new inline "rationale" attribute instead of standalone rationale nodes and rationale_for edges.

Verification

Before

_make_id(str(Path("script/pipeline_step.py")))
# -> "script_pipeline_step_py"

@safishamsi
Copy link
Copy Markdown
Owner

Good idea, but the implementation has some gaps that need fixing before merge:

Missed call sites - _strip_ext_suffix is not applied at 4 places in extract.py that also generate node IDs from filenames:

  • Svelte component IDs (~line 2647)
  • Astro component IDs (~line 2785)
  • Symbol resolution path (~line 4946)
  • ID remap block (~line 8538)

Skipping these means the normalization is inconsistent — some nodes get stripped suffixes, others don't, and cross-file edges that depend on ID matching will silently break.

Over-broad suffix stripping - _strip_ext_suffix as written will corrupt IDs like format_c, test_py, parse_json (variable/function names that happen to end in _<ext>). The stripping needs to be restricted to IDs that are known to come from file paths, not applied globally.

Suggested approach: instead of a global strip, normalize only at the point where a file path is converted to an ID (i.e., inside _file_stem or a new _canonical_file_id helper), and leave symbol IDs untouched. Happy to review a revised version!

@kedarvartak
Copy link
Copy Markdown
Author

@safishamsi updated as per your comment

@kedarvartak
Copy link
Copy Markdown
Author

hi @safishamsi awaiting your review

@safishamsi
Copy link
Copy Markdown
Owner

The ID normalization fix is correct and addresses a real bug (#1033) — but the PR has two problems:

1. The sweep is incomplete

Four callsites still use _make_id(str(path)) instead of _make_id(_file_stem(path)):

  • _resolve_js_import_target (line ~765 in extract.py)
  • extract_svelte dynamic-import resolution (~lines 2700, 2714)
  • extract_astro dynamic-import resolution (~line 2766)
  • _apply_symbol_resolution_facts (~line 4986)

These will still produce mismatched IDs and dangling edges after the rest of the fix lands. Finish the sweep.

2. The rationale refactor is a separate breaking change — split it out

Storing rationale as an inline attribute instead of a file_type="rationale" node changes the schema in a way that breaks several downstream consumers that are NOT updated in this PR: VALID_FILE_TYPES in validate.py, _FILE_TYPE_SYNONYMS in build.py, file_type == "rationale" filters in extract.py and report.py. The LLM semantic backends also still emit rationale nodes — so you'd end up with inconsistent behaviour (AST = attribute, LLM = node).

Please open a separate PR for the rationale refactor once the full schema impact is mapped. Ship the ID normalization sweep on its own — that's the high-value fix.

@kedarvartak
Copy link
Copy Markdown
Author

@safishamsi hi, sorry for the reiterations, all callsites are now migrated to file stem. I'm currently working on the rationale refactor. pls check

@safishamsi
Copy link
Copy Markdown
Owner

safishamsi commented May 28, 2026 via email

safishamsi added a commit that referenced this pull request May 31, 2026
AST file nodes were ID'd from the full relative path plus extension
(match_script_pipeline_step_py) while semantic subagents follow the
{parent_dir}_{stem} spec (script_pipeline_step), so every file split
into two disconnected ghost nodes.

Fix at the single remap chokepoint in extract(): file node IDs and all
edge endpoints already funnel through the #502 relative-path remap, so
changing that remap to emit _file_node_id (one parent dir, no extension)
converts the node and every referencing edge together - Python, TS, Lua,
C and bash import edges all stay connected. symbol_resolution pre-computes
the canonical form directly (bypassing the remap) so it is synced too.

Per-site conversion (as attempted in #1038/#1065) orphans edges because
it moves the node without the edge targets; the chokepoint approach
avoids that entirely.

Backward compat for existing graphs: graphify extract --force, as the
skill.md spec already documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@safishamsi
Copy link
Copy Markdown
Owner

Thanks for this @kedarvartak - this PR did more than #1065 (the backward-compat normalizer and the bash/symbol_resolution sync were the right instincts). I ended up shipping a different fix for #1033 in c898dc6 and am closing this one; explaining the reasoning since the effort was real.

On the ID fix: like #1065, this converts the per-extractor sites individually. File IDs are also import/dependency edge targets across many resolvers, so per-site conversion risks moving a node without its edges. This PR still misses _resolve_lua_import_target (extract.py:2025/2029), so Lua require edges would orphan. The approach I took changes the single remap chokepoint in extract() instead - every file node and edge endpoint already funnels through the #502 relative-path remap, so one change converts node + all edges atomically across every language (verified Python/TS/Lua/C/bash all stay connected).

On the build.py extension-suffix normalizer: I left this out deliberately. Stripping _py/_ts/_js suffixes from existing graphs is fragile - it can clip legitimate symbol IDs (format_c, test_py, a function literally named parse_js). Your _looks_like_file_node guard mitigated it, but the project's documented migration path is already graphify extract --force, which is unambiguous. I'd rather not carry a heuristic that can mis-fire on symbol IDs.

On the rationale inlining (sub-issue #2): file_type == "rationale" nodes with rationale_for edges are intentional and have dedicated tests; the cross-file resolver and god-node ranking explicitly special-case them. Changing them to attributes is a separate design call, not part of the ID-mismatch bug. Worth its own issue if there's a case for it.

Genuinely appreciate the thoroughness here - closing only because the chokepoint approach is the smaller, safer surface area for the core fix.

@safishamsi safishamsi closed this May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants