feat: extract markdown links as first-class edges (fixes #951) #1066

Open
adityachaudhary99 wants to merge 6 commits into
safishamsi:v8 from
adityachaudhary99:feat/markdown-link-edges

@adityachaudhary99
Contributor

Problem

Markdown documents often contain rich cross-references through links and [[wikilinks]], but graphify ignored these as first-class graph relationships.

What changed

  • _resolve_markdown_link(): resolves [text](path) targets; handles anchors, extensionless references, external URL filtering, and title attribute stripping
  • _resolve_markdown_wikilink(): resolves [[Page Name]] to existing .md files
  • extract_markdown(): emits links_to edges with EXTRACTED confidence and the link text as edge context
  • Edge cases handled: ignores image links, excludes links inside fenced code blocks, strips [[page|display text]] pipe syntax
  • links_to added to SEMANTIC_RELATIONS
  • Markdown file-level IDs now use _file_node_id
  • 8 new tests covering normal links, wikilinks, pipe syntax, images, code blocks, external URLs, title attributes
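
To make the resolution rules above concrete, here is a minimal sketch of how a link target and a wikilink might be resolved. It is not the code from this PR: `resolve_link`, `resolve_wikilink`, the POSIX-style string paths, the `known_files` set, and the case-insensitive stem match for wikilinks are all assumptions made for illustration.

```python
import posixpath
import re
from typing import Optional

# Assumption: anything with a URL scheme (https:, mailto:, ...) or a
# protocol-relative prefix (//) counts as external.
_EXTERNAL = re.compile(r"^(?:[a-zA-Z][a-zA-Z0-9+.\-]*:|//)")
# Optional trailing title attribute, e.g. guide.md "Guide"
_TITLE = re.compile(r"""\s+(?:"[^"]*"|'[^']*')\s*$""")


def resolve_link(raw_target: str, source_file: str, known_files: set[str]) -> Optional[str]:
    """Hypothetical stand-in for _resolve_markdown_link()."""
    target = _TITLE.sub("", raw_target.strip())      # strip the title attribute
    if not target or _EXTERNAL.match(target):        # filter external URLs
        return None
    path = target.split("#", 1)[0]                   # drop the anchor
    if not path:                                     # pure in-page anchor
        return None
    # Resolve relative to the directory of the linking file.
    candidate = posixpath.normpath(
        posixpath.join(posixpath.dirname(source_file), path)
    )
    if candidate in known_files:
        return candidate
    if candidate + ".md" in known_files:             # extensionless reference
        return candidate + ".md"
    return None


def resolve_wikilink(inner: str, known_files: set[str]) -> Optional[str]:
    """Hypothetical stand-in for _resolve_markdown_wikilink()."""
    page = inner.split("|", 1)[0].strip()            # strip [[page|display text]]
    if not page:
        return None
    wanted = page.lower()
    for path in sorted(known_files):                 # sorted for determinism
        if path.endswith(".md") and posixpath.basename(path)[:-3].lower() == wanted:
            return path
    return None
```

With `known_files = {"docs/guide.md", "README.md"}`, a call such as `resolve_link('guide.md#setup "Guide"', "docs/intro.md", known_files)` would resolve to `docs/guide.md` under these assumptions, while external URLs and bare anchors resolve to `None`.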

Depends on

- Add _file_node_id(path) helper that returns _make_id(_file_stem(path))
- Use _file_node_id for all file-level node IDs instead of _make_id(str(path))
- Update all import resolution targets to reference _file_node_id format
- Update extract() legacy remap to handle both old formats
- Update tests to use _file_node_id

This ensures AST and semantic subagent nodes for the same file use
identical node IDs (parent_dir_stem), fixing the split-node bug where
one physical file appeared as two disconnected nodes (safishamsi#1033).
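
The helper described above can be sketched as follows. Only the composition `_make_id(_file_stem(path))` and the `parent_dir_stem` ID shape come from this description; the bodies of `make_id` and `file_stem` below are hypothetical stand-ins, since the real normalization rules are not shown here.

```python
import re
from pathlib import PurePosixPath


def make_id(*parts: str) -> str:
    """Hypothetical normalizer: lowercase, non-alphanumerics become underscores."""
    joined = "_".join(p for p in parts if p)
    return re.sub(r"[^a-z0-9]+", "_", joined.lower()).strip("_")


def file_stem(path: str) -> str:
    """Assumed shape: '<parent dir>_<file stem>', or just the stem at the root."""
    p = PurePosixPath(path)
    parent = p.parent.name
    return f"{parent}_{p.stem}" if parent else p.stem


def file_node_id(path: str) -> str:
    """Mirrors the described helper: _make_id(_file_stem(path))."""
    return make_id(file_stem(path))


# Two producers that see the same file through different path spellings
# agree on the ID, whereas hashing the full path string would not.
relative = "src/utils/helpers.py"
absolute = "/repo/src/utils/helpers.py"
same_id = file_node_id(relative) == file_node_id(absolute)
old_style_differs = make_id(relative) != make_id(absolute)
```

In this sketch both spellings map to `utils_helpers`, which is the property that keeps AST and semantic nodes for one file from splitting.
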
- ensure_named_node now always uses stem-qualified IDs
- Same fix for superclass/inheritance resolution in walk()
- Same fix for C#, Swift, C++, Java base type fallbacks
- Removes bare-name fallback that caused cross-file collisions

Previously, _make_id(name) (bare, no stem) was used as fallback when
_make_id(stem, name) was not in the per-file seen_ids set, causing
identically-named entities in different files to produce colliding IDs.
This caused the second entity's node to overwrite the first in the
NetworkX graph, losing one entity entirely (safishamsi#952).
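
The collision described above can be reproduced in a few lines. This is an illustrative sketch only: a plain dict stands in for the graph's node store (re-adding an existing key replaces its value), and `make_id` and `build_nodes` are hypothetical names, not the project's implementation.

```python
import re


def make_id(*parts: str) -> str:
    """Hypothetical normalizer: lowercase, non-alphanumerics become underscores."""
    joined = "_".join(p for p in parts if p)
    return re.sub(r"[^a-z0-9]+", "_", joined.lower()).strip("_")


def build_nodes(entities, qualify: bool) -> dict:
    """Index entities by ID; a repeated ID silently replaces the earlier entry."""
    nodes = {}
    for stem, name, source_file in entities:
        node_id = make_id(stem, name) if qualify else make_id(name)
        nodes[node_id] = {"label": name, "source_file": source_file}
    return nodes


# Two classes with the same name defined in different files.
entities = [
    ("a_models", "User", "a/models.py"),
    ("b_models", "User", "b/models.py"),
]

bare = build_nodes(entities, qualify=False)       # old bare-name fallback
qualified = build_nodes(entities, qualify=True)   # stem-qualified IDs
```

Here `bare` ends up with a single `user` entry pointing at `b/models.py`, while `qualified` keeps both `a_models_user` and `b_models_user`.
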
- Add _resolve_markdown_link() for [text](path) resolution
- Add _resolve_markdown_wikilink() for [[page-name]] resolution
- Extract links_to edges in extract_markdown() for all resolvable links
- Skip external URLs, anchors, and unresolvable paths
- Store the link text/name as edge context metadata

This adds a deterministic pre-pass that captures human-authored
inter-document links as high-confidence edges, dramatically reducing
isolated nodes in documentation-heavy corpora (safishamsi#951).
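
A rough sketch of such a pre-pass is shown below, assuming a line-by-line scan. The regexes, the `iter_links` and `links_to_edges` names, the edge dict keys, and the choice to use a wikilink's display text as context are assumptions for illustration; only the `links_to` relation, the `EXTRACTED` confidence, and the skip rules come from the description above. Target resolution is passed in as a callable so the sketch stays independent of how paths map to node IDs.

```python
import re
from typing import Callable, Iterator, Optional

_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)[^)]*\)")  # (?<!!) skips images
_WIKI = re.compile(r"\[\[([^\]]+)\]\]")
_FENCE = re.compile(r"^\s*(```|~~~)")


def iter_links(text: str) -> Iterator[tuple[str, str, str]]:
    """Yield (kind, label, target) for links outside fenced code blocks."""
    in_fence = False
    for line in text.splitlines():
        if _FENCE.match(line):
            in_fence = not in_fence       # toggle on opening/closing fence
            continue
        if in_fence:
            continue
        for m in _WIKI.finditer(line):
            page, _, display = m.group(1).partition("|")
            yield ("wikilink", (display or page).strip(), page.strip())
        for m in _LINK.finditer(line):
            yield ("link", m.group(1), m.group(2))


def links_to_edges(
    source_id: str,
    text: str,
    resolve: Callable[[str, str], Optional[str]],
) -> list[dict]:
    """Build links_to edges for every link that resolve() maps to a node ID."""
    edges = []
    for kind, label, target in iter_links(text):
        target_id = resolve(kind, target)
        if target_id is None:             # external, anchor-only, or unresolvable
            continue
        edges.append({
            "source": source_id,
            "target": target_id,
            "relation": "links_to",
            "confidence": "EXTRACTED",
            "context": label,             # link text kept as edge metadata
        })
    return edges
```

Because the pass is purely textual, the same document always yields the same edges, which is what makes it usable as a deterministic step ahead of any semantic extraction.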
@adityachaudhary99 adityachaudhary99 force-pushed the feat/markdown-link-edges branch from f06e859 to 3a2f0c5 Compare May 28, 2026 17:28