Add HTML parser: extract inline <script> functions + id-anchored DocSection nodes #521

@dannymn

Motivation

Hand-authored single-page HTML documents — design briefs, HLDs, READMEs rendered to HTML, internal architecture pages — frequently mix substantial prose content with a small inline <script> block driving navigation/search/scroll-tracking, plus a <style> block for layout.

Today CRG treats these files as opaque: each becomes a single File node and nothing else. The inline JS functions are invisible to semantic_search, and there's no way to ask "where is §17 in this 6 KLOC doc" without falling back to grep.

This proposal adds first-class HTML support that's symmetric to the existing Vue/Svelte SFC handling.

Proposal

Two extraction concerns, both already proven by _parse_vue / _parse_svelte:

1. Inline <script> blocks → delegate to JS parser

Identical pattern to _parse_vue: walk the HTML tree, find script_element nodes, parse their raw_text with the JavaScript grammar, and propagate the resulting Function / Class nodes and CALLS edges, with line_start / line_end offset so they reflect positions within the .html file rather than within the script body.
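The line-offset step can be illustrated with a minimal, stdlib-only sketch. It is not the patch's code: regular expressions stand in for the tree-sitter HTML and JavaScript grammars, and the name script_function_lines is hypothetical. It only shows how a position inside a script body maps back to a line in the enclosing .html file.

```python
import re
from typing import List, Tuple

# Assumption: regexes stand in for the real HTML and JavaScript grammars.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script\s*>", re.DOTALL | re.IGNORECASE)
FUNCTION_RE = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\(")


def script_function_lines(html: str) -> List[Tuple[str, int]]:
    """Return (function_name, 1-based line in the .html file) pairs."""
    found = []
    for script in SCRIPT_RE.finditer(html):
        body = script.group(1)
        # 1-based line of the .html file on which the script body begins.
        body_first_line = html.count("\n", 0, script.start(1)) + 1
        for fn in FUNCTION_RE.finditer(body):
            # Number of newlines between the body start and this function.
            line_in_body = body.count("\n", 0, fn.start())
            found.append((fn.group(1), body_first_line + line_in_body))
    return found
```

The same arithmetic applies to line_end: whatever line the JS parser reports relative to the script body, adding the body's starting line in the .html file gives the position a reader would jump to.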

2. Id-anchored elements → new DocSection node kind

For each <div id="…"> / <section id="…"> / <article id="…">, emit a DocSection node with the element's line range and html_id in extra. This is the navigation primitive a sidebar typically links to — making it searchable lets queries like "where does the cover section start" or "which page has the search-tabs anchor" resolve in one semantic_search call.
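As a rough illustration of the extraction rule, here is a sketch built on Python's stdlib html.parser instead of the tree-sitter HTML grammar the patch uses. The DocSection dataclass, DocSectionCollector, and extract_doc_sections are hypothetical names for this example, and the sketch assumes reasonably well-formed markup.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

SECTION_TAGS = {"div", "section", "article"}


@dataclass
class DocSection:
    html_id: str
    tag: str
    line_start: int
    line_end: int


class DocSectionCollector(HTMLParser):
    """Collects id-anchored div/section/article elements with line ranges."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._open = []  # stack of (tag, id or None, start line)

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_TAGS:
            # getpos() returns (line, column) of the tag being handled.
            self._open.append((tag, dict(attrs).get("id"), self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag not in SECTION_TAGS:
            return
        # Close the innermost open element with the same tag name.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, html_id, start = self._open[i]
                del self._open[i:]
                if html_id:  # elements without an id are skipped
                    self.sections.append(
                        DocSection(html_id, tag, start, self.getpos()[0])
                    )
                return


def extract_doc_sections(html: str):
    collector = DocSectionCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.sections, key=lambda s: s.line_start)
```

Nested anchored elements each get their own entry here, so a div with an id inside a section with an id yields two DocSections with overlapping line ranges.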

Deliberately out of scope

  • <style> block extraction. No CSS grammar plumbed in CRG yet; a line-range-only stub would have negligible value. Easy follow-up if/when a CSS extractor lands.
  • Heading-anchor sections (<h2 id="…">). Common in GitHub-rendered READMEs, but they pollute the index when nested inside <div id="…"> doc sections. Could be added behind an opt-in if there's demand.

Reference implementation

Branch: feat/html-parser-support on my fork.

3-file diff, +298 LOC:

  • code_review_graph/parser.py: 2 lines added to EXTENSION_TO_LANGUAGE, 3-line dispatch in parse_bytes, ~140-line _parse_html method modeled on _parse_vue.
  • tests/test_parser.py: 7 new tests in the existing TestCodeParser class.
  • tests/fixtures/sample_html.html: one fixture exercising every code path.
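For readers unfamiliar with the dispatch shape, the extension mapping might look roughly like the sketch below. The table contents and the helper detect_language_sketch are assumptions for illustration only; the real EXTENSION_TO_LANGUAGE table and parse_bytes dispatch live in code_review_graph/parser.py and may differ.

```python
from pathlib import Path
from typing import Optional

# Hypothetical excerpt: only the two HTML entries correspond to this proposal.
EXTENSION_TO_LANGUAGE = {
    ".vue": "vue",
    ".svelte": "svelte",
    ".html": "html",  # proposed
    ".htm": "html",   # proposed
}


def detect_language_sketch(path: str) -> Optional[str]:
    """Map a file path to a language key, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
```

A parse_bytes-style dispatcher would then branch on the returned key and route "html" to the new _parse_html method, in the same way "vue" is routed to _parse_vue.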

Tests (all pass; full test_parser.py 110/110):

  • test_detect_language_html (.html and .htm)
  • test_parse_html_file — File node + Function + Class extracted from <script>
  • test_parse_html_doc_sections (#cover, #s1, #s2, #appendix emitted)
  • test_parse_html_line_numbers_offset — script line numbers reflect .html position
  • test_parse_html_nodes_have_html_language — every node tagged language="html"
  • test_parse_html_no_script — DocSection emitted even without any script
  • test_parse_html_only_id_anchored_sections — plain <div> without id excluded

Field-validated locally: indexed a 6 KLOC HTML file (an HLD) and a 3 KLOC reference HTML file (a design brief). The run found 60+ JS functions with accurate line ranges and 39 DocSections, all surfaced by semantic_search queries.

Questions for the maintainer before opening a PR

  1. Node-kind naming. DocSection is intentionally generic — could later cover Markdown headings or other doc formats. Would you prefer HTMLSection to keep it format-scoped? Or skip the new kind entirely and reuse Class / File?
  2. Tag whitelist for sections. Currently div / section / article with id. Worth widening to any element with id? Or narrowing to just section / article?
  3. <style> handling. Drop entirely (current proposal), emit a line-range-only Stylesheet stub, or wait until a CSS extractor lands?
  4. Heading-anchor sections (<h2 id="x">). Opt-in flag, always-on, or out of scope?

Happy to adjust the patch to match your preferences before submitting a PR. A local CRG install with this patch has been running for ~1 day with zero regressions across 869 indexed files.
