Motivation
Hand-authored single-page HTML documents — design briefs, HLDs, READMEs rendered to HTML, internal architecture pages — frequently mix substantial prose content with a small inline <script> block driving navigation/search/scroll-tracking, plus a <style> block for layout.
Today CRG sees these as opaque files (a single File node, nothing else). The inline JS functions are invisible to semantic_search, and there's no way to ask "where is section §17 in this 6 KLOC doc" without falling back to grep.
This proposal adds first-class HTML support that's symmetric to the existing Vue/Svelte SFC handling.
Proposal
Two extraction concerns, both already proven by _parse_vue / _parse_svelte:
1. Inline <script> blocks → delegate to JS parser
Identical pattern to _parse_vue: walk the HTML tree, find script_element nodes, parse their raw_text with the JavaScript grammar, and propagate the resulting Function / Class / CALLS nodes and edges, with line_start / line_end offset to their positions within the .html file.
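The line-offset step can be illustrated with a minimal sketch. It does not use CRG's tree-sitter pipeline or _parse_vue. As a stand-in, it assumes Python's standard-library html.parser for locating inline script text and a naive regular expression for spotting `function name(` declarations. ScriptCollector and find_inline_functions are hypothetical names, and only line_start is computed here.

```python
import re
from html.parser import HTMLParser

# Naive stand-in for a real JavaScript grammar: matches `function name(` only.
FUNC_RE = re.compile(r"^\s*function\s+([A-Za-z_$][\w$]*)\s*\(", re.MULTILINE)


class ScriptCollector(HTMLParser):
    """Hypothetical helper: records each inline script chunk with its line."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # (line of the chunk's first character, text)
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            # getpos() reports where this chunk starts in the .html file.
            self.chunks.append((self.getpos()[0], data))


def find_inline_functions(html):
    """Return (name, line_start) pairs with lines relative to the .html file."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for first_line, text in collector.chunks:
        for match in FUNC_RE.finditer(text):
            # Offset: newlines inside the chunk before the function name.
            offset = text.count("\n", 0, match.start(1))
            found.append((match.group(1), first_line + offset))
    return found


SAMPLE = """<html>
<body>
<script>
function initNav() {
  return 1;
}

function runSearch(q) {
  return q;
}
</script>
</body>
</html>"""

print(find_inline_functions(SAMPLE))
```

For SAMPLE this prints `[('initNav', 4), ('runSearch', 8)]`: the reported lines are positions in the .html file, not positions inside the script text. A real implementation would take the same offset from the script node's start position and apply it to every node the JavaScript grammar returns.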
2. Id-anchored elements → new DocSection node kind
For each <div id="…"> / <section id="…"> / <article id="…">, emit a DocSection node with the element's line range and html_id in extra. This is the navigation primitive a sidebar typically links to — making it searchable lets queries like "where does the cover section start" or "which page has the search-tabs anchor" resolve in one semantic_search call.
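The id-anchored extraction can be sketched in the same spirit. This is an illustration under stated assumptions, not the code on the branch: it uses the standard-library html.parser in place of the tree-sitter HTML grammar, and the DocSection dataclass, SectionCollector, and collect_sections are hypothetical names whose shape only approximates "line range plus html_id in extra".

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tag whitelist taken from the proposal text.
SECTION_TAGS = {"div", "section", "article"}


@dataclass
class DocSection:
    """Hypothetical node shape; CRG's real node class may differ."""
    name: str
    line_start: int
    line_end: int = 0  # stays 0 if the element is never closed
    extra: dict = field(default_factory=dict)


class SectionCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sections = []
        self._open = []  # stack of (tag, DocSection or None) for open elements

    def handle_starttag(self, tag, attrs):
        if tag not in SECTION_TAGS:
            return
        html_id = dict(attrs).get("id")
        section = None
        if html_id:
            # Only elements with a non-empty id become DocSection nodes.
            section = DocSection(
                html_id, self.getpos()[0], extra={"html_id": html_id, "tag": tag}
            )
            self.sections.append(section)
        # Elements without an id are still pushed so nesting stays balanced.
        self._open.append((tag, section))

    def handle_endtag(self, tag):
        if tag not in SECTION_TAGS:
            return
        line = self.getpos()[0]
        # Close the nearest open element with the same tag name.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                for _, section in self._open[i:]:
                    if section is not None:
                        section.line_end = line
                del self._open[i:]
                break


def collect_sections(html):
    collector = SectionCollector()
    collector.feed(html)
    collector.close()
    return collector.sections


SAMPLE = """<html>
<body>
<div id="cover">
  <p>Title</p>
</div>
<div class="plain">
  <section id="s1">
    <h2>One</h2>
  </section>
</div>
</body>
</html>"""

for s in collect_sections(SAMPLE):
    print(s.name, s.line_start, s.line_end)
```

With SAMPLE, the sketch reports cover on lines 3 to 5 and s1 on lines 7 to 9, and skips the plain div that has no id. Widening or narrowing the whitelist is a one-line change to SECTION_TAGS.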
Deliberately out of scope
- <style> block extraction. No CSS grammar is plumbed into CRG yet, and a line-range-only stub would have negligible value. Easy follow-up if/when a CSS extractor lands.
- Heading-anchor sections (<h2 id="…">). Common in GitHub-rendered READMEs, but they pollute the index when nested inside <div id="…"> doc sections. Could be added as an opt-in if there's demand.
Reference implementation
Branch: feat/html-parser-support on my fork.
3-file diff, +298 LOC:
- code_review_graph/parser.py: 2 lines added to EXTENSION_TO_LANGUAGE, 3-line dispatch in parse_bytes, ~140-line _parse_html method modeled on _parse_vue.
- tests/test_parser.py: 7 new tests in the existing TestCodeParser class.
- tests/fixtures/sample_html.html: one fixture exercising every code path.
Tests (all pass; full test_parser.py 110/110):
- test_detect_language_html (.html and .htm)
- test_parse_html_file: File node + Function + Class extracted from <script>
- test_parse_html_doc_sections: #cover, #s1, #s2, #appendix emitted
- test_parse_html_line_numbers_offset: script line numbers reflect .html position
- test_parse_html_nodes_have_html_language: every node tagged language="html"
- test_parse_html_no_script: DocSection emitted even without any script
- test_parse_html_only_id_anchored_sections: plain <div> without id excluded
Field-validated locally: indexing a 6 KLOC HTML file (an HLD) and a 3 KLOC reference HTML file (a design brief) produced 60+ JS functions with accurate line ranges and 39 DocSections, all surfaced by semantic_search queries.
Questions for the maintainer before opening a PR
- Node-kind naming. DocSection is intentionally generic and could later cover Markdown headings or other doc formats. Would you prefer HTMLSection to keep it format-scoped? Or skip the new kind entirely and reuse Class / File?
- Tag whitelist for sections. Currently div / section / article with id. Worth widening to any element with id? Or narrowing to just section / article?
- <style> handling. Drop entirely (current proposal), emit a line-range-only Stylesheet stub, or wait until a CSS extractor lands?
- Heading-anchor sections (<h2 id="x">). Opt-in flag, always-on, or out of scope?
Happy to adjust the patch to match your preferences before submitting a PR. Local CRG using this patch has been running for ~1 day with zero regressions across 869 indexed files.