A clean, simple, lightweight HTML parser.
- follows the spec closely
- parse elements, fragments and complete html documents
- transform the parse tree and use the `isCommentNode`, `isTextNode`, `isElementNode` or the generic `isMNode` guards to branch on the different cases
- serialize nodes and fragments back to strings and optionally remove comments
- the parser supports HTML end tag omissions
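
As a rough illustration of the guard-based branching and comment removal listed above, here is a minimal, self-contained sketch. The node shapes (`SketchNode`, `kind`, `tagName`, `children`) and the `serialize` function are assumptions made for this example, and the guards are local stand-ins that only borrow the names from the list. HtmlCrunch's real types and serializers may look different.

```typescript
// Hypothetical node shapes for illustration only; not HtmlCrunch's actual types.
type CommentNode = { kind: "COMMENT"; text: string };
type TextNode = { kind: "TEXT"; text: string };
type ElementNode = { kind: "ELEMENT"; tagName: string; children: SketchNode[] };
type SketchNode = CommentNode | TextNode | ElementNode;

// Local stand-in guards: each one narrows the union to a single case.
const isCommentNode = (node: SketchNode): node is CommentNode =>
  node.kind === "COMMENT";
const isTextNode = (node: SketchNode): node is TextNode =>
  node.kind === "TEXT";

// Serialize a node back to a string, optionally dropping comments.
function serialize(
  node: SketchNode,
  options: { removeComments?: boolean } = {},
): string {
  if (isCommentNode(node)) {
    return options.removeComments ? "" : `<!--${node.text}-->`;
  }
  if (isTextNode(node)) return node.text;
  // Only the element case is left after the two guards above.
  const inner = node.children
    .map((child) => serialize(child, options))
    .join("");
  return `<${node.tagName}>${inner}</${node.tagName}>`;
}

const tree: SketchNode = {
  kind: "ELEMENT",
  tagName: "p",
  children: [
    { kind: "TEXT", text: "Hello" },
    { kind: "COMMENT", text: " note " },
  ],
};

const withComments = serialize(tree); // "<p>Hello<!-- note --></p>"
const withoutComments = serialize(tree, { removeComments: true }); // "<p>Hello</p>"
```
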
Depending on your package manager:
deno add jsr:@fcrozatier/htmlcrunch
pnpm i jsr:@fcrozatier/htmlcrunch
npx jsr add @fcrozatier/htmlcrunch
yarn add jsr:@fcrozatier/htmlcrunch
bunx jsr add @fcrozatier/htmlcrunch

import { fragments, serializeFragments } from "@fcrozatier/htmlcrunch";
import { assertEquals } from "@std/assert";
// A string of html or an html file
const content = `<div>html string...</div>`;
// Parse it with the `element`, `fragments` or `html` parsers
const parsed = fragments.parseOrThrow(content);
// Walk the parse tree, analyse and modify it ...
// Serialize the result with `serializeNode` or `serializeFragments`
const serialized = serializeFragments(parsed);
assertEquals(content, serialized);

HtmlCrunch implements the following parts of the HTML spec:
| spec | status |
|---|---|
| Structure | |
| - document structure | ✅ |
| - modern doctype | ✅ |
| Elements | |
| - self-closing void elements | ✅ |
| - raw text elements | ✅ |
| - foreign elements (MathML & SVG namespaces) | ✅ |
| - normal elements | ✅ |
| Attributes | |
| - Empty attribute syntax | ✅ |
| - Unquoted attribute value syntax | ✅ |
| - Single-quoted attribute value syntax | ✅ |
| - Double-quoted attribute value syntax | ✅ |
| Optional tags | |
| - end tag omission | ✅ |
| - start tag omission | 🚫 (not planned) |
| content model validation and restriction | |
| text | ✅ |
| CDATA sections | ✅ |
| comments | ✅ |
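
To make the four attribute syntaxes in the table concrete, here is a small regex-based sketch that recognizes each of them. The `ATTRIBUTE` pattern and the `parseAttributes` helper are hypothetical names for this example. This is not how HtmlCrunch parses attributes, and it ignores details such as character references.

```typescript
// One attribute: a name, optionally followed by `=` and a double-quoted,
// single-quoted or unquoted value. Illustrative only.
const ATTRIBUTE =
  /([^\s"'<>\/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))?/g;

// Returns [name, value] pairs; the empty attribute syntax yields "".
function parseAttributes(source: string): Array<[string, string]> {
  const attributes: Array<[string, string]> = [];
  for (const match of source.matchAll(ATTRIBUTE)) {
    const [, name, doubleQuoted, singleQuoted, unquoted] = match;
    attributes.push([name, doubleQuoted ?? singleQuoted ?? unquoted ?? ""]);
  }
  return attributes;
}

// Empty, unquoted, single-quoted and double-quoted syntaxes in one string.
const attributes = parseAttributes(`disabled value=yes class='a b' id="main"`);
// [["disabled", ""], ["value", "yes"], ["class", "a b"], ["id", "main"]]
```
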
In HTML, the end tags of `<li>`, `<dt>`, `<dd>`, `<p>` and `<option>` elements, as well as the end tags of the child elements of a `<table>`, can be omitted for a lighter authoring experience.
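
One way to picture these omission rules is as a lookup from an open element to the start tags that implicitly close it. The sketch below is an assumption-laden illustration, not HtmlCrunch's implementation: `IMPLICIT_CLOSERS` and `isImplicitlyClosedBy` are hypothetical names, the tag sets are a subset of the HTML spec's lists, the table rules are left out, and the other half of the rule (an element also closes when its parent has no more content) is not modelled.

```typescript
// Start tags that end an open `<p>` (a subset of the list in the HTML spec).
const P_CLOSERS: ReadonlySet<string> = new Set([
  "address", "article", "aside", "blockquote", "details", "div", "dl",
  "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3",
  "h4", "h5", "h6", "header", "hgroup", "hr", "main", "menu", "nav", "ol",
  "p", "pre", "section", "table", "ul",
]);

// Open element -> start tags that implicitly close it.
const IMPLICIT_CLOSERS = new Map<string, ReadonlySet<string>>([
  ["li", new Set(["li"])],
  ["dt", new Set(["dt", "dd"])],
  ["dd", new Set(["dd", "dt"])],
  ["p", P_CLOSERS],
  ["option", new Set(["option", "optgroup"])],
]);

// True when seeing `nextStartTag` means the open `openTag` element has ended.
function isImplicitlyClosedBy(openTag: string, nextStartTag: string): boolean {
  return IMPLICIT_CLOSERS.get(openTag)?.has(nextStartTag) ?? false;
}

isImplicitlyClosedBy("li", "li"); // true: a new item ends the previous one
isImplicitlyClosedBy("p", "div"); // true: a block element ends the paragraph
isImplicitlyClosedBy("p", "span"); // false: inline content stays inside the paragraph
```
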
import { element, serializeNode } from "@fcrozatier/htmlcrunch";
// Omit `<li>` end tags
element.parseOrThrow(
`<ul>
<li>Apples
<li>Bananas
</ul>`,
);
// Omit `<dt>` and `<dd>` end tags
element.parseOrThrow(
`<dl>
<dt>Coffee
<dd>Black hot drink
<dt>Milk
<dd>White cold drink
</dl>`,
);
// Omit `<p>` end tags
element.parseOrThrow(
`<body>
<p>This is the first paragraph.
<p>This is the second paragraph, and it ends when the next div begins.
<div>A block element</div>
</body>`,
);
// Omit `<option>` end tags
element.parseOrThrow(
`<select>
<option value="1">One
<option value="2">Two
<option value="3">Three
</select>`,
);
// Omit end tags inside a `<table>`
const table = element.parseOrThrow(
`<table>
<caption>37547 TEE Electric Powered Rail Car Train Functions (Abbreviated)
<colgroup><col><col><col>
<thead>
<tr> <th>Function <th>Control Unit <th>Central Station
<tbody>
<tr> <td>Headlights <td>✔ <td>✔
<tr> <td>Interior Lights <td>✔ <td>✔
<tr> <td>Electric locomotive operating sounds <td>✔ <td>✔
<tr> <td>Engineer's cab lighting <td> <td>✔
<tr> <td>Station Announcements - Swiss <td> <td>✔
</table>`,
);

The interactive documentation is available on JSR.
The `element`, `fragments`, `html` and `shadowRoot` parsers are Monarch parsers and can thus be composed and extended with other Monarch parsers. Their main methods are `parse` and `parseOrThrow`. See the Monarch documentation for the other available methods.
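
To illustrate the general idea of a result-returning `parse`, a throwing `parseOrThrow` and parser composition, here is a tiny self-contained sketch. Everything in it is hypothetical: `ParseResult`, `SketchParser`, `createParser`, `literal` and `sequence` are invented for this example and are not Monarch's API, whose result shape and combinators may differ. Refer to the Monarch documentation for the real signatures.

```typescript
// Hypothetical result and parser shapes; Monarch's real types may differ.
type ParseResult<T> =
  | { success: true; value: T; rest: string }
  | { success: false; message: string };

type SketchParser<T> = {
  parse(input: string): ParseResult<T>; // reports failure as a value
  parseOrThrow(input: string): T; // throws on failure
};

function createParser<T>(
  run: (input: string) => ParseResult<T>,
): SketchParser<T> {
  return {
    parse: run,
    parseOrThrow(input: string): T {
      const result = run(input);
      if (!result.success) throw new Error(result.message);
      return result.value;
    },
  };
}

// Matches an exact string at the start of the input.
function literal(expected: string): SketchParser<string> {
  return createParser<string>((input) =>
    input.startsWith(expected)
      ? { success: true, value: expected, rest: input.slice(expected.length) }
      : { success: false, message: `Expected "${expected}"` }
  );
}

// Composition: run `first`, then `second` on the remaining input.
function sequence<A, B>(
  first: SketchParser<A>,
  second: SketchParser<B>,
): SketchParser<[A, B]> {
  return createParser<[A, B]>((input) => {
    const a = first.parse(input);
    if (!a.success) return a;
    const b = second.parse(a.rest);
    if (!b.success) return b;
    return { success: true, value: [a.value, b.value], rest: b.rest };
  });
}

const openParagraph = sequence(literal("<"), literal("p"));
const value = openParagraph.parseOrThrow("<p>"); // ["<", "p"]
const failure = openParagraph.parse("<div>"); // { success: false, message: 'Expected "p"' }
```

With `parse`, the caller inspects the result and decides what to do on failure. With `parseOrThrow`, failure surfaces as an exception, which suits scripts and tests where invalid input should stop execution.
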
