fcrozatier/htmlcrunch

HTMLCrunch

A clean, simple, lightweight HTML parser.

Features

Getting Started

Depending on your package manager:

deno add jsr:@fcrozatier/htmlcrunch
pnpm i jsr:@fcrozatier/htmlcrunch
npx jsr add @fcrozatier/htmlcrunch
yarn add jsr:@fcrozatier/htmlcrunch
bunx jsr add @fcrozatier/htmlcrunch

Simple Example

import { fragments, serializeFragments } from "@fcrozatier/htmlcrunch";
import { assertEquals } from "@std/assert";

// A string of html or an html file
const content = `<div>html string...</div>`;

// Parse it with the `element`, `fragments` or `html` parsers
const parsed = fragments.parseOrThrow(content);

// Walk the parse tree, analyse and modify it ...

// Serialize the result with `serializeNode` or `serializeFragments`
const serialized = serializeFragments(parsed);

assertEquals(content, serialized);

Spec

HTMLCrunch implements the following parts of the HTML spec:

Spec: status
Structure
- Document structure
- Modern doctype
Elements
- Self-closing void elements
- Raw text elements
- Foreign elements (MathML & SVG namespaces)
- Normal elements
Attributes
- Empty attribute syntax
- Unquoted attribute value syntax
- Single-quoted attribute value syntax
- Double-quoted attribute value syntax
Optional tags
- End tag omission
- Start tag omission: 🚫 (not planned)
Content model validation and restriction: ⚠️ (not supported)
Text
CDATA sections
Comments

End Tag Omission

In HTML, the end tags of <li>, <dt>, <dd>, <p> and <option> elements, as well as the end tags of the child elements of <table>, can be omitted for a lighter authoring experience:

import { element, serializeNode } from "@fcrozatier/htmlcrunch";

// Omit `<li>` end tags
element.parseOrThrow(
  `<ul>
    <li>Apples
    <li>Bananas
  </ul>`,
);

// Omit `<dt>` and `<dd>` end tags
element.parseOrThrow(
  `<dl>
    <dt>Coffee
    <dd>Black hot drink
    <dt>Milk
    <dd>White cold drink
  </dl>`,
);

// Omit `<p>` end tags
element.parseOrThrow(
  `<body>
    <p>This is the first paragraph.
    <p>This is the second paragraph, and it ends when the next div begins.
    <div>A block element</div>
  </body>`,
);

// Omit `<option>` end tags
element.parseOrThrow(
  `<select>
    <option value="1">One
    <option value="2">Two
    <option value="3">Three
  </select>`,
);

// Omit end tags inside a `<table>`
const table = element.parseOrThrow(
  `<table>
  <caption>37547 TEE Electric Powered Rail Car Train Functions (Abbreviated)
  <colgroup><col><col><col>
  <thead>
   <tr> <th>Function                              <th>Control Unit     <th>Central Station
  <tbody>
   <tr> <td>Headlights                            <td>✔                <td>✔
   <tr> <td>Interior Lights                       <td>✔                <td>✔
   <tr> <td>Electric locomotive operating sounds  <td>✔                <td>✔
   <tr> <td>Engineer's cab lighting               <td>                 <td>✔
   <tr> <td>Station Announcements - Swiss         <td>                 <td>✔
  </table>`,
);
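
To make the rule behind these examples concrete, here is a minimal, self-contained sketch of how a parser might decide that an open element ends implicitly when the next start tag is seen. It is not HTMLCrunch's implementation: `IMPLIED_END` and `closesImplicitly` are hypothetical names, and the `<p>` entry lists only a subset of the elements that close a paragraph in the HTML spec.

```typescript
// Hypothetical sketch of end tag omission. This is NOT HTMLCrunch's code.
// Each entry maps an open element to the start tags that implicitly end it.
const IMPLIED_END = new Map<string, ReadonlySet<string>>([
  ["li", new Set(["li"])],
  ["dt", new Set(["dt", "dd"])],
  ["dd", new Set(["dt", "dd"])],
  // Partial list: the spec names more block-level elements than shown here.
  ["p", new Set(["p", "div", "ul", "ol", "dl", "table", "pre", "section"])],
  ["option", new Set(["option", "optgroup"])],
]);

/** Returns true when the start tag `next` implicitly ends the open element `open`. */
function closesImplicitly(open: string, next: string): boolean {
  return IMPLIED_END.get(open)?.has(next) ?? false;
}

// A second <li> ends the first one, and a <div> ends an open <p>.
console.log(closesImplicitly("li", "li"));
console.log(closesImplicitly("p", "div"));
// A <span> does not end an open <li>: it becomes a child instead.
console.log(closesImplicitly("li", "span"));
```

An element whose end tag is omitted also ends when its parent ends, which is why the last `<li>` in the first example needs no closing tag before `</ul>`.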

API

The interactive documentation is available on JSR.

The element, fragments, html and shadowRoot parsers are Monarch parsers and can thus be composed and extended with other Monarch parsers.

Their main methods are parse and parseOrThrow. See the Monarch documentation for the other available methods.
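
When a thrown error is inconvenient, `parseOrThrow` can be wrapped in a small helper that returns a result object instead. The sketch below assumes only that `parseOrThrow` takes a string and throws when the input does not match, as its name suggests. `ThrowingParser`, `Outcome`, `tryParse` and the stand-in `digits` parser are hypothetical names used for illustration and are not part of HTMLCrunch or Monarch.

```typescript
// Minimal structural type: anything that exposes `parseOrThrow`.
// Assumption: the method throws when the input cannot be parsed.
interface ThrowingParser<T> {
  parseOrThrow(input: string): T;
}

type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

/** Runs `parseOrThrow` and converts a thrown error into a failed outcome. */
function tryParse<T>(parser: ThrowingParser<T>, input: string): Outcome<T> {
  try {
    return { ok: true, value: parser.parseOrThrow(input) };
  } catch (error) {
    return { ok: false, error };
  }
}

// Stand-in parser, used only to keep this sketch self-contained.
const digits: ThrowingParser<number> = {
  parseOrThrow(input: string): number {
    if (!/^\d+$/.test(input)) {
      throw new Error(`Expected digits, got "${input}"`);
    }
    return Number(input);
  },
};

const good = tryParse(digits, "42");
const bad = tryParse(digits, "abc");
console.log(good.ok, bad.ok);
```

Under the same assumption, passing `fragments` or `element` in place of `digits` should work unchanged, since the helper relies only on the `parseOrThrow` method.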
