HTMLTrust Canonicalization

Canonical text normalization for the HTMLTrust content signing framework. Produces a stable, deterministic text representation so that the same content always hashes to the same value — regardless of which CMS, editor, or platform produced it.

All implementations follow the same specification and pass the same verification test suite.

Why Canonicalization?

Content management systems silently transform text in ways that break naive hashing:

WordPress converts "straight quotes" to "curly quotes"
Google Docs converts -- to em dashes —
Rich text editors swap ... for the ellipsis character …
Copy-paste introduces invisible Unicode characters (ZWSP, BOM, bidi marks)
CJK editors interchange fullwidth and halfwidth forms

Without canonicalization, the same authored content produces different hashes depending on which tool touched it last. This library normalizes all of these variations to a single canonical form.

Implementations

Language	Path	Dependencies	Usage
JavaScript	`javascript/`	None (browser + Node.js)	Browser extension, Hugo signing script
Go	`go/`	`golang.org/x/text` (NFKC)	Hugo module
PHP	`php/`	`ext-intl`, `ext-mbstring`	WordPress plugin

All implementations produce identical output for the same input.

The 8 Phases

Phase	What It Does
1. NFKC	Unicode NFKC normalization — handles ligatures, fullwidth/halfwidth, presentation forms, superscripts, CJK compatibility, Jamo composition
2. Whitespace	All Unicode whitespace (30+ characters) → ASCII space; collapse runs; trim
3. Quotation Marks	Curly quotes, guillemets, CJK corner brackets → ASCII straight quotes
4. Dashes	En dash, em dash, figure dash, non-breaking hyphen → ASCII hyphen-minus
5. Punctuation	Ellipsis `…` → `...`; minus sign → hyphen-minus
6. Strip Invisibles	Remove soft hyphens, zero-width spaces, BOM, variation selectors, bidi controls, Arabic tatweel
7. Bidi	Remove all bidi control characters (rely on HTML `dir` attribute instead)
8. Language-Specific	Preserve ZWNJ (semantic in Persian/Kurdish), ZWJ (semantic in Indic/emoji), Arabic diacritics, Hebrew nikud

Quick Start

JavaScript (Browser / Node.js)

import { normalizeText } from '@htmltrust/canonicalization';

const canonical = normalizeText('He said, \u201CHello\u2026\u201D');
// → 'He said, "Hello..."'

Go

import "github.com/HTMLTrust/htmltrust-canonicalization/go"

canonical := canonicalize.Normalize("He said, \u201CHello\u2026\u201D")
// → "He said, \"Hello...\""

PHP

use HTMLTrust\Canonicalization\Canonicalize;

$canonical = Canonicalize::normalize("He said, \u{201C}Hello\u{2026}\u{201D}");
// → 'He said, "Hello..."'

Verification Checklist

All implementations must produce identical output for these test pairs:

Input A	Input B	Same After Normalization?
`"Hello"` (curly quotes)	`"Hello"` (straight)	✅ Yes
`café` (precomposed)	`café` (combining)	✅ Yes
`ﬁnd` (fi ligature)	`find`	✅ Yes
`word — word` (em dash)	`word - word`	✅ Yes
`«Bonjour»` (guillemets)	`"Bonjour"`	✅ Yes
`「東京」` (CJK brackets)	`"東京"`	✅ Yes
`می‌خواهم` (with ZWNJ)	`میخواهم` (without)	❌ No — ZWNJ is semantic
`كتـــاب` (with tatweel)	`كتاب`	✅ Yes
`Ａ１` (fullwidth)	`A1`	✅ Yes
`wordword` (with ZWSP)	`wordword`	✅ Yes
`word‌word` (with ZWNJ)	`wordword`	❌ No — ZWNJ is semantic

Running Tests

# JavaScript
cd javascript && node test.js

# Go
cd go && go test -v ./...

# PHP
cd php && composer install && composer test

Companion Repositories

Repository	Description
htmltrust-spec	The HTMLTrust specification and paper
htmltrust-server-reference	Reference trust directory API server
htmltrust-browser-reference	Reference browser extension
htmltrust-cms-reference	Reference CMS plugins (WordPress, Hugo)
htmltrust-website	Project website

License

This project is licensed under the PolyForm Noncommercial License 1.0.0. You may use, modify, and share the software for any noncommercial purpose with attribution. Commercial use requires a separate agreement with the licensor.

Origin & Contributions

HTMLTrust is an idea I (Jason Grey) have been chewing on since 2024. I'm not an academic — I'm an engineer with a day job and a family — so the spec, the reference implementations, and most of this prose have been written with significant help from AI tools acting as research assistant, technical writer, and pair programmer. I wrote the original architectural sketches and reviewed every line; the assistants filled in the gaps and saved me from re-typing the same explanation for the hundredth time.

Contributions are welcome — human or AI-assisted, doesn't matter to me. What matters is whether the code, the spec text, or the conformance vectors move the project forward. Open a PR.

What this project is not a forum for:

Debates about whether AI should be used to write code or specifications.
Opinions on who is or isn't trustworthy on the web.
Politics, religion, professional practice, or personal philosophy.

HTMLTrust is a mechanism — a way for anyone to sign content they publish and for anyone to decide whom they trust, on their own terms. The project takes no position on what the right answers are; it just provides the tools. If you want to debate the answers, there are entire continents of the internet better suited to it.

If this work is useful to you and you'd like to support it, see GitHub Sponsors or the other channels in .github/FUNDING.yml.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.devcontainer		.devcontainer
.github		.github
conformance		conformance
go		go
javascript		javascript
php		php
python		python
rust		rust
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
composer.json		composer.json
package.json		package.json
spec.md		spec.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HTMLTrust Canonicalization

Why Canonicalization?

Implementations

The 8 Phases

Quick Start

JavaScript (Browser / Node.js)

Go

PHP

Verification Checklist

Running Tests

Companion Repositories

License

Origin & Contributions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

HTMLTrust Canonicalization

Why Canonicalization?

Implementations

The 8 Phases

Quick Start

JavaScript (Browser / Node.js)

Go

PHP

Verification Checklist

Running Tests

Companion Repositories

License

Origin & Contributions

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages