Encode and decode WTF-8 with an API similar to TextEncoder and TextDecoder.
The goal is to be able to parse and generate bytestreams that can store any JavaScript string, including ones that have unpaired surrogates.
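To see why plain UTF-8 is not enough, here is a minimal sketch comparing the built-in TextEncoder with a hand-rolled WTF-8 encoder. `encodeWtf8Sketch` is a hypothetical helper written only for illustration. It is not part of this package and makes no claim about how the package is implemented.

```js
// Hypothetical helper for illustration only; not the @cto.af/wtf8 API.
function encodeWtf8Sketch(str) {
  const out = [];
  for (let i = 0; i < str.length;) {
    // codePointAt combines well-formed surrogate pairs and returns an
    // unpaired surrogate as-is (a value in 0xD800..0xDFFF).
    const cp = str.codePointAt(i);
    i += cp > 0xffff ? 2 : 1;
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // Unpaired surrogates land here and are encoded like any other
      // three-byte code point, which UTF-8 proper forbids.
      out.push(
        0xe0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return Uint8Array.from(out);
}

// TextEncoder replaces the unpaired surrogate with U+FFFD, losing information.
const lossy = new TextEncoder().encode('\ud800'); // bytes EF BF BD
// WTF-8 keeps the surrogate, so the original string can be recovered.
const wtf8 = encodeWtf8Sketch('\ud800'); // bytes ED A0 80
```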
npm install @cto.af/wtf8
Full API documentation is available.
Example:
import {Wtf8Decoder, Wtf8Encoder} from '@cto.af/wtf8';
const bytes = new Wtf8Encoder().encode('\ud800');
const string = new Wtf8Decoder().decode(bytes); // '\ud800'
W3C streams are also provided: Wtf8EncoderStream and Wtf8DecoderStream.
This implementation uses a few of the tricks from the paper "Validating UTF-8 In Less Than One Instruction Per Byte", but not all of them. Moving data in and out of WASM in order to use SIMD might be slightly faster, but because the code does not merely validate but actually decodes (and generates replacement characters when fatal is false), staying in JS seems good enough for the moment.
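To illustrate the difference between validating and decoding, here is a deliberately naive decoder sketch in plain JS. `decodeWtf8Sketch` and its `fatal` option are hypothetical and written only for this example. It uses none of the paper's tricks and is not the package's implementation. As simplifying assumptions, it emits one U+FFFD per offending byte and does not reject a surrogate pair that was encoded as two three-byte sequences, which a strict WTF-8 decoder would.

```js
// Hypothetical sketch for illustration only; not the @cto.af/wtf8 implementation.
function decodeWtf8Sketch(bytes, {fatal = false} = {}) {
  let out = '';
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    let need = -1; // Number of continuation bytes; -1 means invalid lead byte
    let cp = 0;    // Code point accumulated so far
    let min = 0;   // Smallest code point allowed for this length (no overlongs)
    if (b0 < 0x80) {
      need = 0; cp = b0;
    } else if (b0 >= 0xc2 && b0 <= 0xdf) {
      need = 1; cp = b0 & 0x1f; min = 0x80;
    } else if (b0 >= 0xe0 && b0 <= 0xef) {
      // Unlike UTF-8, the surrogate range (ED A0..BF xx) is accepted here.
      need = 2; cp = b0 & 0x0f; min = 0x800;
    } else if (b0 >= 0xf0 && b0 <= 0xf4) {
      need = 3; cp = b0 & 0x07; min = 0x10000;
    }

    // The sequence must fit in the input and use only continuation bytes.
    let ok = need >= 0 && i + need < bytes.length;
    for (let k = 1; ok && k <= need; k++) {
      const b = bytes[i + k];
      if ((b & 0xc0) === 0x80) {
        cp = (cp << 6) | (b & 0x3f);
      } else {
        ok = false;
      }
    }
    if (ok && (cp < min || cp > 0x10ffff)) {
      ok = false;
    }

    if (!ok) {
      if (fatal) {
        throw new TypeError(`Invalid WTF-8 at byte ${i}`);
      }
      // A validator could stop here; a decoder has to produce output.
      out += '\ufffd';
      i += 1;
      continue;
    }
    out += String.fromCodePoint(cp);
    i += need + 1;
  }
  return out;
}

const surrogate = decodeWtf8Sketch(Uint8Array.of(0xed, 0xa0, 0x80)); // '\ud800'
const replaced = decodeWtf8Sketch(Uint8Array.of(0x61, 0xff, 0x62)); // 'a\ufffdb'
```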