Handling of UTF-16 surrogate pairs / 2x 16-bit code units, for code points outside Unicode BMP (Basic Multilingual Plane)

This is not a bug report, but a reminder to check whether we are "doing the right thing" in our CFI generator / processor code, and in the "annotations" (highlighter) plugin which handles DOM selections / ranges.

See:
https://github.com/IDPF/epub-revision/issues/555#issuecomment-144962949

Verbatim reproduction of the comment linked above:

---

Yes, it is also my understanding that CFI "character offsets" are expressed relative to the number of 16-bit _code units_ within strings encoded as UTF-16, not in terms of the number of _code points_ (which are commonly referred to as Unicode "characters").

This way, the processing of surrogate pairs (i.e. two 16-bit code units) for code points outside of the Unicode BMP (Basic Multilingual Plane) must be explicit (no implicit normalization / conversion), which is compatible with the DOM Ranges API ( http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ), and the JavaScript String API (e.g. `.length`, `.substr()`, see http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type + http://www.ecma-international.org/ecma-262/6.0/#sec-string-objects, and `.charAt()` http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.charat vs. `.codePointAt()` http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.codepointat ).
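
To make the code unit / code point distinction concrete, here is a minimal sketch using only standard String methods. The sample string and variable names are mine, chosen for illustration; U+1D306 is simply one example of a code point outside the BMP.

```javascript
// U+1D306 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const text = "a\u{1D306}b";

// .length counts 16-bit code units, which is what CFI offsets count.
const codeUnitCount = text.length;

// The string iterator walks code points, so Array.from() counts "characters".
const codePointCount = Array.from(text).length;

// charCodeAt() exposes the individual halves of the surrogate pair...
const highSurrogate = text.charCodeAt(1);
const lowSurrogate = text.charCodeAt(2);

// ...whereas codePointAt() combines them when called on the high surrogate.
const codePoint = text.codePointAt(1);

console.log(codeUnitCount, codePointCount);
console.log(highSurrogate.toString(16), lowSurrogate.toString(16), codePoint.toString(16));
```

The two counts differ by one for each code point outside the BMP, which is exactly the gap an offset computation has to be explicit about.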

Additional literature on the subject:
https://mathiasbynens.be/notes/javascript-encoding
http://www.2ality.com/2013/09/javascript-unicode.html

Popular library to deal with Unicode in JavaScript:
https://github.com/bestiejs/punycode.js#punycodeucs2
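
Without pulling in a library, the mapping from a code point count to a code unit offset can be sketched with the built-in string iterator. The function name below is hypothetical (it is not part of punycode.js or readium-cfi-js), and the sketch assumes the caller wants the offset clamped to the end of the string when the index is too large.

```javascript
// Hypothetical helper: given an index counted in code points, return the
// equivalent offset counted in UTF-16 code units from the start of the string.
function codeUnitOffsetFromCodePointIndex(text, codePointIndex) {
  let offset = 0;
  let seen = 0;
  for (const ch of text) { // the iterator yields one code point at a time
    if (seen === codePointIndex) break;
    offset += ch.length; // 1 for a BMP code point, 2 for a surrogate pair
    seen += 1;
  }
  return offset; // equals text.length if the index runs past the end
}

console.log(codeUnitOffsetFromCodePointIndex("a\u{1D306}b", 2));
```

A consumer that thinks in "characters" would need a conversion of this kind before handing an offset to code that counts code units.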

In other words, a CFI library (such as Readium's own https://github.com/readium/readium-cfi-js ) effectively treats strings of characters as though they were encoded using `UCS-2` (16-bit / 2-byte Universal Character Set), unaware of any UTF-16 surrogate pairs they may contain.
Note that CFI "assertions" for text locations / ranges (i.e. based on the aforementioned CFI character offsets, which are counts of UTF-16 code units) are URI-escaped via UTF-8 encoding. See:
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-text-location
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-epubcfi-escaping
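
As a rough illustration of the UTF-8 percent-encoding step only (any CFI-specific escaping described in the linked section is not shown), the standard `encodeURIComponent()` function can be used. The variable names are mine. The sketch also shows why slicing text in the middle of a surrogate pair matters here: a lone surrogate cannot be encoded at all.

```javascript
// One code point outside the BMP: two UTF-16 code units, four UTF-8 bytes.
const astral = "\u{1D306}";

// encodeURIComponent() percent-encodes the UTF-8 bytes of the code point.
const escaped = encodeURIComponent(astral);

// A string cut between the two halves of the pair holds a lone surrogate,
// which has no UTF-8 form, so encodeURIComponent() throws a URIError.
let loneSurrogateError = null;
try {
  encodeURIComponent(astral.slice(0, 1));
} catch (e) {
  loneSurrogateError = e;
}

console.log(escaped, loneSurrogateError !== null);
```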

So, back to the proposed specification updates:

http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-terminating-char

> For XML character data, the offset is zero-based and always refers to a position between characters, so 0 means before the first character and a number equal to the total UTF-16 length means after the last character. A character offset value greater than the UTF-16 length of the available text must not be specified.

May I suggest the following edits? (I added a non-normative note)

> In this specification, the definition of an "offset" within XML character data is based on the UTF-16 text encoding, whereby each "character" (Unicode code point) may be represented using a single 16-bit code unit, or two units (a surrogate pair, for Unicode characters outside of the BMP / Basic Multilingual Plane) [ http://www.unicode.org ]. A CFI "character offset" is a zero-based number that refers to a position _between_ UTF-16 code units. Here, the "length" of the text is the total count of 16-bit units. Offset zero therefore means _before_ the first 16-bit unit, and a number equal to the "length" of the text means _after_ the last 16-bit unit. An offset value greater than the "length" of the text must not be specified. NOTE to implementors: counting the number of text "characters" based on UTF-16 _code units_ (instead of Unicode _code points_) is compatible with the DOM Range model [ http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ], and with the ECMA / JavaScript String API [ http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type ].

---
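
For our own code review, the checks implied by the proposed wording could look roughly like the sketch below. Both function names are hypothetical and do not exist in readium-cfi-js. Note that the proposed text does not say whether an offset that falls between the two halves of a surrogate pair is acceptable; the second helper only detects that case so that calling code can decide what to do.

```javascript
// Hypothetical check: an offset is in range when it is an integer position
// between code units, from 0 (before the first) to text.length (after the last).
function isOffsetInRange(text, offset) {
  return Number.isInteger(offset) && offset >= 0 && offset <= text.length;
}

// Hypothetical check: true when the offset sits between a high surrogate
// (0xD800-0xDBFF) and the low surrogate (0xDC00-0xDFFF) that follows it.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  return before >= 0xD800 && before <= 0xDBFF &&
         after >= 0xDC00 && after <= 0xDFFF;
}

const nodeText = "a\u{1D306}b"; // 4 code units, 3 code points
console.log(isOffsetInRange(nodeText, 4), isOffsetInRange(nodeText, 5));
console.log(splitsSurrogatePair(nodeText, 2));
```

The same reasoning should apply to the highlighter plugin, since DOM Range offsets in text nodes are also counted in 16-bit units.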

