Handling of UTF-16 surrogate pairs / 2x 16-bit code units, for code points outside Unicode BMP (Basic Multilingual Plane) #42
Description
This is not a bug report, but a reminder to check whether we are "doing the right thing" in our CFI generator / processor code, and in the "annotations" (highlighter) plugin which handles DOM selections / ranges.
See:
w3c/epub-specs#555 (comment)
Verbatim reproduction of the comment linked above:
Yes, it is also my understanding that CFI "character offsets" are expressed as a number of 16-bit code units within strings encoded as UTF-16, not as a number of code points (which are commonly referred to as Unicode "characters").
This way, the processing of surrogate pairs (i.e. two 16-bit code units) for code points outside of the Unicode BMP (Basic Multilingual Plane) must be explicit (no implicit normalization / conversion), which is compatible with the DOM Ranges API ( http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ), and the JavaScript String API (e.g. .length, .substr(), see http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type + http://www.ecma-international.org/ecma-262/6.0/#sec-string-objects, and .charAt() http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.charat vs. .codePointAt() http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.codepointat ).
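To make the code unit vs. code point distinction concrete, here is a minimal sketch using only standard JavaScript String methods. The sample string and variable names are mine, chosen for illustration; U+1D306 is simply one example of a code point outside the BMP.

```javascript
// "a", then U+1D306 (outside the BMP, so it needs a surrogate pair), then "b".
const sample = "a\u{1D306}b";

// .length counts 16-bit code units: 1 + 2 + 1.
const codeUnitCount = sample.length;

// String iteration (used by Array.from) walks code points: 1 + 1 + 1.
const codePointCount = Array.from(sample).length;

// .charCodeAt() exposes the individual code units of the surrogate pair.
const highSurrogate = sample.charCodeAt(1); // 0xD834
const lowSurrogate = sample.charCodeAt(2);  // 0xDF06

// .codePointAt() decodes the pair when pointed at the high surrogate.
const decoded = sample.codePointAt(1);      // 0x1D306

console.log(codeUnitCount, codePointCount); // 4 3
```

A CFI character offset follows the first count (code units), which is also what DOM Range offsets into text nodes use.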
Additional literature on the subject:
https://mathiasbynens.be/notes/javascript-encoding
http://www.2ality.com/2013/09/javascript-unicode.html
Popular library to deal with Unicode in Javascript:
https://github.com/bestiejs/punycode.js#punycodeucs2
In other words, a CFI library (such as Readium's own https://github.com/readium/readium-cfi-js ) effectively treats strings of characters as though they were encoded using UCS-2 (16-bit / 2-byte Universal Character Set), unaware of any UTF-16 surrogate pairs they may contain.
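One practical consequence of this UCS-2-style view is that a code-unit offset can land between the two halves of a surrogate pair. The helper below is a hypothetical sketch (it is not part of readium-cfi-js, and the name `splitsSurrogatePair` is my own) showing how such an offset could be detected, should we decide that the generator or the highlighter needs to care.

```javascript
// Hypothetical helper: returns true when `offset` (counted in UTF-16 code
// units) falls between a high surrogate and the low surrogate that follows it.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) {
    return false; // the start and end of the string can never split a pair
  }
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const isHigh = before >= 0xD800 && before <= 0xDBFF;
  const isLow = after >= 0xDC00 && after <= 0xDFFF;
  return isHigh && isLow;
}

const text = "a\u{1D306}b"; // code units: "a", high surrogate, low surrogate, "b"
console.log(splitsSurrogatePair(text, 1)); // false: between "a" and the pair
console.log(splitsSurrogatePair(text, 2)); // true: in the middle of the pair
console.log(splitsSurrogatePair(text, 3)); // false: between the pair and "b"
```

Whether such an offset should be rejected, snapped to a neighbouring boundary, or passed through untouched is a policy question this sketch does not answer.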
Note that CFI "assertions" for text locations / ranges (i.e. based on the aforementioned CFI character offsets, which are counts of UTF-16 code units) are URI-escaped via UTF-8 encoding. See:
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-text-location
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-epubcfi-escaping
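As a rough illustration of the two different units involved, the sketch below percent-encodes a non-BMP character with the standard `encodeURIComponent` function, which encodes via UTF-8. It only demonstrates the URI-escaping step; the CFI-specific escaping of special characters described in the spec is not shown, and the variable names are my own.

```javascript
const nonBmp = "\u{1D306}"; // one code point, two UTF-16 code units

// For CFI character offsets, this character counts as 2.
const offsetUnits = nonBmp.length;

// For URI escaping, the same character becomes four UTF-8 bytes.
const escaped = encodeURIComponent(nonBmp);
console.log(escaped); // %F0%9D%8C%86

// A lone surrogate (e.g. text cut in the middle of a pair) cannot be
// converted to UTF-8, so encodeURIComponent throws a URIError.
let loneSurrogateFails = false;
try {
  encodeURIComponent(nonBmp.charAt(0));
} catch (e) {
  loneSurrogateFails = e instanceof URIError;
}
console.log(loneSurrogateFails); // true
```

The last part suggests that text assertions built by slicing at arbitrary code-unit offsets may need a guard before they are escaped.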
So, back to the proposed specification updates:
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-terminating-char
For XML character data, the offset is zero-based and always refers to a position between characters, so 0 means before the first character and a number equal to the total UTF-16 length means after the last character. A character offset value greater than the UTF-16 length of the available text must not be specified.
May I suggest the following edits? (I added a non-normative note)
In this specification, the definition of an "offset" within XML character data is based on the UTF-16 text encoding, whereby each "character" (Unicode code point) is represented using either a single 16-bit code unit or two code units (a surrogate pair, for code points outside of the BMP / Basic Multilingual Plane) [ http://www.unicode.org ]. A CFI "character offset" is a zero-based number that refers to a position between UTF-16 code units. Here, the "length" of the text is the total count of 16-bit code units. Offset zero therefore means before the first 16-bit code unit, and a number equal to the "length" of the text means after the last 16-bit code unit. An offset value greater than the "length" of the text must not be specified.
NOTE to implementors: counting the number of text "characters" based on UTF-16 code units (instead of Unicode code points) is compatible with the DOM Range model [ http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ], and with the ECMAScript / JavaScript String API [ http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type ].
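For what it is worth, the offset rule in the proposed wording maps directly onto JavaScript string lengths. The following is a minimal sketch, assuming the text of a node is available as a plain string; `isValidCharacterOffset` and `codePointIndexToOffset` are hypothetical names of mine, not functions from any existing CFI library.

```javascript
// Valid offsets are the integers 0..length inclusive, where length is the
// number of UTF-16 code units (which is exactly what String.length returns).
function isValidCharacterOffset(text, offset) {
  return Number.isInteger(offset) && offset >= 0 && offset <= text.length;
}

// Converts a position counted in code points (what users tend to think of as
// "characters") into the code-unit offset that the proposed wording expects.
function codePointIndexToOffset(text, codePointIndex) {
  let offset = 0;
  let seen = 0;
  for (const ch of text) {          // string iteration yields code points
    if (seen === codePointIndex) break;
    offset += ch.length;            // 1 for BMP, 2 for a surrogate pair
    seen += 1;
  }
  return offset;
}

const content = "a\u{1D306}b";
console.log(isValidCharacterOffset(content, 4)); // true: after the last unit
console.log(isValidCharacterOffset(content, 5)); // false: beyond the length
console.log(codePointIndexToOffset(content, 2)); // 3: just before "b"
```

Note that the conversion helper clamps to the end of the text when the code point index is too large; a real implementation might prefer to throw.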