Editorial: revamp the way we deal with code points and bytes #247
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1299,7 +1299,7 @@ interface mixin TextDecoderCommon { | |
<ol> | ||
<li><p>Set <var>decoder</var>'s <a for=TextDecoderCommon>BOM seen</a> to true. | ||
|
||
<li><p>If <var>item</var> is U+FEFF, then <a for=iteration>continue</a>. | ||
<li><p>If <var>item</var> is U+FEFF BOM, then <a for=iteration>continue</a>. | ||
</ol> | ||
|
||
<li><p>Append <var>item</var> to <var>output</var>. | ||
|
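The BOM-skipping steps in this hunk can be sketched as follows. This is a minimal illustration, not the standard's algorithm text: the `DecoderState` class, its attribute names, and the guard condition on the "ignore BOM" and "BOM seen" flags are assumptions, since the hunk only shows the inner steps.

```python
BOM = "\uFEFF"  # U+FEFF BOM


class DecoderState:
    """Hypothetical stand-in for the decoder's flags (names are illustrative)."""

    def __init__(self, ignore_bom=False):
        self.ignore_bom = ignore_bom
        self.bom_seen = False


def serialize(decoder, items):
    """Append items to the output, dropping one leading U+FEFF BOM if allowed."""
    output = []
    for item in items:
        # Assumption: this guard is not visible in the hunk.
        if not decoder.ignore_bom and not decoder.bom_seen:
            decoder.bom_seen = True  # Set decoder's BOM seen to true.
            if item == BOM:
                continue  # If item is U+FEFF BOM, then continue.
        output.append(item)  # Append item to output.
    return "".join(output)
```

Because the flag lives on the decoder, only the first item of the first non-empty chunk is ever checked; a U+FEFF in a later chunk is kept.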
@@ -1607,7 +1607,8 @@ method steps are: | |
<var>written</var> is greater than or equal to the number of bytes in <var>result</var>, then: | ||
|
||
<ol> | ||
<li><p>If <var>item</var> is greater than U+FFFF, then increment <var>read</var> by 2. | ||
<li><p>If <var>item</var>'s <a for="code point">value</a> is greater than 0xFFFF, then | ||
increment <var>read</var> by 2. | ||
|
||
<li><p>Otherwise, increment <var>read</var> by 1. | ||
|
||
|
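To make the `read` bookkeeping concrete, here is a rough sketch of the counting rule. It assumes the input is a sequence of scalar values and that the truncated condition above compares the remaining destination space with the size of the UTF-8 result; the function name `encode_into` and the `capacity` parameter are hypothetical.

```python
def encode_into(source, capacity):
    """Return (read, written_bytes) for a destination of `capacity` bytes.

    `read` counts UTF-16 code units, so a code point whose value is greater
    than 0xFFFF counts as 2 (it would be a surrogate pair in a DOMString).
    """
    read = 0
    written = bytearray()
    for item in source:
        result = item.encode("utf-8")
        # Assumption: stop as soon as the next result no longer fits.
        if capacity - len(written) < len(result):
            break
        read += 2 if ord(item) > 0xFFFF else 1
        written += result
    return read, bytes(written)
```

The comparison is on the code point's value (an integer), which is the distinction the revised wording makes explicit.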
@@ -1915,7 +1916,7 @@ constructor steps are: | |
<p class=note>{{DOMString}}, as well as an <a for=/>I/O queue</a> of code units rather than scalar | ||
values, are used here so that a surrogate pair that is split between chunks can be reassembled into | ||
the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular, | ||
lone surrogates will be replaced with U+FFFD. | ||
lone surrogates will be replaced with U+FFFD (�). | ||
|
||
<li><p>Let <var>output</var> be the <a for=/>I/O queue</a> of bytes « <a>end-of-queue</a> ». | ||
|
||
|
@@ -1973,13 +1974,13 @@ constructor steps are: | |
|
||
<li><p><a>Prepend</a> <var>item</var> to <var>input</var>. | ||
|
||
<li><p>Return U+FFFD. | ||
<li><p>Return U+FFFD (�). | ||
</ol> | ||
|
||
<li><p>If <var>item</var> is in the range U+D800 to U+DBFF, inclusive, then set <a>pending high | ||
surrogate</a> to <var>item</var> and return <a>continue</a>. | ||
|
||
<li><p>If <var>item</var> is in the range U+DC00 to U+DFFF, inclusive, then return U+FFFD. | ||
<li><p>If <var>item</var> is in the range U+DC00 to U+DFFF, inclusive, then return U+FFFD (�). | ||
|
||
<li><p>Return <var>item</var>. | ||
</ol> | ||
|
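The surrogate handling above can be sketched roughly like this. Only the last few steps appear in the hunk; the handling of a pending high surrogate followed by a low surrogate, and the names `ScalarAssembler`, `convert`, and `feed`, are assumptions added for illustration. Code units and scalar values are modelled as plain integers.

```python
REPLACEMENT = 0xFFFD  # U+FFFD
CONTINUE = object()   # stand-in for the "continue" signal


class ScalarAssembler:
    """Hypothetical holder for the pending high surrogate."""

    def __init__(self):
        self.pending_high = None

    def convert(self, item, queue):
        if self.pending_high is not None:
            high, self.pending_high = self.pending_high, None
            if 0xDC00 <= item <= 0xDFFF:
                # Assumption: standard UTF-16 surrogate pair arithmetic.
                return 0x10000 + ((high - 0xD800) << 10) + (item - 0xDC00)
            queue.insert(0, item)  # Prepend item to input.
            return REPLACEMENT     # Return U+FFFD.
        if 0xD800 <= item <= 0xDBFF:
            self.pending_high = item
            return CONTINUE
        if 0xDC00 <= item <= 0xDFFF:
            return REPLACEMENT
        return item


def feed(assembler, code_units):
    """Process one chunk; state carries over to the next call."""
    queue = list(code_units)
    out = []
    while queue:
        result = assembler.convert(queue.pop(0), queue)
        if result is not CONTINUE:
            out.append(result)
    return out
```

Feeding `[0xD83D]` and then `[0xDE00]` to the same assembler yields nothing for the first chunk and the reassembled scalar value for the second, which is the split-pair case the note describes. End-of-stream flushing of a dangling high surrogate is not covered here.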
@@ -2390,10 +2391,10 @@ consumers of content generated with <a>GBK</a>'s <a for=/>encoder</a>. | |
<li><p>Return <a>error</a>. | ||
</ol> | ||
|
||
<li><p>If <var>byte</var> is an <a>ASCII byte</a>, return | ||
a code point whose value is <var>byte</var>. | ||
<li><p>If <var>byte</var> is an <a>ASCII byte</a>, then return a <a>code point</a> whose | ||
<a for="code point">value</a> is <var>byte</var>. | ||
|
||
<li><p>If <var>byte</var> is 0x80, return code point U+20AC. | ||
<li><p>If <var>byte</var> is 0x80, then return U+20AC (€). | ||
|
||
<li><p>If <var>byte</var> is in the range 0x81 to 0xFE, inclusive, set | ||
<a>gb18030 first</a> to <var>byte</var> and return <a>continue</a>. | ||
|
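As a small illustration of the first-byte dispatch shown in this hunk, the sketch below covers only these three steps. The `LeadState` class and the trailing error return for any other byte are assumptions, and the rest of the gb18030 decoder is deliberately left out.

```python
CONTINUE = "continue"
ERROR = "error"


class LeadState:
    """Hypothetical holder for the 'gb18030 first' variable."""

    def __init__(self):
        self.first = 0x00


def handle_first_byte(state, byte):
    if byte <= 0x7F:
        # ASCII byte: return a code point whose value is the byte's value.
        return chr(byte)
    if byte == 0x80:
        return "\u20AC"  # U+20AC (€)
    if 0x81 <= byte <= 0xFE:
        state.first = byte
        return CONTINUE
    return ERROR  # Assumption: anything else (0xFF) is an error.
```

The `chr(byte)` call is where the byte-to-code-point conversion happens, which the revised wording spells out as "a code point whose value is byte".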
@@ -3345,13 +3346,15 @@ https://stackoverflow.com/questions/6986789/why-are-some-bytes-prefixed-with-0xf | |
<var>ioQueue</var> and <var>byte</var>, runs these steps: | ||
|
||
<ol> | ||
<li><p>If <var>byte</var> is <a>end-of-queue</a>, return | ||
<a>finished</a>. | ||
<li><p>If <var>byte</var> is <a>end-of-queue</a>, then return <a>finished</a>. | ||
|
||
<li><p>If <var>byte</var> is an <a>ASCII byte</a>, return | ||
a code point whose value is <var>byte</var>. | ||
<li><p>Let <var>byteValue</var> be <var>byte</var>'s <a for=byte>value</a>. | ||
I realize that "code point's value" is a different integer type than "byte's value", but we mean the number in any case.
||
|
||
<li><p>Return a code point whose value is 0xF780 + <var>byte</var> − 0x80. | ||
<li><p>If <var>byte</var> is an <a>ASCII byte</a>, then return a <a>code point</a> whose | ||
<a for="code point">value</a> is <var>byteValue</var>. | ||
|
||
<li><p>Return a <a>code point</a> whose <a for="code point">value</a> is | ||
0xF780 + <var>byteValue</var> − 0x80. | ||
I see the problem. You don't want prose here. But can't we just say [...]? Is there a reason I'm not seeing for why we don't just make the number [...]?

We've had some cases where we want to distinguish bytes from numbers. So the question is whether we want to do that here as well. And I guess in some sense we do, since we want to return code points or bytes, but a lot of the calculations are on numbers. I think we could use byte in the calculation directly (as we already did), but it wouldn't really be logically consistent with how we talk about bytes and numbers elsewhere in the web platform. (I guess another way would be to say that in equations they are cast to their value.)

We could define implicit conversions code point → number and byte → number (whatwg/infra#319) and perhaps the other way around too. But even if we don't, we could use short algorithmic phrases inside the formula: "0xF780 + (byte's value) − 0x80". There are other formulas in the standard that use byte or code point values directly, though, and they should be changed accordingly. (Interestingly, there are formulas dealing with code units around TextEncoder and TextEncoderStream, which don't have this problem because code units seem to be defined directly as a number type.)

FWIW I intuitively like making the code point <-> byte/number conversions explicit, and don't see as much of a need for distinguishing bytes and numbers. (I'd be OK defining bytes as a subtype of numbers, if we ever make progress on defining numbers.)
||
</ol> | ||
|
||
|
||
|
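The decoder steps above, including the explicit "byte's value" step under discussion, might look like this in code. It is a sketch under assumptions: `None` stands in for end-of-queue, a one-element `bytes` object stands in for a byte, and the function name is hypothetical.

```python
from typing import Optional

FINISHED = "finished"


def x_user_defined_decode(byte: Optional[bytes]):
    """Handler for one byte; `None` models end-of-queue (an assumption)."""
    if byte is None:
        return FINISHED
    byte_value = byte[0]  # Let byteValue be byte's value (an integer).
    if byte_value <= 0x7F:
        # ASCII byte: code point whose value is byteValue.
        return chr(byte_value)
    # Code point whose value is 0xF780 + byteValue - 0x80.
    return chr(0xF780 + byte_value - 0x80)
```

Python happens to mirror the distinction being debated: `byte` is a `bytes` object, `byte[0]` is a number, and `chr()` is the explicit number-to-code-point conversion.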
@@ -3361,14 +3364,15 @@ https://stackoverflow.com/questions/6986789/why-are-some-bytes-prefixed-with-0xf | |
<var>ioQueue</var> and <var>code point</var>, runs these steps: | ||
|
||
<ol> | ||
<li><p>If <var>code point</var> is <a>end-of-queue</a>, return | ||
<a>finished</a>. | ||
<li><p>If <var>code point</var> is <a>end-of-queue</a>, then return <a>finished</a>. | ||
|
||
<li><p>If <var>code point</var> is an <a>ASCII code point</a>, return | ||
a byte whose value is <var>code point</var>. | ||
<li><p>Let <var>codePointValue</var> be <var>code point</var>'s <a for="code point">value</a>. | ||
|
||
<li><p>If <var>code point</var> is an <a>ASCII code point</a>, then return a <a>byte</a> whose | ||
<a for=byte>value</a> is <var>codePointValue</var>. | ||
|
||
<li><p>If <var>code point</var> is in the range U+F780 to U+F7FF, inclusive, return | ||
a byte whose value is <var>code point</var> − 0xF780 + 0x80. | ||
<li><p>If <var>codePointValue</var> is in the range 0xF780 to 0xF7FF, inclusive, then return a | ||
<a>byte</a> whose <a for=byte>value</a> is <var>codePointValue</var> − 0xF780 + 0x80. | ||
And so on.
||
|
||
<li><p>Return <a>error</a> with <var>code point</var>. | ||
</ol> | ||
|
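For comparison, the encoder direction can be sketched the same way. Again `None` models end-of-queue and the function name and the `(ERROR, code_point)` tuple are illustrative assumptions, not the standard's data model.

```python
from typing import Optional

FINISHED = "finished"
ERROR = "error"


def x_user_defined_encode(code_point: Optional[str]):
    """Handler for one code point; `None` models end-of-queue (an assumption)."""
    if code_point is None:
        return FINISHED
    value = ord(code_point)  # Let codePointValue be code point's value.
    if value <= 0x7F:
        # ASCII code point: byte whose value is codePointValue.
        return bytes([value])
    if 0xF780 <= value <= 0xF7FF:
        # Byte whose value is codePointValue - 0xF780 + 0x80.
        return bytes([value - 0xF780 + 0x80])
    return (ERROR, code_point)  # Return error with code point.
```

The range check and the arithmetic both operate on the integer `value`, matching the revised steps that compare against 0xF780 to 0xF7FF rather than U+F780 to U+F7FF.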
In Charmod we often followed the convention:
(with the [U+xxxx character name] part styled distinctly). I say "often" because I willfully ignored the convention whenever it reduced clarity, particularly with long sequences used in this or that example. For examples like this, you might consider something similar, since it makes the text unambiguous?

OTOH, I find this pretty clear and am not sure that the Charmod style adds that much. I like quoting the character like this when it's printable.

We made up our own convention in https://infra.spec.whatwg.org/#code-points since we found the one in Charmod a bit too verbose, iirc.