Editorial: revamp the way we deal with code points and bytes #247

Draft pull request; wants to merge 2 commits into main.
42 changes: 23 additions & 19 deletions encoding.bs
@@ -1299,7 +1299,7 @@ interface mixin TextDecoderCommon {
<ol>
<li><p>Set <var>decoder</var>'s <a for=TextDecoderCommon>BOM seen</a> to true.

<li><p>If <var>item</var> is U+FEFF, then <a for=iteration>continue</a>.
<li><p>If <var>item</var> is U+FEFF BOM, then <a for=iteration>continue</a>.
</ol>

<li><p>Append <var>item</var> to <var>output</var>.
@@ -1607,7 +1607,8 @@ method steps are:
<var>written</var> is greater than or equal to the number of bytes in <var>result</var>, then:

<ol>
<li><p>If <var>item</var> is greater than U+FFFF, then increment <var>read</var> by 2.
<li><p>If <var>item</var>'s <a for="code point">value</a> is greater than 0xFFFF, then
increment <var>read</var> by 2.

<li><p>Otherwise, increment <var>read</var> by 1.

@@ -1915,7 +1916,7 @@ constructor steps are:
<p class=note>{{DOMString}}, as well as an <a for=/>I/O queue</a> of code units rather than scalar
values, are used here so that a surrogate pair that is split between chunks can be reassembled into
the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular,
lone surrogates will be replaced with U+FFFD.
lone surrogates will be replaced with U+FFFD (�).
Contributor

In Charmod we often followed the convention:

� [U+FFFD REPLACEMENT CHARACTER]

(with the [U+xxxx character name] part styled distinctly). I say "often" because I willfully ignored the convention whenever it reduced clarity, particularly with long sequences used in this or that example. For examples like this, you might consider something similar, since it makes the text unambiguous?

OTOH, I find this pretty clear and am not sure that the charmod style adds that much. I like quoting the character like this when it's printable.

Member Author

We made up our own convention in https://infra.spec.whatwg.org/#code-points since we found the one in Charmod a bit too verbose, iirc.


<li><p>Let <var>output</var> be the <a for=/>I/O queue</a> of bytes « <a>end-of-queue</a> ».

@@ -1973,13 +1974,13 @@ constructor steps are:

<li><p><a>Prepend</a> <var>item</var> to <var>input</var>.

<li><p>Return U+FFFD.
<li><p>Return U+FFFD (�).
</ol>

<li><p>If <var>item</var> is in the range U+D800 to U+DBFF, inclusive, then set <a>pending high
surrogate</a> to <var>item</var> and return <a>continue</a>.

<li><p>If <var>item</var> is in the range U+DC00 to U+DFFF, inclusive, then return U+FFFD.
<li><p>If <var>item</var> is in the range U+DC00 to U+DFFF, inclusive, then return U+FFFD (�).

<li><p>Return <var>item</var>.
</ol>
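
For illustration only, and not part of the patch: a minimal Python sketch of the surrogate handling described in the steps above. The class name `CodeUnitConverter` and its `convert` and `flush` methods are hypothetical, and the branch that combines a pending high surrogate with a following low surrogate lies outside the lines shown in this hunk, so it is an assumption about the surrounding algorithm.

```python
REPLACEMENT = 0xFFFD  # U+FFFD (�)


class CodeUnitConverter:
    """Turns chunks of 16-bit code units into scalar values (hypothetical helper)."""

    def __init__(self):
        self.pending_high_surrogate = None

    def convert(self, chunk):
        """Convert one chunk; a high surrogate at the end is kept for the next chunk."""
        out = []
        i = 0
        while i < len(chunk):
            item = chunk[i]
            i += 1
            if self.pending_high_surrogate is not None:
                high = self.pending_high_surrogate
                self.pending_high_surrogate = None
                if 0xDC00 <= item <= 0xDFFF:
                    # Assumed step: combine the surrogate pair into one scalar value.
                    out.append(0x10000 + ((high - 0xD800) << 10) + (item - 0xDC00))
                    continue
                # "Prepend item to input": step back so item is processed again,
                # then emit U+FFFD for the lone high surrogate.
                i -= 1
                out.append(REPLACEMENT)
                continue
            if 0xD800 <= item <= 0xDBFF:
                # High surrogate: remember it and wait for the next code unit.
                self.pending_high_surrogate = item
                continue
            if 0xDC00 <= item <= 0xDFFF:
                # Lone low surrogate.
                out.append(REPLACEMENT)
                continue
            out.append(item)
        return out

    def flush(self):
        """At end of input, a still-pending high surrogate becomes U+FFFD."""
        if self.pending_high_surrogate is not None:
            self.pending_high_surrogate = None
            return [REPLACEMENT]
        return []
```

With this sketch, a pair split across chunks (`convert([0xD83D])` followed by `convert([0xDE00])`) yields a single scalar value, which is the case the note about {{DOMString}} describes.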
@@ -2390,10 +2391,10 @@ consumers of content generated with <a>GBK</a>'s <a for=/>encoder</a>.
<li><p>Return <a>error</a>.
</ol>

<li><p>If <var>byte</var> is an <a>ASCII byte</a>, return
a code point whose value is <var>byte</var>.
<li><p>If <var>byte</var> is an <a>ASCII byte</a>, then return a <a>code point</a> whose
<a for="code point">value</a> is <var>byte</var>.

<li><p>If <var>byte</var> is 0x80, return code point U+20AC.
<li><p>If <var>byte</var> is 0x80, then return U+20AC (€).

<li><p>If <var>byte</var> is in the range 0x81 to 0xFE, inclusive, set
<a>gb18030 first</a> to <var>byte</var> and return <a>continue</a>.
@@ -3345,13 +3346,15 @@ https://stackoverflow.com/questions/6986789/why-are-some-bytes-prefixed-with-0xf
<var>ioQueue</var> and <var>byte</var>, runs these steps:

<ol>
<li><p>If <var>byte</var> is <a>end-of-queue</a>, return
<a>finished</a>.
<li><p>If <var>byte</var> is <a>end-of-queue</a>, then return <a>finished</a>.

<li><p>If <var>byte</var> is an <a>ASCII byte</a>, return
a code point whose value is <var>byte</var>.
<li><p>Let <var>byteValue</var> be <var>byte</var>'s <a for=byte>value</a>.
Contributor

is byteValue really needed vs. just saying things like:

If byte is an ASCII byte, then return a code point whose value is byte's value.

I realize that "code point's value" is a different integer type than "byte's value", but we mean the number in any case.


<li><p>Return a code point whose value is 0xF780 + <var>byte</var> &minus; 0x80.
<li><p>If <var>byte</var> is an <a>ASCII byte</a>, then return a <a>code point</a> whose
<a for="code point">value</a> is <var>byteValue</var>.

<li><p>Return a <a>code point</a> whose <a for="code point">value</a> is
0xF780 + <var>byteValue</var> &minus; 0x80.
Contributor

I see the problem. You don't want prose here. But can't we just say 0xF780 + byte - 0x80?

Is there a reason I'm not seeing for why we don't just make the number 0xF700? Is the reason to emphasize that we're trying to get to/from bytes >= 0x80?

Member Author

@annevk, Nov 3, 2020

We've had some cases where we want to distinguish bytes from numbers. So the question is whether we want to do that here as well. And I guess in some sense we do since we want to return code points or bytes, but a lot of the calculations are on numbers.

I think we could use byte in the calculation directly (as we already did), but it wouldn't really be logically consistent with how we talk about bytes and numbers elsewhere in the web platform.

(I guess another way would be that we say that in equations they are cast to their value.)

Member

We could define implicit conversions code point → number and byte → number (whatwg/infra#319) and perhaps the other way around too. But even if we don't, we could use short algorithmic phrases inside the formula: "0xF780 + (byte's value) − 0x80".

There are other formulas in the standard that use byte or code point values directly, though, and they should be changed accordingly. (Interestingly, there are formulas dealing with code units around TextEncoder and TextEncoderStream, which don't have this problem because code units seem to be defined directly as a number type.)

Member

FWIW I intuitively like making the code point <—> byte/number conversions explicit, and don't see as much of a need for distinguishing bytes and numbers. (I'd be OK defining bytes as a subtype of numbers, if we ever make progress on defining numbers.)

</ol>
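
For illustration only, and not part of the patch: a minimal Python sketch of the decoder steps above. The function name `x_user_defined_decode` is hypothetical. Iterating over a Python `bytes` object yields integers, which plays the role of <var>byteValue</var> here.

```python
def x_user_defined_decode(data: bytes) -> str:
    """Map each byte to a code point following the steps above (sketch)."""
    code_points = []
    for byte_value in data:  # each element is the byte's integer value
        if byte_value <= 0x7F:
            # ASCII byte: the code point's value equals the byte's value.
            code_points.append(chr(byte_value))
        else:
            # Bytes 0x80 to 0xFF map to U+F780 to U+F7FF.
            code_points.append(chr(0xF780 + byte_value - 0x80))
    return "".join(code_points)
```

As raised in the review thread, `0xF780 + byte_value - 0x80` is arithmetically the same as `0xF700 + byte_value`; the sketch keeps the form used in the diff.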


@@ -3361,14 +3364,15 @@ https://stackoverflow.com/questions/6986789/why-are-some-bytes-prefixed-with-0xf
<var>ioQueue</var> and <var>code point</var>, runs these steps:

<ol>
<li><p>If <var>code point</var> is <a>end-of-queue</a>, return
<a>finished</a>.
<li><p>If <var>code point</var> is <a>end-of-queue</a>, then return <a>finished</a>.

<li><p>If <var>code point</var> is an <a>ASCII code point</a>, return
a byte whose value is <var>code point</var>.
<li><p>Let <var>codePointValue</var> be <var>code point</var>'s <a for="code point">value</a>.

<li><p>If <var>code point</var> is an <a>ASCII code point</a>, then return a <a>byte</a> whose
<a for=byte>value</a> is <var>codePointValue</var>.

<li><p>If <var>code point</var> is in the range U+F780 to U+F7FF, inclusive, return
a byte whose value is <var>code point</var> &minus; 0xF780 + 0x80.
<li><p>If <var>codePointValue</var> is in the range 0xF780 to 0xF7FF, inclusive, then return a
<a>byte</a> whose <a for=byte>value</a> is <var>codePointValue</var> &minus; 0xF780 + 0x80.
Contributor

And so on.


<li><p>Return <a>error</a> with <var>code point</var>.
</ol>
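
For illustration only, and not part of the patch: a minimal Python sketch of the encoder steps above. The function name `x_user_defined_encode` is hypothetical, and raising `ValueError` is an assumption standing in for "return error with code point"; how that error is ultimately handled is not shown in this hunk.

```python
def x_user_defined_encode(text: str) -> bytes:
    """Map each code point to a byte following the steps above (sketch)."""
    out = bytearray()
    for ch in text:
        code_point_value = ord(ch)  # the code point's integer value
        if code_point_value <= 0x7F:
            # ASCII code point: the byte's value equals the code point's value.
            out.append(code_point_value)
        elif 0xF780 <= code_point_value <= 0xF7FF:
            # U+F780 to U+F7FF map back to bytes 0x80 to 0xFF.
            out.append(code_point_value - 0xF780 + 0x80)
        else:
            # Assumption: surface the spec's "error with code point" as an exception.
            raise ValueError(f"cannot encode U+{code_point_value:04X}")
    return bytes(out)
```

Doing the range check on the integer value rather than on the code point itself mirrors the change the diff makes with <var>codePointValue</var>.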