This is kind of a "gotcha" type issue, but I think my reasoning is sound.
nanojson parses literal (unescaped) characters correctly, and goes to great
lengths to make sure the UTF-8 is valid, in the `consumeTokenStringUtf8Char`
method. But when characters arrive as `\uXXXX` escapes rather than literal
characters, it uses much simpler logic.
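Roughly, that simplified escape handling amounts to the following sketch (an illustration of the approach, not the actual nanojson source): read four hex digits, build one 16-bit value, and append it as a char with no surrogate validation.

```java
// Naive \uXXXX handling: turn the four hex digits into a single UTF-16 code
// unit and append it, without checking whether surrogates are correctly paired.
static void appendUnicodeEscape(StringBuilder out, CharSequence hex4) {
    int value = 0;
    for (int i = 0; i < 4; i++)
        value = (value << 4) | Character.digit(hex4.charAt(i), 16);
    out.append((char) value); // happily emits lone or mismatched surrogates
}
```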
That logic is not sufficient, because there are illegal combinations of escaped
code units (as well as illegal lone code units), which I will now try to show.
The JSON spec states that any character may be escaped. The method
to escape characters outside of the Basic Multilingual Plane (U+0000
through U+FFFF) is as follows:
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a 12-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
In other words, characters above U+FFFF are encoded as surrogate pairs.
Thus the JSON string "\uD800\uDC00" should decode to the single Unicode
character U+10000.
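For reference, a surrogate pair decodes as codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00). A quick check of both examples, using only standard `java.lang.Character` methods:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        System.out.printf("U+%X%n", Character.toCodePoint('\uD834', '\uDD1E')); // U+1D11E (G clef)
        System.out.printf("U+%X%n", Character.toCodePoint('\uD800', '\uDC00')); // U+10000
    }
}
```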
The following JSON strings:
1.) "\uD800"
2.) "\uD800\uCC00"
should both fail parsing, because the high surrogate (U+D800) is:
in 1.) not followed by any other escaped code unit
in 2.) followed by \uCC00, which is not a low surrogate (low surrogates are U+DC00 through U+DFFF)
(A sketch of the pairing check this requires follows below.)
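Here is a minimal sketch of that pairing check, assuming the escape handler has already collected the 16-bit values of consecutive `\uXXXX` escapes into an array (illustrative only, not a patch against nanojson's internals):

```java
// Illustrative check: given the UTF-16 code units produced by consecutive
// \uXXXX escapes, reject lone high surrogates, mismatched pairs, and lone
// low surrogates; combine valid pairs into a single code point.
static String combineEscapedUnits(char[] units) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < units.length; i++) {
        char c = units[i];
        if (Character.isHighSurrogate(c)) {
            if (i + 1 == units.length || !Character.isLowSurrogate(units[i + 1]))
                throw new IllegalArgumentException("high surrogate not followed by a low surrogate");
            out.appendCodePoint(Character.toCodePoint(c, units[++i]));
        } else if (Character.isLowSurrogate(c)) {
            throw new IllegalArgumentException("unpaired low surrogate");
        } else {
            out.append(c);
        }
    }
    return out.toString();
}
```

With a check like that, { '\uD800', '\uDC00' } combines to U+10000, while { '\uD800' } and { '\uD800', '\uCC00' } both throw.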
In other words, the following tests should not fail:
```java
@Test
public void testFailBustedString8() {
    try {
        // High-surrogate character not followed by another Unicode character
        JsonParser.any().from("\"\\uD800\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong char, but is irrelevant for the issue
    }
}

@Test
public void testFailBustedString9() {
    try {
        // High-surrogate character not followed by a low-surrogate character
        JsonParser.any().from("\"\\uD800\\uCC00\"");
        fail();
    } catch (JsonParserException e) {
        testException(e, 1, 7); // 7 may be the wrong char, but is irrelevant for the issue
    }
}
```
Thanks for your time, let me know if anything is not clear.
Great explanation. You are correct - the parser probably should not accept these. However, since JavaScript's JSON methods will stringify and parse illegal surrogates (at least Firefox did in my quick test), this should probably be gated behind a parser flag that isn't set by default.
Ideally the parser and writer would have a "strict UTF-16 surrogate" mode, i.e.:
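Something along these lines, purely as a sketch of the shape of the option (the builder method below is hypothetical, not an existing nanojson API):

```java
// Hypothetical opt-in flag; withStrictUtf16Surrogates() does not exist in nanojson today.
Object value = JsonParser.any()
        .withStrictUtf16Surrogates()     // reject lone or mismatched \uXXXX surrogates
        .from("\"\\uD800\\uCC00\"");     // would then throw JsonParserException
```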