Description
The JSON string parser currently fails when encountering UTF-16 surrogate pairs that represent Unicode characters outside the Basic Multilingual Plane (such as emojis). The parser only handles the first component of the surrogate pair and attempts to convert it directly to UTF-8, which results in an error since lone surrogates are invalid.
Expected Behavior
The parser should correctly handle UTF-16 surrogate pairs like \ud83d\udc51 (👑, crown emoji) by:
- Detecting high surrogates (0xD800-0xDBFF)
- Reading the corresponding low surrogate (0xDC00-0xDFFF)
- Converting the pair to the proper Unicode code point
- Successfully parsing the character
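For reference, the expected result can be cross-checked against the Rust standard library, which already knows how to pair surrogates. The sketch below is only an illustration of the target behavior, not the library's code; `decode_units` is a hypothetical helper name.

```rust
/// Hypothetical helper: decodes UTF-16 code units with the standard library.
/// Returns None if the sequence contains an unpaired surrogate.
fn decode_units(units: &[u16]) -> Option<String> {
    char::decode_utf16(units.iter().copied())
        .collect::<Result<String, _>>()
        .ok()
}
```

Under this sketch, `decode_units(&[0xD83D, 0xDC51])` yields the crown emoji (U+1F451), while a lone `0xD83D` or a lone `0xDC51` yields `None`.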
Current Behavior
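The parser fails with an error when it encounters the high surrogate \ud83d because it tries to convert it directly to a char. A lone surrogate is not a Unicode scalar value, so it cannot be represented as a Rust char (or encoded in UTF-8), and the conversion fails.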
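The failure can be reproduced in isolation with the standard library alone. This is a minimal sketch; `is_scalar_value` is a hypothetical name introduced for illustration and is not part of the library under discussion.

```rust
/// Hypothetical helper: true if `v` is a Unicode scalar value,
/// i.e. something a Rust `char` can hold.
fn is_scalar_value(v: u32) -> bool {
    // char::from_u32 returns None for surrogates (0xD800-0xDFFF)
    // and for values above 0x10FFFF.
    char::from_u32(v).is_some()
}
```

Here `is_scalar_value(0xD83D)` and `is_scalar_value(0xDC51)` are both false, whereas the combined code point `0x1F451` is accepted.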
Minimal Example
// Test function
fn parse_json_string(input: &str) -> Result<String, LexerError> {
    let mut lexer = Lexer::new(input, ParserLanguage::Json);
    let mut r = String::new();
    while !lexer.eof() {
        r.push(lexer.next_json_char_value()?);
    }
    Ok(r)
}
// This should parse successfully but currently fails
let json_string = r#""\ud83d\udc51""#; // Crown emoji: 👑
let result = parse_json_string(json_string);
// Currently: Err(LexerError::IncorrectUnicodeChar)
// Expected: Ok("👑")
Test Cases
// These should all work:
assert_eq!(parse_json_string(r#""\ud83d\ude00""#).unwrap(), "😀"); // Grinning face
assert_eq!(parse_json_string(r#""\ud83d\udc36""#).unwrap(), "🐶"); // Dog face
assert_eq!(parse_json_string(r#""\ud83d\udc51""#).unwrap(), "👑"); // Crown
// These should still fail (lone surrogates):
let str = parse_json_string(r#""\ud83d""#); // Lone high surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedLowSurrogate));
let str = parse_json_string(r#""\ud83d\""#); // Lone high surrogate with incomplete low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedLowSurrogate));
let str = parse_json_string(r#""\ud83d\a""#); // Lone high surrogate with invalid low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedLowSurrogate));
let str = parse_json_string(r#""\udc51""#); // Lone low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedHighSurrogate));
let str = parse_json_string(r#""\ud83d\u0181""#); // Invalid low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::InvalidLowSurrogate));
Root Cause
In the next_json_char_value function, the 'u' branch only handles single 16-bit Unicode escapes:
'u' => {
    let mut v = 0;
    for _ in 0..4 {
        let digit = self.next_hex_digit()?;
        v = v * 16 + digit;
    }
    Self::char_try_from(v) // Fails for surrogates
}
Suggested Fix
The fix involves detecting surrogate pairs and handling them properly:
- Check if the parsed value is a high surrogate (0xD800-0xDBFF)
- If so, read the next \uXXXX sequence as the low surrogate
- Validate the low surrogate is in range (0xDC00-0xDFFF)
- Convert the pair using:
0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF)
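The steps above can be sketched as a standalone function. This is only an illustration under stated assumptions: `SurrogateError` and `combine_escapes` are hypothetical names (the variants mirror those used in the test cases, not necessarily the library's actual `LexerError`), and the lexer is assumed to have already parsed each \uXXXX escape into a 16-bit value.

```rust
// Hypothetical error type mirroring the variants named in the test cases.
#[derive(Debug, PartialEq)]
enum SurrogateError {
    ExpectedLowSurrogate,  // high surrogate not followed by a \uXXXX escape
    ExpectedHighSurrogate, // low surrogate with no preceding high surrogate
    InvalidLowSurrogate,   // second escape is outside 0xDC00-0xDFFF
}

/// Hypothetical helper: `first` is the value of the escape just parsed,
/// `second` is the value of the following \uXXXX escape, if there is one.
fn combine_escapes(first: u16, second: Option<u16>) -> Result<char, SurrogateError> {
    match first {
        // High surrogate: a low surrogate must follow.
        0xD800..=0xDBFF => {
            let low = second.ok_or(SurrogateError::ExpectedLowSurrogate)?;
            if !(0xDC00..=0xDFFF).contains(&low) {
                return Err(SurrogateError::InvalidLowSurrogate);
            }
            let cp = 0x10000 + ((u32::from(first) & 0x3FF) << 10) + (u32::from(low) & 0x3FF);
            // cp always lies in 0x10000..=0x10FFFF, so the conversion cannot fail.
            Ok(char::from_u32(cp).expect("valid supplementary code point"))
        }
        // Low surrogate appearing first: nothing to pair it with.
        0xDC00..=0xDFFF => Err(SurrogateError::ExpectedHighSurrogate),
        // Any other 16-bit value is a scalar value on its own; `second` is
        // ignored here, and a real lexer would not consume it in this case.
        _ => Ok(char::from_u32(u32::from(first)).expect("non-surrogate BMP value")),
    }
}
```

For \ud83d\udc51 the arithmetic works out to 0x10000 + (0x03D << 10) + 0x051 = 0x1F451, the crown emoji. In the actual lexer, the `second` value would come from attempting to read another \uXXXX escape only after a high surrogate has been seen.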
Environment
- Rust version: 1.86.0
- Library version: 3.7.2
- Platform: Linux
Additional Context
This issue affects any JSON containing emoji or other Unicode characters outside the Basic Multilingual Plane that are encoded as UTF-16 surrogate pairs. This is common in JSON data from web APIs, especially social media platforms.
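The JSON specification (RFC 8259, which obsoletes RFC 7159) supports \uXXXX escape sequences and represents characters outside the Basic Multilingual Plane as two consecutive escapes forming a UTF-16 surrogate pair. Many JSON generators encode emojis this way when targeting ASCII-safe output, for example Python's json.dumps with its default ensure_ascii=True. JavaScript's JSON.stringify, by contrast, emits well-formed emoji unescaped.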
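To illustrate where such input comes from, the following sketch shows how an ASCII-safe generator might produce these escapes. `escape_non_ascii` is a hypothetical helper, not taken from any particular library, and it is deliberately incomplete: it does not escape quotes, backslashes, or control characters as a full JSON serializer would.

```rust
/// Hypothetical helper: escapes every non-ASCII character as \uXXXX,
/// emitting a surrogate pair for characters beyond the BMP.
fn escape_non_ascii(s: &str) -> String {
    let mut out = String::new();
    let mut buf = [0u16; 2];
    for c in s.chars() {
        if c.is_ascii() {
            out.push(c);
        } else {
            // encode_utf16 writes one unit for BMP characters, two otherwise.
            for unit in c.encode_utf16(&mut buf).iter() {
                out.push_str(&format!("\\u{:04x}", unit));
            }
        }
    }
    out
}
```

With this sketch, the crown emoji (U+1F451) becomes the twelve characters \ud83d\udc51, which is exactly the input the parser currently rejects.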