JSON Parser Fails to Handle UTF-16 Surrogate Pairs (Emojis) #768

@fcostello85

Description

The JSON string parser currently fails when encountering UTF-16 surrogate pairs that represent Unicode characters outside the Basic Multilingual Plane (such as emojis). The parser only handles the first component of the surrogate pair and attempts to convert it directly to UTF-8, which results in an error since lone surrogates are invalid.

Expected Behavior

The parser should correctly handle UTF-16 surrogate pairs like \ud83d\udc51 (👑 crown emoji) by:

  1. Detecting high surrogates (0xD800-0xDBFF)
  2. Reading the corresponding low surrogate (0xDC00-0xDFFF)
  3. Converting the pair to the proper Unicode code point
  4. Successfully parsing the character
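
As a reference point for the expected result, the Rust standard library already performs this pairing in `char::decode_utf16`. The sketch below is only an illustration of the target behavior, not the library's lexer code; the helper name `decode_units` is hypothetical.

```rust
/// Hypothetical helper: decodes UTF-16 code units with the standard library.
/// Returns `None` if the input contains an unpaired surrogate.
fn decode_units(units: &[u16]) -> Option<String> {
    char::decode_utf16(units.iter().copied())
        // Each item is Ok(char) or Err(DecodeUtf16Error) for a lone surrogate.
        .collect::<Result<String, _>>()
        .ok()
}
```

With this helper, the code units 0xD83D, 0xDC51 decode to the single character U+1F451, while 0xD83D on its own is rejected.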

Current Behavior

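The parser fails with an error when it encounters the high surrogate \ud83d because it tries to convert it directly to a char. A Rust char must be a Unicode scalar value, and surrogate code points (0xD800-0xDFFF) are not scalar values, so the conversion is rejected.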
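
The failure can be reproduced with the standard library alone. This is a minimal sketch; `is_surrogate` is a hypothetical helper and not part of the library's API.

```rust
/// Hypothetical helper: true if `v` lies in the UTF-16 surrogate range
/// and therefore cannot be represented as a Rust `char`.
fn is_surrogate(v: u32) -> bool {
    (0xD800..=0xDFFF).contains(&v)
}

/// Converts a code point to a `char`, returning `None` for surrogates
/// and for values above 0x10FFFF (this is what `char::from_u32` does).
fn to_char(v: u32) -> Option<char> {
    char::from_u32(v)
}
```

Here `to_char(0xD83D)` yields `None`, which matches the conversion error the lexer reports for the first half of the pair.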

Minimal Example

// Test function
fn parse_json_string(input: &str) -> Result<String, LexerError> {
    let mut lexer = Lexer::new(input, ParserLanguage::Json);
    let mut r = String::new();
    while !lexer.eof() {
        r.push(lexer.next_json_char_value()?);
    }
    Ok(r)
}

// This should parse successfully but currently fails
let json_string = r#""\ud83d\udc51""#;  // Crown emoji: 👑
let result = parse_json_string(json_string);

// Currently: Err(LexerError::IncorrectUnicodeChar)
// Expected: Ok("👑")

Test Cases

// These should all work:
assert_eq!(parse_json_string(r#""\ud83d\ude00""#).unwrap(), "😀");  // Grinning face
assert_eq!(parse_json_string(r#""\ud83d\udc36""#).unwrap(), "🐶");  // Dog face
assert_eq!(parse_json_string(r#""\ud83d\udc51""#).unwrap(), "👑");  // Crown

// These should still fail (lone surrogates):
let str = parse_json_string(r#""\ud83d""#); // Lone high surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedLowSurrogate));

let str = parse_json_string(r#""\ud83d\""#); // Lone high surrogate with incomplete low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedLowSurrogate));

let str = parse_json_string(r#""\ud83d\a""#); // Lone high surrogate with invalid low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedLowSurrogate));

let str = parse_json_string(r#""\udc51""#); // Lone low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::ExpectedHighSurrogate));

let str = parse_json_string(r#""\ud83d\u0181""#); // Invalid low surrogate
assert!(str.is_err());
assert!(matches!(str.unwrap_err(), LexerError::InvalidLowSurrogate));

Root Cause

In the next_json_char_value function, the 'u' branch only handles single 16-bit Unicode escapes:

'u' => {
    let mut v = 0;
    for _ in 0..4 {
        let digit = self.next_hex_digit()?;
        v = v * 16 + digit;
    }
    Self::char_try_from(v)  // ← Fails for surrogates
}

Suggested Fix

The fix involves detecting surrogate pairs and handling them properly:

  1. Check if the parsed value is a high surrogate (0xD800-0xDBFF)
  2. If so, read the next \uXXXX sequence as the low surrogate
  3. Validate the low surrogate is in range (0xDC00-0xDFFF)
  4. Convert the pair using: 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF)
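
The validation and conversion steps above can be sketched as a standalone function. This is an illustration under stated assumptions rather than a patch for the lexer: `SurrogateError` and `combine_surrogates` are hypothetical names, and a real fix would map these cases onto the library's own `LexerError` variants and read the second \uXXXX escape from the input.

```rust
/// Hypothetical error type standing in for the lexer's error variants.
#[derive(Debug, PartialEq)]
enum SurrogateError {
    InvalidHighSurrogate,
    InvalidLowSurrogate,
}

/// Combines a UTF-16 surrogate pair into a single Unicode scalar value.
fn combine_surrogates(high: u32, low: u32) -> Result<char, SurrogateError> {
    // Step 1: the first unit must be a high surrogate.
    if !(0xD800..=0xDBFF).contains(&high) {
        return Err(SurrogateError::InvalidHighSurrogate);
    }
    // Step 3: the second unit must be a low surrogate.
    if !(0xDC00..=0xDFFF).contains(&low) {
        return Err(SurrogateError::InvalidLowSurrogate);
    }
    // Step 4: ten bits from each half, offset by 0x10000.
    let code_point = 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF);
    // The result always lies in 0x10000..=0x10FFFF, which is a valid char.
    Ok(char::from_u32(code_point).expect("valid pair yields a scalar value"))
}
```

For the crown emoji, high = 0xD83D contributes 0x3D << 10 = 0xF400 and low = 0xDC51 contributes 0x51, so the result is 0x10000 + 0xF400 + 0x51 = 0x1F451.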

Environment

  • Rust version: 1.86.0
  • Library version: 3.7.2
  • Platform: Linux

Additional Context

This issue affects any JSON containing emoji or other Unicode characters outside the Basic Multilingual Plane that are encoded as UTF-16 surrogate pairs. This is common in JSON data from web APIs, especially social media platforms.

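The JSON specification (RFC 8259, which obsoletes RFC 7159) supports \uXXXX escape sequences and specifies that a character outside the Basic Multilingual Plane, when escaped, is written as a 12-character sequence encoding its UTF-16 surrogate pair. JSON generators that target ASCII-safe output emit emojis this way, for example Python's json.dumps with its default ensure_ascii=True.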
