feature COMBINE_UNICODE_SURROGATES_IN_UTF8 doesn't work when custom characterEscape is used #1398

stackunderflow111 · 2025-02-04T11:47:38Z

Version: 2.18.0+

Hi!

I believe I have found a bug in the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature introduced in version 2.18: it has no effect when custom characterEscapes are configured.

An example:

    public static void main(String[] args) throws IOException {
        JsonFactory surrogatePairFactory = JsonFactory.builder()
                .build();
        JsonFactory utf8Factory = JsonFactory.builder()
                .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                .build();
        JsonFactory utf8FactoryWithCharacterEscapes = new JsonFactoryBuilder()
                .characterEscapes(JsonpCharacterEscapes.instance())
                .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                .build();
        System.out.println(writeEmoji(surrogatePairFactory));
        System.out.println(writeEmoji(utf8Factory));
        System.out.println(writeEmoji(utf8FactoryWithCharacterEscapes));
    }

    private static String writeEmoji(JsonFactory f) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (JsonGenerator gen = f.createGenerator(out)) {
            gen.writeStartObject();
            // 0x1F60A - emoji
            gen.writeStringField("test_emoji", new String(Character.toChars(0x1F60A)));
            gen.writeEndObject();
        }
        return out.toString(StandardCharsets.UTF_8);
    }

The output:

The third line (printed by utf8FactoryWithCharacterEscapes) is expected to be the same as the second line (printed by utf8Factory), but they differ.

The reason seems to be that when custom characterEscapes are configured, the code calls the two _writeCustomStringSegment2() methods (one of them is shown below), and neither checks the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature.

jackson-core/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java

Lines 1684 to 1739 in 9fcf1e7

    
    private final void _writeCustomStringSegment2(final char[] cbuf, int offset, final int end) throws IOException
    {
        // Ok: caller guarantees buffer can have room; but that may require flushing:
        if ((_outputTail + 6 * (end - offset)) > _outputEnd) {
            _flushBuffer();
        }
        int outputPtr = _outputTail;
        final byte[] outputBuffer = _outputBuffer;
        final int[] escCodes = _outputEscapes;
        // may or may not have this limit
        final int maxUnescaped = (_maximumNonEscapedChar <= 0) ? 0xFFFF : _maximumNonEscapedChar;
        final CharacterEscapes customEscapes = _characterEscapes; // non-null
        while (offset < end) {
            int ch = cbuf[offset++];
            if (ch <= 0x7F) {
                if (escCodes[ch] == 0) {
                    outputBuffer[outputPtr++] = (byte) ch;
                    continue;
                }
                int escape = escCodes[ch];
                if (escape > 0) { // 2-char escape, fine
                    outputBuffer[outputPtr++] = BYTE_BACKSLASH;
                    outputBuffer[outputPtr++] = (byte) escape;
                } else if (escape == CharacterEscapes.ESCAPE_CUSTOM) {
                    SerializableString esc = customEscapes.getEscapeSequence(ch);
                    if (esc == null) {
                        _reportError("Invalid custom escape definitions; custom escape not found for character code 0x"
                                +Integer.toHexString(ch)+", although was supposed to have one");
                    }
                    outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                } else {
                    // ctrl-char, 6-byte escape...
                    outputPtr = _writeGenericEscape(ch, outputPtr);
                }
                continue;
            }
            if (ch > maxUnescaped) { // [JACKSON-102] Allow forced escaping if non-ASCII (etc) chars:
                outputPtr = _writeGenericEscape(ch, outputPtr);
                continue;
            }
            SerializableString esc = customEscapes.getEscapeSequence(ch);
            if (esc != null) {
                outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                continue;
            }
            if (ch <= 0x7FF) { // fine, just needs 2 byte output
                outputBuffer[outputPtr++] = (byte) (0xc0 | (ch >> 6));
                outputBuffer[outputPtr++] = (byte) (0x80 | (ch & 0x3f));
            } else {
                outputPtr = _outputMultiByteChar(ch, outputPtr);
            }
        }
        _outputTail = outputPtr;
    }

I believe the fix is easy: we can port the changes made in #1335 and #1360 to the two _writeCustomStringSegment2() methods. I am working on a pull request.
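
To make the idea concrete, here is a rough standalone sketch of the kind of surrogate check I have in mind. It is not the actual Jackson code and not the content of the pull request: the class SurrogateCombineSketch and its methods encode(), writeSurrogatePair() and writeUnicodeEscape() are hypothetical names made up for this illustration, and the combineSurrogates flag only stands in for the feature check. It assumes that, with the flag off, each surrogate is written as a \uXXXX escape, and with the flag on, a valid pair is merged into one code point and written as 4 UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in jackson-core.
class SurrogateCombineSketch {
    private static final byte[] HEX = "0123456789ABCDEF".getBytes(StandardCharsets.US_ASCII);

    // Merge a high/low surrogate pair into one code point and write it as 4 UTF-8 bytes.
    static int writeSurrogatePair(char high, char low, byte[] out, int ptr) {
        int cp = Character.toCodePoint(high, low);
        out[ptr++] = (byte) (0xF0 | (cp >> 18));
        out[ptr++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[ptr++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[ptr++] = (byte) (0x80 | (cp & 0x3F));
        return ptr;
    }

    // Write a single UTF-16 code unit as a 6-byte backslash-u escape.
    static int writeUnicodeEscape(int ch, byte[] out, int ptr) {
        out[ptr++] = '\\';
        out[ptr++] = 'u';
        out[ptr++] = HEX[(ch >> 12) & 0xF];
        out[ptr++] = HEX[(ch >> 8) & 0xF];
        out[ptr++] = HEX[(ch >> 4) & 0xF];
        out[ptr++] = HEX[ch & 0xF];
        return ptr;
    }

    // Simplified segment writer: no JSON escaping of quotes or control characters,
    // only the surrogate handling that the issue is about.
    static byte[] encode(String s, boolean combineSurrogates) {
        char[] cbuf = s.toCharArray();
        byte[] out = new byte[cbuf.length * 6]; // worst case: 6 bytes per char
        int ptr = 0;
        int offset = 0;
        while (offset < cbuf.length) {
            char ch = cbuf[offset++];
            if (ch <= 0x7F) {
                out[ptr++] = (byte) ch;
                continue;
            }
            if (Character.isSurrogate(ch)) {
                // This is the check the custom-escape path is missing.
                if (combineSurrogates && Character.isHighSurrogate(ch)
                        && offset < cbuf.length && Character.isLowSurrogate(cbuf[offset])) {
                    ptr = writeSurrogatePair(ch, cbuf[offset++], out, ptr);
                } else {
                    ptr = writeUnicodeEscape(ch, out, ptr);
                }
                continue;
            }
            if (ch <= 0x7FF) { // 2-byte output
                out[ptr++] = (byte) (0xC0 | (ch >> 6));
                out[ptr++] = (byte) (0x80 | (ch & 0x3F));
            } else { // 3-byte output
                out[ptr++] = (byte) (0xE0 | (ch >> 12));
                out[ptr++] = (byte) (0x80 | ((ch >> 6) & 0x3F));
                out[ptr++] = (byte) (0x80 | (ch & 0x3F));
            }
        }
        return Arrays.copyOf(out, ptr);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60A));
        System.out.println(new String(encode(emoji, false), StandardCharsets.UTF_8));
        System.out.println(new String(encode(emoji, true), StandardCharsets.UTF_8));
    }
}
```

In the real generator the equivalent branch would have to sit before the customEscapes.getEscapeSequence(ch) lookup and the _outputMultiByteChar() call, so that a surrogate pair is consumed as a unit; where exactly it goes is something the pull request will have to settle.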


stackunderflow111 mentioned this issue Feb 4, 2025

fix the surrogate utf8 feature when custom characterEscapes is used #1399

Merged

cowtowncoder added this to the 2.18.3 milestone Feb 5, 2025

cowtowncoder added the 2.18 Issues planned at earliest for 2.18 label Feb 5, 2025

cowtowncoder closed this as completed Feb 5, 2025
