feature COMBINE_UNICODE_SURROGATES_IN_UTF8 doesn't work when custom characterEscape is used #1398

Closed
stackunderflow111 opened this issue Feb 4, 2025 · 0 comments
Labels
2.18 Issues planned at earliest for 2.18
Version: 2.18.0+

Hi!

I believe I have found a bug in the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature introduced in version 2.18: it has no effect when custom characterEscapes are configured.

An example:

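    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonFactoryBuilder;
    import com.fasterxml.jackson.core.JsonGenerator;
    import com.fasterxml.jackson.core.json.JsonWriteFeature;
    import com.fasterxml.jackson.core.util.JsonpCharacterEscapes;

    // Wrapper class added so the snippet compiles on its own; the name is arbitrary.
    public class SurrogateEscapesRepro {
        public static void main(String[] args) throws IOException {
            // Default settings: the feature is not enabled
            JsonFactory surrogatePairFactory = JsonFactory.builder()
                    .build();
            // Feature enabled, default escapes
            JsonFactory utf8Factory = JsonFactory.builder()
                    .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                    .build();
            // Feature enabled, custom character escapes
            JsonFactory utf8FactoryWithCharacterEscapes = new JsonFactoryBuilder()
                    .characterEscapes(JsonpCharacterEscapes.instance())
                    .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                    .build();
            System.out.println(writeEmoji(surrogatePairFactory));
            System.out.println(writeEmoji(utf8Factory));
            System.out.println(writeEmoji(utf8FactoryWithCharacterEscapes));
        }

        private static String writeEmoji(JsonFactory f) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (JsonGenerator gen = f.createGenerator(out)) {
                gen.writeStartObject();
                // U+1F60A is an emoji outside the BMP, so it is a surrogate pair in a Java String
                gen.writeStringField("test_emoji", new String(Character.toChars(0x1F60A)));
                gen.writeEndObject();
            }
            return out.toString(StandardCharsets.UTF_8);
        }
    }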

The output:

[Image: screenshot of the three lines printed by the program]

The third line (printed by utf8FactoryWithCharacterEscapes) is expected to match the second line (printed by utf8Factory), but it does not.
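To make the two forms concrete without relying on the screenshot, here is a JDK-only sketch (no Jackson involved; the class and method names are hypothetical) that builds both representations of U+1F60A: the raw 4-byte UTF-8 sequence that the feature is meant to produce, and an escaped surrogate pair. It assumes that a generator which does not combine the pair escapes each surrogate separately with uppercase hex digits.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not Jackson code.
final class EmojiFormsSketch {

    // Renders each UTF-16 code unit of the code point as a backslash-u escape.
    static String escapedSurrogates(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    // Encodes the code point directly as UTF-8 (4 bytes for anything above U+FFFF).
    static byte[] rawUtf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(escapedSurrogates(0x1F60A)); // two escapes, 12 ASCII characters
        System.out.println(rawUtf8(0x1F60A).length);    // 4
    }
}
```

Both forms decode to the same character in a conforming JSON parser; the difference is only in the bytes written.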

The reason seems to be that when custom characterEscapes are used, the generator goes through the two _writeCustomStringSegment2() methods (the char[] variant is shown below), neither of which checks the COMBINE_UNICODE_SURROGATES_IN_UTF8 feature.

    private final void _writeCustomStringSegment2(final char[] cbuf, int offset, final int end) throws IOException
    {
        // Ok: caller guarantees buffer can have room; but that may require flushing:
        if ((_outputTail + 6 * (end - offset)) > _outputEnd) {
            _flushBuffer();
        }
        int outputPtr = _outputTail;
        final byte[] outputBuffer = _outputBuffer;
        final int[] escCodes = _outputEscapes;
        // may or may not have this limit
        final int maxUnescaped = (_maximumNonEscapedChar <= 0) ? 0xFFFF : _maximumNonEscapedChar;
        final CharacterEscapes customEscapes = _characterEscapes; // non-null
        while (offset < end) {
            int ch = cbuf[offset++];
            if (ch <= 0x7F) {
                if (escCodes[ch] == 0) {
                    outputBuffer[outputPtr++] = (byte) ch;
                    continue;
                }
                int escape = escCodes[ch];
                if (escape > 0) { // 2-char escape, fine
                    outputBuffer[outputPtr++] = BYTE_BACKSLASH;
                    outputBuffer[outputPtr++] = (byte) escape;
                } else if (escape == CharacterEscapes.ESCAPE_CUSTOM) {
                    SerializableString esc = customEscapes.getEscapeSequence(ch);
                    if (esc == null) {
                        _reportError("Invalid custom escape definitions; custom escape not found for character code 0x"
                                +Integer.toHexString(ch)+", although was supposed to have one");
                    }
                    outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                } else {
                    // ctrl-char, 6-byte escape...
                    outputPtr = _writeGenericEscape(ch, outputPtr);
                }
                continue;
            }
            if (ch > maxUnescaped) { // [JACKSON-102] Allow forced escaping if non-ASCII (etc) chars:
                outputPtr = _writeGenericEscape(ch, outputPtr);
                continue;
            }
            SerializableString esc = customEscapes.getEscapeSequence(ch);
            if (esc != null) {
                outputPtr = _writeCustomEscape(outputBuffer, outputPtr, esc, end-offset);
                continue;
            }
            if (ch <= 0x7FF) { // fine, just needs 2 byte output
                outputBuffer[outputPtr++] = (byte) (0xc0 | (ch >> 6));
                outputBuffer[outputPtr++] = (byte) (0x80 | (ch & 0x3f));
            } else {
                outputPtr = _outputMultiByteChar(ch, outputPtr);
            }
        }
        _outputTail = outputPtr;
    }

I believe the fix is easy: we can port the changes made in #1335 and #1360 to the two _writeCustomStringSegment2() methods. I am working on a pull request.
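For readers unfamiliar with surrogate combining, the general technique can be sketched in isolation. This is a standalone illustration, not the code from #1335 or #1360 and not Jackson's generator: the class name, the method, and the choice to throw on an unpaired surrogate are all assumptions made for the example. The idea is that, in the multi-byte branch, a high surrogate followed by a low surrogate is merged into one code point and written as a single 4-byte UTF-8 sequence.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical standalone sketch of surrogate combining; not Jackson's implementation.
final class SurrogateCombineSketch {

    // Encodes cbuf[offset..end) as UTF-8, merging valid surrogate pairs into 4-byte sequences.
    static byte[] encode(char[] cbuf, int offset, int end) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (offset < end) {
            int ch = cbuf[offset++];
            if (ch <= 0x7F) { // ASCII: 1 byte
                out.write(ch);
            } else if (ch <= 0x7FF) { // 2-byte sequence
                out.write(0xC0 | (ch >> 6));
                out.write(0x80 | (ch & 0x3F));
            } else if (Character.isHighSurrogate((char) ch)
                    && offset < end && Character.isLowSurrogate(cbuf[offset])) {
                // Valid pair: consume the low surrogate too and emit one 4-byte sequence
                int cp = Character.toCodePoint((char) ch, cbuf[offset++]);
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else if (Character.isSurrogate((char) ch)) {
                // Assumption for this sketch: reject unpaired surrogates outright
                throw new IllegalArgumentException(
                        "Unpaired surrogate 0x" + Integer.toHexString(ch));
            } else { // remaining BMP characters: 3-byte sequence
                out.write(0xE0 | (ch >> 12));
                out.write(0x80 | ((ch >> 6) & 0x3F));
                out.write(0x80 | (ch & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        char[] emoji = Character.toChars(0x1F60A);
        System.out.println(encode(emoji, 0, emoji.length).length); // 4
    }
}
```

In the real methods the check would additionally be gated on whether COMBINE_UNICODE_SURROGATES_IN_UTF8 is enabled, and it would have to run after the custom-escape lookup so that user-defined escapes still take precedence.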

@cowtowncoder cowtowncoder added this to the 2.18.3 milestone Feb 5, 2025
@cowtowncoder cowtowncoder added the 2.18 Issues planned at earliest for 2.18 label Feb 5, 2025