Shared keys can cause unescaped write of BYTE_MARKER_END_OF_CONTENT #18

Closed
bleskes opened this issue Jun 17, 2014 · 11 comments

@bleskes

bleskes commented Jun 17, 2014

If you write more than 256 shared keys, the output stream contains a -1 byte (0xFF) even though it's not the end of the document. This is a problem for us, as we rely on the -1 marker to separate documents in a binary stream without needing to parse them.

Here is a failing JUnit reproduction:

    import java.io.IOException;
    import java.io.OutputStream;

    import org.junit.Test;

    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import com.fasterxml.jackson.dataformat.smile.SmileGenerator;

    import static org.junit.Assert.fail;

    @Test
    public void testJacksonEndOfDocByte() throws Exception {
        final String[] currentField = new String[1];
        // Fail as soon as the reserved end-of-content marker (0xFF, i.e. -1)
        // shows up in the output, since the document is still being written.
        OutputStream outputStream = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                if ((byte) b == -1) {
                    fail("got end of doc marker while writing " + currentField[0]);
                }
            }
        };

        SmileGenerator generator = new SmileFactory().createGenerator(outputStream);
        generator.writeStartObject();
        generator.writeFieldName("a");
        generator.writeStartObject();
        // Enough distinct field names to overflow into the two-byte
        // ("long") shared-key back-reference range.
        for (int i = 0; i < 300; i++) {
            currentField[0] = "f_" + i;
            generator.writeNumberField(currentField[0], i);
            generator.flush();
        }
        currentField[0] = "";
        generator.writeEndObject();
        generator.writeFieldName("b");
        generator.writeStartObject();
        // Repeating the same names forces shared-key back-references to be emitted.
        for (int i = 0; i < 300; i++) {
            currentField[0] = "f_" + i;
            generator.writeNumberField(currentField[0], i);
            generator.flush();
        }
        currentField[0] = "";
        generator.writeEndObject();
        generator.writeEndObject();
    }

I believe the problem lies in this line, which writes an unescaped byte as the second parameter:

    _writeBytes(((byte) (TOKEN_PREFIX_KEY_SHARED_LONG + (ix >> 8))), (byte) ix);
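
For illustration: whenever the shared-key index ix has 0xFF as its low byte, that second byte is exactly the reserved end-of-content marker. Here is a minimal sketch of the collision (class and constant names are illustrative, not the actual SmileGenerator internals):

    public class SharedKeyCollision {
        // 0xFF is reserved by the Smile format as the end-of-content marker.
        static final int BYTE_MARKER_END_OF_CONTENT = 0xFF;

        public static void main(String[] args) {
            // Long shared-key back-references spread the index over two bytes;
            // the second byte is the raw low 8 bits of the index.
            for (int ix = 64; ix < 1024; ix++) {
                byte second = (byte) ix;
                if ((second & 0xFF) == BYTE_MARKER_END_OF_CONTENT) {
                    // Prints 255, 511, 767 and 1023: any of these indices
                    // puts an unescaped 0xFF into the stream.
                    System.out.println("colliding index: " + ix);
                }
            }
        }
    }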

@cowtowncoder
Member

Thank you for reporting this; -1 (i.e. 0xFF) definitely should not get written as it is reserved.

@cowtowncoder
Member

Also: I am guessing that since that byte acts as an end-of-input marker, it may also avoid getting caught by unit tests... nasty. I will need to see if this also occurs with 2.3; if so, I obviously need to backport the fix for 2.3.4 as well.

@bleskes
Author

bleskes commented Jun 17, 2014

Thanks for looking into it. I actually ran into it on the 2.3 branch.


@cowtowncoder
Member

@bleskes Ok. I am guessing it probably exists in 2.4, but we'll see.

@cowtowncoder
Member

Hmmh. I could not reproduce the issue immediately. But I have one thought on the problem: since -1 is indeed the end marker, is it possible that JsonGenerator.WRITE_END_MARKER might be enabled (or mis-interpreted)? That would write the end marker, either accidentally or incidentally, and the parser would not notice.

The test I tried actually uses SmileParser rather than just checking bytes, so I will continue trying to reproduce the issue.
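
For what it's worth, here is a quick sketch of how that hypothesis could be ruled out (hypothetical class name; it assumes the SmileGenerator.Feature.WRITE_END_MARKER generator feature configured via SmileFactory, which is off by default):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import com.fasterxml.jackson.dataformat.smile.SmileGenerator;

    public class EndMarkerCheck {
        public static void main(String[] args) throws IOException {
            SmileFactory f = new SmileFactory();
            // If the stray 0xFF were a legitimately written end marker,
            // disabling the feature would make it disappear; in this bug
            // the 0xFF shows up regardless.
            f.disable(SmileGenerator.Feature.WRITE_END_MARKER);
            SmileGenerator gen = f.createGenerator(new ByteArrayOutputStream());
            gen.writeStartObject();
            gen.writeEndObject();
            gen.close();
        }
    }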

@cowtowncoder
Member

.... ok. Interesting; yes, I can actually see 0xFF there in the middle. So I can reproduce it, although the parser had no trouble with it. Fascinating...

cowtowncoder added a commit that referenced this issue Jun 17, 2014
@cowtowncoder
Member

Ouch. This just went from unfortunate to really, really bad. I need to triple-check, but it is possible this is a flaw in the specification itself, if it does not reserve the value. While it is possible for the parser to be forgiving, it will be tricky to figure out a way to have seamless support between old and new implementations.

@cowtowncoder
Member

So: basically, the format would need to reserve the byte range for long back-references as well. The problem here is that while a new parser can accept old values (for backwards compatibility), and we also have the format's version number to use (which we probably now have to start using), it will require careful choreography to make things work in the least intrusive manner.

It is possible that there are some combinations that won't work (new encoder with old decoder, most likely), but what I do not want are mysterious failures half-way through processing.
If, using a version bump, we can produce a solid up-front exception, that may be the best thing to do. But to connect things, it should be possible to have a transitional codec version (2.4.x) that can bridge the gap: accept both new and old format versions, but produce only one of them (defaulting to old at first, with a Feature to enable the new), and then, as a later step, migrate to the new version.
That may sound complicated (and it is), but it is necessary to be very careful regarding changes to the actual encoding, and backwards compatibility.

@cowtowncoder
Member

@bleskes: Actually, I think I figured out a simple(r) short-term solution to the encoding problem. It will not resolve the problem of existing data that may contain unexpected markers, but it will stop the generation of such data, without requiring any changes to decoders.

So: the change needed is to add special handling for cases where an invalid byte marker would be produced as a back-reference. We cannot omit writing the back-referenceable values themselves (both because they are actually needed as first-time values, and to be able to skip the "invalid" index itself); but we can omit adding the value to the back-referenceable lookup map. Given this, such a back-reference will never be used, and no invalid byte value will be written.

The downside of such a change is that values landing in these slots will end up being duplicated, even where we could theoretically use a back-reference. But this is a minor sub-optimality; in fact, encoders are not required to have 100% reliable detection of back-referenceable keys or values.
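
To make that concrete, here is a minimal sketch of the book-keeping involved (hypothetical class and method names, not the actual SmileGenerator internals): the index slot is still consumed, so encoder and decoder tables stay aligned, but a name whose index would encode the reserved byte is never offered for back-referencing.

    import java.util.HashMap;
    import java.util.Map;

    class SharedNameTable {
        private final Map<String, Integer> nameToIndex = new HashMap<>();
        private int nextIndex = 0;

        /** Back-reference index for a previously registered name, or -1 if none. */
        int findIndex(String name) {
            Integer ix = nameToIndex.get(name);
            return (ix == null) ? -1 : ix;
        }

        /** Called for every first-time name written; this mirrors what a
         *  decoder does when it sees a new name, so indexes stay aligned. */
        void register(String name) {
            int ix = nextIndex++;
            // Skip the lookup-map entry when the low byte of the index is
            // 0xFF: the slot is burned, and the name is simply written in
            // full (and re-registered) the next time it occurs.
            if ((ix & 0xFF) == 0xFF) {
                return;
            }
            nameToIndex.put(name, ix);
        }
    }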

Whether there should be a change to the specification itself is an open question; the first step could be to document the need for this work-around for generators. It should then be possible to define "alternate" codes, as a sort of alias, to refer to blocked values (ones that cannot be used because they would result in an invalid byte). But doing that will require updating the version value, and a bit of future-proofing in the codec.

cowtowncoder added a commit that referenced this issue Jul 16, 2014
cowtowncoder added a commit that referenced this issue Jul 16, 2014
@cowtowncoder cowtowncoder added this to the 2.4.1.1 milestone Jul 16, 2014
@bleskes
Author

bleskes commented Jul 16, 2014

Nice! I think it's a very good, pragmatic solution. I agree it's not "clean" from a spec perspective, but it's effective and simple.

@cowtowncoder
Member

Yes. I need to make sure to update the specification to mention this issue and this work-around, so that other codecs can handle it too.

kimchy added a commit to kimchy/elasticsearch that referenced this issue Aug 19, 2014
kimchy added a commit to elastic/elasticsearch that referenced this issue Aug 19, 2014
kimchy added a commit to elastic/elasticsearch that referenced this issue Aug 19, 2014
kimchy added a commit to elastic/elasticsearch that referenced this issue Sep 8, 2014