Shared keys can cause unescaped write of BYTE_MARKER_END_OF_CONTENT #18

Closed
bleskes opened this issue Jun 17, 2014 · 11 comments

@bleskes

bleskes commented Jun 17, 2014

If you write more than 256 shared keys, the output stream contains a -1 byte (0xFF) even though it's not the end of the document. This is a problem for us, as we rely on the -1 marker to separate documents in a binary stream without needing to parse them.

Here is a failing JUnit reproduction:

    import java.io.IOException;
    import java.io.OutputStream;

    import org.junit.Test;

    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import com.fasterxml.jackson.dataformat.smile.SmileGenerator;

    import static org.junit.Assert.fail;

    @Test
    public void testJacksonEndOfDocByte() throws Exception {
        final String[] currentField = new String[1];
        // Fail as soon as the reserved end-of-content marker (0xFF, i.e. -1)
        // shows up in the output, since the document is still being written.
        OutputStream outputStream = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                if ((byte) b == -1) {
                    fail("got end of doc marker while writing " + currentField[0]);
                }
            }
        };

        SmileGenerator generator = new SmileFactory().createGenerator(outputStream);
        generator.writeStartObject();
        generator.writeFieldName("a");
        generator.writeStartObject();
        // Enough distinct field names to overflow into the two-byte
        // ("long") shared-key back-reference range.
        for (int i = 0; i < 300; i++) {
            currentField[0] = "f_" + i;
            generator.writeNumberField(currentField[0], i);
            generator.flush();
        }
        currentField[0] = "";
        generator.writeEndObject();
        generator.writeFieldName("b");
        generator.writeStartObject();
        // Repeating the same names forces shared-key back-references to be emitted.
        for (int i = 0; i < 300; i++) {
            currentField[0] = "f_" + i;
            generator.writeNumberField(currentField[0], i);
            generator.flush();
        }
        currentField[0] = "";
        generator.writeEndObject();
        generator.writeEndObject();
    }

I believe the problem lies in this line, which writes an unescaped byte as the second parameter:

    _writeBytes(((byte) (TOKEN_PREFIX_KEY_SHARED_LONG + (ix >> 8))), (byte) ix);
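
For illustration: whenever the shared-key index ix has 0xFF as its low byte, that second byte is exactly the reserved end-of-content marker. Here is a minimal sketch of the collision (class and constant names are illustrative, not the actual SmileGenerator internals):

    public class SharedKeyCollision {
        // 0xFF is reserved by the Smile format as the end-of-content marker.
        static final int BYTE_MARKER_END_OF_CONTENT = 0xFF;

        public static void main(String[] args) {
            // Long shared-key back-references spread the index over two bytes;
            // the second byte is the raw low 8 bits of the index.
            for (int ix = 64; ix < 1024; ix++) {
                byte second = (byte) ix;
                if ((second & 0xFF) == BYTE_MARKER_END_OF_CONTENT) {
                    // Prints 255, 511, 767 and 1023: any of these indices
                    // puts an unescaped 0xFF into the stream.
                    System.out.println("colliding index: " + ix);
                }
            }
        }
    }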

@cowtowncoder
Member

Thank you for reporting this; -1 (i.e. 0xFF) definitely should not get written as it is reserved.

@cowtowncoder
Member

Also: I am guessing that since that byte acts as an end-of-input marker, it may also avoid getting caught by unit tests... nasty. I will need to see if this also occurs with 2.3; if so, I obviously need to backport the fix for 2.3.4 as well.

@bleskes
Author

bleskes commented Jun 17, 2014

Thanks for looking into it. I actually ran into it on the 2.3 branch.


@cowtowncoder
Member

@bleskes Ok. I am guessing it probably exists in 2.4, but we'll see.

@cowtowncoder
Member

Hmmh. I could not reproduce the issue immediately. But I have one thought on the problem: since -1 is indeed the end marker, is it possible that JsonGenerator.WRITE_END_MARKER might be enabled (or mis-interpreted)? That would write the end marker, either accidentally or incidentally, and the parser would not notice.

The test I tried actually uses SmileParser rather than just checking bytes, so I will continue trying to reproduce the issue.
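
For what it's worth, here is a quick sketch of how that hypothesis could be ruled out (hypothetical class name; it assumes the SmileGenerator.Feature.WRITE_END_MARKER generator feature configured via SmileFactory, which is off by default):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import com.fasterxml.jackson.dataformat.smile.SmileGenerator;

    public class EndMarkerCheck {
        public static void main(String[] args) throws IOException {
            SmileFactory f = new SmileFactory();
            // If the stray 0xFF were a legitimately written end marker,
            // disabling the feature would make it disappear; in this bug
            // the 0xFF shows up regardless.
            f.disable(SmileGenerator.Feature.WRITE_END_MARKER);
            SmileGenerator gen = f.createGenerator(new ByteArrayOutputStream());
            gen.writeStartObject();
            gen.writeEndObject();
            gen.close();
        }
    }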

@cowtowncoder
Member

.... ok. Interesting; yes, I can actually see 0xFF there in the middle. So I can reproduce it, although the parser had no trouble with it. Fascinating...

cowtowncoder added a commit that referenced this issue Jun 17, 2014
@cowtowncoder
Member

Ouch. This just went from unfortunate to really, really bad. I need to triple-check, but it is possible this is a flaw in the specification itself, if it does not reserve the value. While it is possible for the parser to be forgiving, it will be tricky to figure out a way to have seamless support between old and new implementations.

@cowtowncoder
Member

So: basically, the format would need to reserve the byte range for long back-references as well. The problem here is that while a new parser can accept old values (for backwards compatibility), and we also have the format's version number to use (which we probably now have to start using), it will require careful choreography to make things work in the least intrusive manner.

It is possible that there are some combinations that won't work (new encoder with old decoder, most likely), but what I do not want are mysterious failures half-way through processing.
If, using a version bump, we can produce a solid up-front exception, that may be the best thing to do. But to connect things, it should be possible to have a transitional codec version (2.4.x) that can bridge the gap: accept both new and old format versions, but produce only one of them (defaulting to old at first, with a Feature to enable the new), and then, as a later step, migrate to the new version.
That may sound complicated (and it is), but it is necessary to be very careful regarding changes to the actual encoding, and backwards compatibility.

@cowtowncoder
Member

@bleskes: Actually, I think I figured out a simple(r) short-term solution to the encoding problem. It will not resolve the problem of existing data that may contain unexpected markers, but it will stop the generation of such data, without requiring any changes to decoders.

So: the change needed is to add special handling for cases where an invalid byte marker would be produced as a back-reference. We cannot omit writing the back-referenceable values themselves (both because they are actually needed as first-time values, and to be able to skip the "invalid" index itself); but we can omit adding the value to the back-referenceable lookup map. Given this, such a back-reference will never be used, and no invalid byte value will be written.

The downside of such a change is that values landing in these slots will end up being duplicated, even where we could theoretically use a back-reference. But this is a minor sub-optimality; in fact, encoders are not required to have 100% reliable detection of back-referenceable keys or values.
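
To make that concrete, here is a minimal sketch of the book-keeping involved (hypothetical class and method names, not the actual SmileGenerator internals): the index slot is still consumed, so encoder and decoder tables stay aligned, but a name whose index would encode the reserved byte is never offered for back-referencing.

    import java.util.HashMap;
    import java.util.Map;

    class SharedNameTable {
        private final Map<String, Integer> nameToIndex = new HashMap<>();
        private int nextIndex = 0;

        /** Back-reference index for a previously registered name, or -1 if none. */
        int findIndex(String name) {
            Integer ix = nameToIndex.get(name);
            return (ix == null) ? -1 : ix;
        }

        /** Called for every first-time name written; this mirrors what a
         *  decoder does when it sees a new name, so indexes stay aligned. */
        void register(String name) {
            int ix = nextIndex++;
            // Skip the lookup-map entry when the low byte of the index is
            // 0xFF: the slot is burned, and the name is simply written in
            // full (and re-registered) the next time it occurs.
            if ((ix & 0xFF) == 0xFF) {
                return;
            }
            nameToIndex.put(name, ix);
        }
    }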

Whether there should be a change to the specification itself is an open question; the first step could be to document the need for this work-around for generators. It should then be possible to define "alternate" codes, as a sort of alias, to refer to blocked values (ones that cannot be used because they would result in an invalid byte). But doing that will require updating the version value, and a bit of future-proofing in the codec.

cowtowncoder added a commit that referenced this issue Jul 16, 2014
cowtowncoder added a commit that referenced this issue Jul 16, 2014
@cowtowncoder cowtowncoder added this to the 2.4.1.1 milestone Jul 16, 2014
@bleskes
Author

bleskes commented Jul 16, 2014

Nice! I think it's a very good, pragmatic solution. I agree it's not "clean" from a spec perspective, but it's effective and simple.

@cowtowncoder
Member

Yes. I need to make sure to update the specification to mention this issue and this work-around, so that other codecs can handle it too.

kimchy added a commit to kimchy/elasticsearch that referenced this issue Aug 19, 2014
kimchy added a commit to elastic/elasticsearch that referenced this issue Aug 19, 2014
kimchy added a commit to elastic/elasticsearch that referenced this issue Aug 19, 2014
kimchy added a commit to elastic/elasticsearch that referenced this issue Sep 8, 2014