
clp-s: Improve error reporting and error handling for JSON parsing during compression #310

Merged · 10 commits merged into y-scope:main · Mar 16, 2024

Conversation

@gibber9809 (Contributor) commented Mar 2, 2024

References

This PR may directly solve #289; if not, the improved error reporting should at least make the underlying issue more obvious.

Description

The changes in this PR aim to

  1. Improve error messages when compression fails on a parsing error
  2. Always be explicit about parsing failures, and have clp-s exit with an error code whenever parsing fails
  3. Improve the handling of an edge case involving the working buffer ending halfway through a UTF-8 character

The first and second points are mostly addressed by checking for simdjson errors more carefully and passing them along to the caller.
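As a toy illustration of that check-and-propagate pattern (this helper, the spdlog call, and all names here are assumptions for illustration, not the PR's actual code):

```cpp
#include <simdjson.h>
#include <spdlog/spdlog.h>

// Hypothetical helper: surface a simdjson error with a readable message and
// propagate it to the caller instead of silently continuing.
template <typename T>
bool check_and_report(simdjson::simdjson_result<T>& result, simdjson::error_code& error_out) {
    if (simdjson::SUCCESS != result.error()) {
        SPDLOG_ERROR("JSON parsing failed: {}", simdjson::error_message(result.error()));
        error_out = result.error();
        return false;
    }
    return true;
}
```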

For the third point: each time we start parsing a valid JSON document, we get simdjson to tell us the position of the start of the following JSON document. With simdjson's parse_many feature, simdjson provides an iterator over all valid JSON documents it finds in a provided buffer and reports any leftover unparsed data via the truncated_bytes() function; this handles cases where a buffer is split halfway through a JSON document. Generally this works. However, if the buffer terminates halfway through a UTF-8 sequence, simdjson provides an iterator to the incomplete document at the end of the stream with UTF8_ERROR set, and truncated_bytes() returns 0. In this case we calculate the number of truncated bytes ourselves, using our running notion of the start of the next JSON document.

Previously we handled this case by seeking backwards from the end of the buffer until we found a newline character, which was more error-prone (e.g. it only works for ndjson).
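A minimal sketch of the flow described above, using simdjson's DOM parse_many API; the bookkeeping mirrors the description, while the function name and the ingestion hook are hypothetical, not the PR's code:

```cpp
#include <cstddef>
#include <cstdint>

#include <simdjson.h>

// Returns the number of bytes at the end of the buffer that belong to an
// incomplete trailing document, or SIZE_MAX on a hard parsing error.
size_t count_truncated_bytes(simdjson::padded_string const& buf) {
    simdjson::dom::parser parser;
    simdjson::dom::document_stream stream;
    if (auto err = parser.parse_many(buf).get(stream); simdjson::SUCCESS != err) {
        return SIZE_MAX;
    }

    size_t next_doc_start = 0;
    bool utf8_error_at_end = false;
    for (auto it = stream.begin(); it != stream.end(); ++it) {
        auto doc = *it;
        if (simdjson::SUCCESS != doc.error()) {
            // A UTF8_ERROR on the final document may only mean the buffer was
            // cut mid-character; anything else is a real parsing failure.
            if (simdjson::error_code::UTF8_ERROR != doc.error()) {
                return SIZE_MAX;
            }
            utf8_error_at_end = true;
            break;
        }
        // Running notion of where the next document starts: the offset of the
        // current document plus its length (source() excludes the delimiting
        // whitespace, so this is approximate).
        next_doc_start = it.current_index() + it.source().size();
        // ... hand the parsed document to the ingestion path here ...
    }

    size_t truncated = stream.truncated_bytes();
    if (0 == truncated && utf8_error_at_end) {
        // Buffer ended halfway through a UTF-8 sequence: simdjson reports
        // zero truncated bytes, so fall back to our own bookkeeping.
        truncated = buf.size() - next_doc_start;
    }
    return truncated;
}
```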

Validation performed

  • Validated that compression works correctly for the cockroachdb dataset (which contains a lot of Unicode)
  • Validated that compression works correctly for several other key datasets
  • Validated that errors are reported and compression fails for some toy invalid examples (a log file that starts with non-JSON data and a log file with non-JSON data partway through)
  • Validated that there is no noticeable impact on compression speed

@gibber9809 gibber9809 marked this pull request as draft March 2, 2024 23:43
@gibber9809 gibber9809 marked this pull request as ready for review March 4, 2024 19:32
@gibber9809 gibber9809 requested a review from wraymo March 4, 2024 19:32
@wraymo (Contributor) left a comment:

Nice work! Most of the code looks good. I do have one concern, though, regarding how we handle UTF8_ERROR.

@@ -3,6 +3,9 @@
#include <iostream>
#include <stack>

#include <simdjson.h>
@wraymo (Contributor): Do we need to include this file?

@gibber9809 (Contributor, Author): Technically no, but I figure we use so many simdjson methods and classes in this file that we really should.

}

if (false == read_new_json()) {
    return false;
@wraymo (Contributor):
What if we get UTF8_ERROR, set m_truncated_bytes, and call read_new_json(), which returns true but encounters UTF8_ERROR again, so that m_doc_it != m_stream.end() and get_json() returns false? As a result, we just exit the loop in JsonParser.cpp without parsing the new log. But if we kept doubling the buffer size, we could actually get a complete JSON log.

@gibber9809 (Contributor, Author):

Good catch. This is somewhat annoying to handle correctly, but my instinct is that we can turn the core of the get_json function into another while loop and get rid of the behaviour where we can exit successfully at the start of the function or after the read_new_json call at the end of the function.
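For concreteness, one possible shape of that rework. This is a sketch that reuses the names from this thread (get_json, read_new_json, m_doc_it, m_stream) as stand-ins; it is not the merged implementation:

```cpp
#include <simdjson.h>

// Stand-in for the reader type discussed above.
struct JsonReader {
    simdjson::ondemand::document_stream m_stream;
    simdjson::ondemand::document_stream::iterator m_doc_it;
    simdjson::error_code m_error_code{simdjson::SUCCESS};

    // Stub: refill (and, if needed, double) the working buffer, re-run the
    // parser, and reset m_doc_it; return false when no more data is available.
    bool read_new_json() { return false; }

    // Loop until a complete document is available or a hard error occurs,
    // instead of returning early at the start of the function or after a
    // single read_new_json() call at the end.
    bool get_json() {
        while (true) {
            if (m_doc_it != m_stream.end()) {
                auto doc = *m_doc_it;
                if (simdjson::SUCCESS == doc.error()) {
                    return true;  // A complete document is ready to consume.
                }
                if (simdjson::error_code::UTF8_ERROR != doc.error()) {
                    m_error_code = doc.error();
                    return false;  // Hard parsing error: give up.
                }
                // UTF8_ERROR on the final document may just be truncation;
                // fall through and retry with more data.
            }
            if (false == read_new_json()) {
                return false;  // No more data, or the read itself failed.
            }
        }
    }
};
```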

++m_doc_it;
// UTF8 error wasn't at the end of the stream -- terminate
if (m_doc_it != m_stream.end()) {
    m_error_code = simdjson::error_code::UTF8_ERROR;
@wraymo (Contributor) commented Mar 8, 2024:

Just curious, how can the code enter this branch?

@gibber9809 (Contributor, Author):

The intention was for this code to execute once we've hit a UTF8 error but don't yet know whether it's in the middle of the stream or at the end of the stream (and possibly just a truncated document). I added the second way of setting truncated bytes (to fix a bug) after I wrote this code block, and I'm realizing that this flow is pretty confusing now.

I think I'll rewrite this to be clearer/more explicit about what it's trying to do.

@wraymo (Contributor):

Oh yeah, that's right.

@gibber9809 gibber9809 requested a review from wraymo March 11, 2024 22:11
Comment on lines 133 to 135
if (m_doc_it != m_stream.end()) {
    // Regular UTF-8 error in the middle of the document stream.
    m_error_code = simdjson::error_code::UTF8_ERROR;
@wraymo (Contributor) commented Mar 13, 2024:

I'm not sure it's a case of a UTF-8 error in the middle of the document stream. For "{\"a\"\xff: 1, \"b\": 2}\n{\"a\": 1, \"b\": 2}\n", simdjson reports UTF8_ERROR, and on error cases ++ just sets finished to true, which results in m_doc_it == m_stream.end(). Can you come up with a case of such a regular UTF-8 error in the middle of the document stream?

@gibber9809 (Contributor, Author) commented Mar 13, 2024:

Yeah, I'm worried about the case where a developer creates a string field with an invalid UTF-8 character (e.g. they copy the hex for some special character off the web incorrectly). If incrementing the iterator always leads to the end iterator on error, we'll have to approach this case differently. Maybe we could add a "maximum document size" configuration option with a default of, e.g., 200 MB, then report a UTF8 error if either EOF is hit or the buffer size exceeds the maximum document size? (And also introduce another new error for the document size being too large?)

I'm open to other ideas too though.
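A rough sketch of that proposal. The 200 MB constant, the function, and the use of simdjson's CAPACITY code as a stand-in for a new "document too large" error are all hypothetical, not clp-s code:

```cpp
#include <cstddef>

#include <simdjson.h>

// Hypothetical cap on how large a single document may grow before we stop
// doubling the buffer; 200 MB default, as floated above.
constexpr size_t cMaxDocumentSize = 200ULL * 1024 * 1024;

// Decide whether to keep doubling the buffer after a UTF8_ERROR at the end
// of the stream. Returns true if the caller should grow the buffer and retry.
bool should_retry_with_larger_buffer(
        size_t& buf_size,
        bool eof,
        simdjson::error_code& error_out
) {
    if (eof) {
        // The file genuinely ends mid-character: report the UTF-8 error.
        error_out = simdjson::error_code::UTF8_ERROR;
        return false;
    }
    if (buf_size >= cMaxDocumentSize) {
        // The "document" has outgrown any plausible size; CAPACITY is used
        // here as a stand-in for a new "document too large" error.
        error_out = simdjson::error_code::CAPACITY;
        return false;
    }
    buf_size *= 2;  // Keep doubling and re-parse.
    return true;
}
```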

@wraymo (Contributor):

Yeah, I'm a little bit worried about the memory usage and efficiency. From what I've noticed, this if branch cannot catch such errors, and it will keep doubling the buffer size until reaching the end of the file. If we cannot distinguish different UTF-8 errors from simdjson, the "maximum document size" configuration option could work.

@gibber9809 gibber9809 requested a review from wraymo March 15, 2024 15:41
@wraymo (Contributor) left a comment:

It seems that when encountering parsing errors, clp-s reports something like "2024-03-15T13:42:50.213-04:00 [error] FileWriter not closed before being destroyed - may cause data loss" because we don't close the ArchiveWriter.

@gibber9809 (Contributor, Author):

> It seems that when encountering parsing errors, clp-s reports something like "2024-03-15T13:42:50.213-04:00 [error] FileWriter not closed before being destroyed - may cause data loss" because we don't close the ArchiveWriter.

Yeah, I can have it close the ArchiveWriter for now, but we may want to throw out data for this archive, and may want to avoid recording this archive in the metadata db.

@wraymo (Contributor) commented Mar 15, 2024:

> Yeah, I can have it close the ArchiveWriter for now, but we may want to throw out data for this archive, and may want to avoid recording this archive in the metadata db.

That's right.

@gibber9809 gibber9809 requested a review from wraymo March 15, 2024 18:59
@wraymo (Contributor) left a comment:

Seems that the error message is still printed for TimestampDictionaryWriter because m_archive_writer->close() only closes the local timestamp dictionary. Do you want to fix this problem now, or merge the no-subarchive PR first (which needs your approval)?

@gibber9809 (Contributor, Author):

> Seems that the error message is still printed for TimestampDictionaryWriter because m_archive_writer->close() only closes the local timestamp dictionary. Do you want to fix this problem now, or merge the no-subarchive PR first (which needs your approval)?

Sure, let's merge no-subarchive first. I don't mind handling the merge conflicts.

@wraymo (Contributor) left a comment:

The commit message looks good to me.

@gibber9809 gibber9809 merged commit dcdf03b into y-scope:main Mar 16, 2024
10 checks passed