clp-s: Improve error reporting and error handling for JSON parsing during compression #310
Conversation
Nice work! Most of the code looks good. I do have one concern, though, regarding how we handle UTF8_ERROR.
@@ -3,6 +3,9 @@
#include <iostream>
#include <stack>

#include <simdjson.h>
Do we need to include this file?
Technically no, but I figure we use so many simdjson methods and classes in this file that we really should.
}

if (false == read_new_json()) {
    return false;
What if we get UTF8_ERROR, set m_truncated_bytes, and call read_new_json(), which returns true but encounters UTF8_ERROR again, so that m_doc_it != m_stream.end() and get_json() returns false? In that case we just exit the loop in JsonParser.cpp without parsing the new log, even though we could get a complete JSON log if we kept doubling the buffer size.
Good catch. This is somewhat annoying to handle correctly, but my instinct is that we can turn the core of the get_json function into another while loop and get rid of the behaviour where we can exit successfully at the start of the function or after the read_new_json call at the end of the function.
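To make that concrete, here is a rough, self-contained sketch of the single-loop shape I have in mind. Every name below (read_new_json, current_document_ok, the toy buffer-size bookkeeping) is a hypothetical stand-in for the real JsonFileIterator members, not the actual patch:

#include <cstddef>
#include <iostream>

// Toy stand-ins for the real iterator state; the names and values in this
// namespace are simplifications for illustration only.
namespace {
std::size_t buffer_size{16};
constexpr std::size_t cDocumentSize{100};  // bytes needed for the next document
constexpr std::size_t cFileSize{1024};     // bytes remaining in the input

// Doubles the buffer, as read_new_json() does on truncation; returns false
// once the buffer already covers all remaining input (i.e. EOF was hit).
bool read_new_json() {
    if (buffer_size >= cFileSize) {
        return false;
    }
    buffer_size *= 2;
    return true;
}

// True once the buffer is large enough to hold a complete document.
bool current_document_ok() {
    return buffer_size >= cDocumentSize;
}
}  // namespace

// The proposed single-loop get_json(): keep retrying with a larger buffer
// until we either land on a complete document or truly run out of input,
// instead of exiting successfully at the start of the function or right
// after a single read_new_json() call.
bool get_json() {
    while (true) {
        if (current_document_ok()) {
            return true;
        }
        if (false == read_new_json()) {
            return false;
        }
    }
}

int main() {
    std::cout << (get_json() ? "got document" : "failed") << " (buffer size "
              << buffer_size << ")\n";
}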
++m_doc_it;
// UTF8 error wasn't at the end of the stream -- terminate
if (m_doc_it != m_stream.end()) {
    m_error_code = simdjson::error_code::UTF8_ERROR;
Just curious, how can the code enter this branch?
The intention was for this code to execute once we've hit a UTF8 error but don't yet know whether it occurred in the middle of the stream or at the end of the stream (where it may just be a truncated document). I added the second way of setting truncated bytes after I wrote this code block to fix a bug, and I'm realizing that this flow is pretty confusing now.
I think I'll rewrite this to be clearer/more explicit about what it's trying to do.
Oh yeah, that's right.
if (m_doc_it != m_stream.end()) {
    // Regular UTF-8 error in the middle of the document stream.
    m_error_code = simdjson::error_code::UTF8_ERROR;
Not sure there is actually a case of a UTF-8 error in the middle of the document stream. For "{\"a\"\xff: 1, \"b\": 2}\n{\"a\": 1, \"b\": 2}\n", simdjson reports UTF8_ERROR, and on error cases operator++ just sets finished to true, which results in m_doc_it == m_stream.end(). Can you come up with a case of such a regular UTF-8 error in the middle of the document stream?
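For reference, a minimal standalone reproduction of the behaviour described here, assuming the simdjson DOM parse_many API (the exact API clp-s uses may differ):

#include <cstddef>
#include <iostream>
#include <string>
#include <simdjson.h>

int main() {
    // The example input from above: an invalid byte inside the first record,
    // followed by a perfectly valid second record.
    std::string input{"{\"a\"\xff: 1, \"b\": 2}\n{\"a\": 1, \"b\": 2}\n"};
    simdjson::padded_string buffer{input};
    simdjson::dom::parser parser;
    simdjson::dom::document_stream stream;
    if (auto error = parser.parse_many(buffer).get(stream)) {
        std::cerr << simdjson::error_message(error) << '\n';
        return 1;
    }
    std::size_t num_docs_visited{0};
    for (auto doc : stream) {
        ++num_docs_visited;
        if (doc.error()) {
            std::cout << "document " << num_docs_visited << ": "
                      << simdjson::error_message(doc.error()) << '\n';
        }
    }
    // If operator++ sets "finished" on error as described above, iteration
    // stops after the first (broken) document and never reaches the valid one.
    std::cout << "documents visited: " << num_docs_visited << '\n';
}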
Yeah, I'm worried about the case where a developer creates a string field with an invalid UTF-8 character (e.g. they copy the hex for some special character off the web incorrectly). If incrementing the iterator always leads to the end iterator on error, we'll have to approach this case differently. Maybe we could add a "maximum document size" configuration option with a default of e.g. 200MB, then report a UTF8 error if either EOF is hit or the buffer size exceeds the maximum document size? (And also introduce a new error for the document size being too large?)
I'm open to other ideas too, though.
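Something like the following guard is what I have in mind; cMaxDocumentSize, the at_eof flag, and the helper name are all hypothetical, and the 200 MiB default is just the example number from above:

#include <cstddef>

// Hypothetical configuration option with the suggested 200 MiB default.
constexpr std::size_t cMaxDocumentSize{200ULL * 1024 * 1024};

// After hitting UTF8_ERROR at the end of the stream, decide whether to give
// up (and surface UTF8_ERROR, or a new "document too large" error) rather
// than doubling the buffer again.
bool should_stop_growing_buffer(std::size_t buffer_size, bool at_eof) {
    return at_eof || buffer_size > cMaxDocumentSize;
}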
Yeah, I'm a little bit worried about the memory usage and efficiency. From what I've noticed, this if branch cannot catch such errors, and we will keep doubling the buffer size until reaching the end of the file. If we cannot distinguish between different UTF-8 errors from simdjson, the "maximum document size" configuration option could work.
It seems that when encountering parsing errors, clp-s reports something like "2024-03-15T13:42:50.213-04:00 [error] FileWriter not closed before being destroyed - may cause data loss" because we don't close the ArchiveWriter.
Yeah, I can have it close the ArchiveWriter for now, but we may want to throw out the data for this archive, and may want to avoid recording this archive in the metadata db.
That's right.
It seems that the error message is still printed for TimestampDictionaryWriter because m_archive_writer->close() only closes the local timestamp dictionary. Do you want to fix this problem now, or merge the no-subarchive PR first (which needs your approval)?
Sure, let's merge no-subarchive first. I don't mind handling the merge conflicts.
The commit message looks good to me.
References
This PR may directly solve #289; even if it doesn't, the improved error reporting should make the underlying issue more obvious.
Description
The changes in this PR aim to:
1. improve error reporting for JSON parsing during compression,
2. improve error handling for JSON parsing during compression, and
3. accurately compute the amount of unconsumed data when a read buffer ends part-way through a JSON document.
The first and second points are mostly addressed by being more careful about checking for simdjson errors and passing them along.
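As an illustration of that pattern, here is a hypothetical helper (not code from this PR): the error code from every simdjson call is inspected and returned to the caller rather than swallowed.

#include <simdjson.h>

// Check the simdjson error code on the parse call and pass it along; a
// hypothetical helper written to illustrate the pattern, not the PR's code.
simdjson::error_code parse_document(
        simdjson::dom::parser& parser,
        simdjson::padded_string const& json,
        simdjson::dom::element& out
) {
    if (auto error = parser.parse(json).get(out)) {
        return error;  // propagate instead of silently dropping the document
    }
    return simdjson::SUCCESS;
}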
For the third point, each time we start parsing a valid JSON document we get simdjson to tell us the position of the start of the following JSON document. With simdjson's parse_many feature, simdjson provides an iterator over all valid JSON documents it finds in a provided buffer and reports any leftover unparsed data via the truncated_bytes() function. This handles cases where a buffer is split half-way through a JSON document. Generally this works; however, if the buffer terminates half-way through a UTF-8 sequence, simdjson provides an iterator to the incomplete document at the end of the stream with a UTF8 error set, and truncated_bytes() will be 0. In this case we calculate the number of truncated bytes using our running notion of the start of the next JSON document.
Previously we handled this case by seeking backwards from the end of the buffer until we found a newline character, which was more error-prone (e.g. it only works for ndjson).
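A minimal standalone sketch of the parse_many/truncated_bytes behaviour described above, again assuming the simdjson DOM API rather than clp-s's actual wrapper code:

#include <iostream>
#include <string>
#include <simdjson.h>

int main() {
    // Two NDJSON records, with the buffer cut off half-way through the second.
    std::string input{"{\"a\": 1, \"b\": 2}\n{\"a\": 1, \"b\""};
    simdjson::padded_string buffer{input};
    simdjson::dom::parser parser;
    simdjson::dom::document_stream stream;
    if (auto error = parser.parse_many(buffer).get(stream)) {
        std::cerr << simdjson::error_message(error) << '\n';
        return 1;
    }
    for (auto doc : stream) {
        if (false == doc.error()) {
            std::cout << "parsed: " << doc.value_unsafe() << '\n';
        }
    }
    // Bytes at the end of the buffer that did not form a complete document;
    // the caller would re-read them at the start of the next buffer.
    std::cout << "truncated bytes: " << stream.truncated_bytes() << '\n';
}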
Validation performed