clp-s: Improve error reporting and error handling for JSON parsing during compression #310
Conversation
Nice work! Most of the code looks good. I do have one concern, though, regarding how we handle UTF8_ERROR.
@@ -3,6 +3,9 @@
#include <iostream>
#include <stack>

#include <simdjson.h>
Do we need to include this file?
Technically no, but I figure we use so many simdjson methods and classes in this file that we really should.
}

if (false == read_new_json()) {
    return false;
What if we get UTF8_ERROR, set m_truncated_bytes, and call read_new_json(), which returns true but encounters UTF8_ERROR again, so that m_doc_it != m_stream.end() and get_json() returns false? In that case we just exit the loop in JsonParser.cpp without parsing the new log, even though we could get a complete JSON log if we kept doubling the buffer size.
Good catch. This is somewhat annoying to handle correctly, but my instinct is that we can turn the core of the get_json function into another while loop and get rid of the behaviour where we can exit successfully at the start of the function or after the read_new_json call at the end of the function.
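To make that concrete, here is a rough, self-contained sketch of the single-loop shape I have in mind. Every name below (read_new_json, current_document_ok, the toy buffer-size bookkeeping) is a hypothetical stand-in for the real JsonFileIterator members, not the actual patch:

#include <cstddef>
#include <iostream>

// Toy stand-ins for the real iterator state; the names and values in this
// namespace are simplifications for illustration only.
namespace {
std::size_t buffer_size{16};
constexpr std::size_t cDocumentSize{100};  // bytes needed for the next document
constexpr std::size_t cFileSize{1024};     // bytes remaining in the input

// Doubles the buffer, as read_new_json() does on truncation; returns false
// once the buffer already covers all remaining input (i.e. EOF was hit).
bool read_new_json() {
    if (buffer_size >= cFileSize) {
        return false;
    }
    buffer_size *= 2;
    return true;
}

// True once the buffer is large enough to hold a complete document.
bool current_document_ok() {
    return buffer_size >= cDocumentSize;
}
}  // namespace

// The proposed single-loop get_json(): keep retrying with a larger buffer
// until we either land on a complete document or truly run out of input,
// instead of exiting successfully at the start of the function or right
// after a single read_new_json() call.
bool get_json() {
    while (true) {
        if (current_document_ok()) {
            return true;
        }
        if (false == read_new_json()) {
            return false;
        }
    }
}

int main() {
    std::cout << (get_json() ? "got document" : "failed") << " (buffer size "
              << buffer_size << ")\n";
}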
++m_doc_it;
// UTF8 error wasn't at the end of the stream -- terminate
if (m_doc_it != m_stream.end()) {
    m_error_code = simdjson::error_code::UTF8_ERROR;
Just curious, how can the code enter this branch?
The intention was for this code to execute once we've hit a UTF8 error but don't yet know whether it occurred in the middle of the stream or at the end of the stream (where it may just be a truncated document). I added the second way of setting truncated bytes after I wrote this code block to fix a bug, and I'm realizing that this flow is pretty confusing now.
I think I'll rewrite this to be clearer/more explicit about what it's trying to do.
Oh yeah, that's right.
if (m_doc_it != m_stream.end()) {
    // Regular UTF-8 error in the middle of the document stream.
    m_error_code = simdjson::error_code::UTF8_ERROR;
Not sure there is actually a case of a UTF-8 error in the middle of the document stream. For "{\"a\"\xff: 1, \"b\": 2}\n{\"a\": 1, \"b\": 2}\n", simdjson reports UTF8_ERROR, and on error cases operator++ just sets finished to true, which results in m_doc_it == m_stream.end(). Can you come up with a case of such a regular UTF-8 error in the middle of the document stream?
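For reference, a minimal standalone reproduction of the behaviour described here, assuming the simdjson DOM parse_many API (the exact API clp-s uses may differ):

#include <cstddef>
#include <iostream>
#include <string>
#include <simdjson.h>

int main() {
    // The example input from above: an invalid byte inside the first record,
    // followed by a perfectly valid second record.
    std::string input{"{\"a\"\xff: 1, \"b\": 2}\n{\"a\": 1, \"b\": 2}\n"};
    simdjson::padded_string buffer{input};
    simdjson::dom::parser parser;
    simdjson::dom::document_stream stream;
    if (auto error = parser.parse_many(buffer).get(stream)) {
        std::cerr << simdjson::error_message(error) << '\n';
        return 1;
    }
    std::size_t num_docs_visited{0};
    for (auto doc : stream) {
        ++num_docs_visited;
        if (doc.error()) {
            std::cout << "document " << num_docs_visited << ": "
                      << simdjson::error_message(doc.error()) << '\n';
        }
    }
    // If operator++ sets "finished" on error as described above, iteration
    // stops after the first (broken) document and never reaches the valid one.
    std::cout << "documents visited: " << num_docs_visited << '\n';
}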
Yeah, I'm worried about the case where a developer creates a string field with an invalid UTF-8 character (e.g. they copy the hex for some special character off the web incorrectly). If incrementing the iterator always leads to the end iterator on error, we'll have to approach this case differently. Maybe we could add a "maximum document size" configuration option with a default of e.g. 200MB, then report a UTF8 error if either EOF is hit or the buffer size exceeds the maximum document size? (And also introduce a new error for the document size being too large?)
I'm open to other ideas too, though.
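Something like the following guard is what I have in mind; cMaxDocumentSize, the at_eof flag, and the helper name are all hypothetical, and the 200 MiB default is just the example number from above:

#include <cstddef>

// Hypothetical configuration option with the suggested 200 MiB default.
constexpr std::size_t cMaxDocumentSize{200ULL * 1024 * 1024};

// After hitting UTF8_ERROR at the end of the stream, decide whether to give
// up (and surface UTF8_ERROR, or a new "document too large" error) rather
// than doubling the buffer again.
bool should_stop_growing_buffer(std::size_t buffer_size, bool at_eof) {
    return at_eof || buffer_size > cMaxDocumentSize;
}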
Yeah, I'm a little bit worried about the memory usage and efficiency. From what I've noticed, this if branch cannot catch such errors, and we will keep doubling the buffer size until reaching the end of the file. If we cannot distinguish between different UTF-8 errors from simdjson, the "maximum document size" configuration option could work.
It seems that when encountering parsing errors, clp-s reports something like "2024-03-15T13:42:50.213-04:00 [error] FileWriter not closed before being destroyed - may cause data loss" because we don't close the ArchiveWriter.
Yeah, I can have it close the ArchiveWriter for now, but we may want to throw out the data for this archive, and may want to avoid recording this archive in the metadata db.
That's right.
It seems that the error message is still printed for TimestampDictionaryWriter because m_archive_writer->close() only closes the local timestamp dictionary. Do you want to fix this problem now, or merge the no-subarchive PR first (which needs your approval)?
Sure, let's merge no-subarchive first. I don't mind handling the merge conflicts.
The commit message looks good to me.
References
This PR may directly solve #289; even if it doesn't, the improved error reporting should make the underlying issue more obvious.
Description
The changes in this PR aim to:
1. improve error reporting for JSON parsing during compression,
2. improve error handling for JSON parsing during compression, and
3. accurately compute the amount of unconsumed data when a read buffer ends part-way through a JSON document.
The first and second points are mostly addressed by being more careful about checking for simdjson errors and passing them along.
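As an illustration of that pattern, here is a hypothetical helper (not code from this PR): the error code from every simdjson call is inspected and returned to the caller rather than swallowed.

#include <simdjson.h>

// Check the simdjson error code on the parse call and pass it along; a
// hypothetical helper written to illustrate the pattern, not the PR's code.
simdjson::error_code parse_document(
        simdjson::dom::parser& parser,
        simdjson::padded_string const& json,
        simdjson::dom::element& out
) {
    if (auto error = parser.parse(json).get(out)) {
        return error;  // propagate instead of silently dropping the document
    }
    return simdjson::SUCCESS;
}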
For the third point, each time we start parsing a valid JSON document we get simdjson to tell us the position of the start of the following JSON document. With simdjson's parse_many feature, simdjson provides an iterator over all valid JSON documents it finds in a provided buffer and reports any leftover unparsed data via the truncated_bytes() function. This handles cases where a buffer is split half-way through a JSON document. Generally this works; however, if the buffer terminates half-way through a UTF-8 sequence, simdjson provides an iterator to the incomplete document at the end of the stream with a UTF8 error set, and truncated_bytes() will be 0. In this case we calculate the number of truncated bytes using our running notion of the start of the next JSON document.
Previously we handled this case by seeking backwards from the end of the buffer until we found a newline character, which was more error-prone (e.g. it only works for ndjson).
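A minimal standalone sketch of the parse_many/truncated_bytes behaviour described above, again assuming the simdjson DOM API rather than clp-s's actual wrapper code:

#include <iostream>
#include <string>
#include <simdjson.h>

int main() {
    // Two NDJSON records, with the buffer cut off half-way through the second.
    std::string input{"{\"a\": 1, \"b\": 2}\n{\"a\": 1, \"b\""};
    simdjson::padded_string buffer{input};
    simdjson::dom::parser parser;
    simdjson::dom::document_stream stream;
    if (auto error = parser.parse_many(buffer).get(stream)) {
        std::cerr << simdjson::error_message(error) << '\n';
        return 1;
    }
    for (auto doc : stream) {
        if (false == doc.error()) {
            std::cout << "parsed: " << doc.value_unsafe() << '\n';
        }
    }
    // Bytes at the end of the buffer that did not form a complete document;
    // the caller would re-read them at the start of the next buffer.
    std::cout << "truncated bytes: " << stream.truncated_bytes() << '\n';
}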
Validation performed