-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050
base: master
Are you sure you want to change the base?
Conversation
ccdfe60
to
ae39cc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @roykim98 ! I left a couple of comments about error handling, but overall this approach seems good to me. I appreciate the added tests.
lib/file-source/src/fingerprinter.rs
Outdated
let mut magic = vec![0u8; magic_header_bytes.len()]; | ||
|
||
if fp.read_exact(&mut magic).is_ok() | ||
&& fp.seek(SeekFrom::Start(0)).is_ok() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we always need to seek back even if the read_exact
failed, no?
Also, if the seek fails, I think we should return an error (i.e. change the function signature here to return a Result
) since the file pointer will be in an invalid state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good point. Will change.
lib/file-source/src/fingerprinter.rs
Outdated
let magic_header_bytes = compression_algorithm.magic_header_bytes(); | ||
let mut magic = vec![0u8; magic_header_bytes.len()]; | ||
|
||
if fp.read_exact(&mut magic).is_ok() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this fails, I think it could be due to one of two reasons:
- Not enough bytes
- Read interrupted
If the read is interrupted, I think we likely want to return an error since we weren't able to detect whether the file is compressed.
If there aren't enough bytes, we probably want to try again later since even uncompressed files, I think, should be a minimum of two bytes (a character and a newline) 🤔 The risk is that the file is actually compressed, there just aren't enough bytes to determine that yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errors from read_exact
are documented here: https://doc.rust-lang.org/std/io/trait.Read.html#method.read_exact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jszwedko , I appreciate the review. I think I follow most of your points
I guess it would make sense most to:
- Check each magic bytes header in the file. If find one that matches an algorithm, then short-circuit and return the corresponding enum
- If all header checks fail, it means that this file is either:
- Uncompressed, in which case we should return
None
, so we can fingerprint as-is - Using an unsupported compression algorithm, in which case we should return
None
, so we can fingerprint as-is - Using a supported compression algorithm, but the read was interrupted--return an error if read_exact fails for
[ErrorKind::Interrupted]
. - Using a supported compression algorithm, but there aren't enough bytes to determine that yet
- So in this scenario, I guess you're suggesting this is some kind of compressed stream that doesn't have a header written to it yet, and we should throw an error, as this would manifest as
[ErrorKind::UnexpectedEof]
. - There is one tiny edge case here that will cause a potentially valid assessment to instead fail, which is if the compression algorithm we're checking happens to have a bigger header than the intended compression algorithm. e.g.) zstd's magic bytes are 4 bytes, assume the file is 3 bytes gzipped (gzip headers are 2 bytes, 1 byte of compressed content). If we throw an error at zstd before checking gzip, then we'd needlessly try again.
- I guess this can be prevented by deterministically checking from smallest compression algorithms to the largest.
- So in this scenario, I guess you're suggesting this is some kind of compressed stream that doesn't have a header written to it yet, and we should throw an error, as this would manifest as
- Uncompressed, in which case we should return
Lastly:
we probably want to try again later
I presume this is covered up the stack if fingerprinting fails? I can investigate myself, but if you happen to have the answer handy, it'd be helpful
Made some final tests here:
Made a 3-line file called
Loaded them up in the source, and then compressed it to a new inode, Created a similarly named file
All had the same fingerprint
|
Signed-off-by: Jesse Szwedko <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @roykim98 ! I refactored the check
function, but otherwise this looks good to me! I appreciate the thorough testing.
Signed-off-by: Jesse Szwedko <[email protected]>
fp.seek(SeekFrom::Start(0))?; | ||
fp.read_exact(&mut magic)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jszwedko For my own understanding, wouldn't you still want to reset the position in the event that read_exact
fails? That's sort of why I did the round-about way of checking the errors after executing both functions instead of using the syntactic sugar for ?
Head branch was pushed to by a user without write access
One of the workflows was failing due to some whitespaces in the *.cue files, which I just removed, but it revoked the automerge label and build checks. Looks like you may have to re-add the automerge label, sorry for the inconvenience @jszwedko . Ran |
These additional checks pass locally. |
Also ran these |
Summary
This PR addresses #13193.
This PR will handle the case where attempting to fingerprint compressed content. It will also handle a use case with log rotation, where an uncompressed active file might be monitored and then rotated into a compressed file with a new inode by a log rotation service. By comparing the uncompressed lines as a source of truth, it prevents messages from processing the files 2X.
I chose to explicitly check the magic bytes header by enumerating the signatures to make checking the different headers consistent. I checked a couple compression libraries, and I figured it would be complicated to try to move the ownership of a reader into multiple decoders with inconsistent functions that did not all permit easy checking of the magic header. Of course, if only gzip is expected to be supported for the foreseeable future, this is a moot point and we can just fall back to instantiating the gzip library to check
BREAKING CHANGE: When sourcing from compressed files,
ignored_header_bytes
will no longer look at the compressed file's bytes, which would include the magic bytes for the compression header. Instead, it will ignore the bytes from the uncompressed content. Similarly,lines
will no longer look for new line delimiters in the compressed content, but the uncompressed content. Arguably, both of these current mechanisms as a bug, as compressed content would not have any explicit lines or intentional header aside from the magic bytes.Change Type
Is this a breaking change?
How did you test this PR?
Unit tests
Does this PR include user facing changes?
Checklist
References
file
source:checksum
fingerprint is not correct with gzipped files #13193