file source: checksum fingerprint is not correct with gzipped files #13193

Comments
Hi @jszwedko! Do you think the team would have some bandwidth to check on this issue?
I wonder if a potential, relatively easy solution would be to allow configuring

Unfortunately, this proposal would not detect when a previously-read plain text file has been compressed, thus having the same checksum. Doing a checksum of the data within the file is a much more involved modification.
That's an interesting idea. I think the source should have the same fingerprinting semantics whether or not the files are compressed, for a consistent user experience and to avoid surprising behaviour. However, I understand that doing it right is a bigger fish to fry, and the above suggestion might be enough in the meantime.
Confirming this is still an issue on Vector 0.28.0.
(#6338) It's sad that I can't read small gzip files.
Yes, you can use the
With `remove_after_secs = 1`, new files have the same inode.
Inode reuse is a complication here unfortunately 🙁 |
@jszwedko nice to see some activity on this issue! :) This is still important for us; in the meantime we have just been hoping that the bug doesn't trigger.
Hey! I could see introducing that feature, generally, to allow fingerprinting to be based on bytes rather than lines, but I am a bit concerned about the caveat that Bruce mentioned:
As this might be a common occurrence during file rotation, where previously plaintext files are compressed. To accurately fingerprint them, Vector would need to read the head of the file, uncompressed, to compare with the fingerprint from before the file was rotated. I think the better solution is to have Vector uncompress the head of the file to fingerprint it.
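The approach described here can be sketched as follows. This is a Python illustration of the idea, not Vector's actual (Rust) implementation; the helper name, window size, and file names are all hypothetical. Fingerprinting the first N *decompressed* bytes means a plain text file and its gzip-compressed rotation yield the same checksum:

```python
# Sketch: fingerprint the head of a file's logical (decompressed) content,
# so compressing the file during rotation does not change its fingerprint.
# All names here are hypothetical, not Vector's real API.
import gzip
import zlib

HEAD_BYTES = 256  # hypothetical fingerprint window


def fingerprint_head(path: str) -> int:
    """CRC32 of the first HEAD_BYTES of the file's decompressed content."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return zlib.crc32(f.read(HEAD_BYTES))


# A plaintext log and its compressed rotation now fingerprint identically:
data = b"".join(b"line %d\n" % i for i in range(200))
with open("app.log", "wb") as f:
    f.write(data)
with gzip.open("app.log.1.gz", "wb") as f:
    f.write(data)

assert fingerprint_head("app.log") == fingerprint_head("app.log.1.gz")
```

Checksumming the raw bytes of `app.log.1.gz`, by contrast, would produce a different value, which is exactly the mismatch being discussed.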
Oh, I didn't realise until now what Bruce really meant there 🤦. I see now that the issue would be plain text files previously processed by Vector that get compressed during rotation and thus should still be ignored by Vector after compression.

If I understand correctly, given that Vector does not currently fingerprint based on the content of compressed data, this behaviour (not detecting a plaintext file that gets compressed) should already be happening in Vector today, especially considering the bug I reported in this issue, where it is clear that the fingerprinter acts on the raw compressed data. In other words, as of today, users should not be mixing compressed and uncompressed files in the same file source when they can be the same files already processed. Users should be using the

Therefore, I think the suggestion from Bruce (adding a bytes-based checksum fingerprinter) would not actually change this behaviour and thus should not impact existing users anyway. In our use case, files are never presented in plain text first.

TL;DR: I do think Bruce's suggestion would help us (and others processing compressed files) without hurting existing users.
👍 agreed, it does seem to be an improvement even if it isn't a complete fix.
I'll be taking this up to handle a case of rotated log files: namely, the situation where Vector's file source monitors an uncompressed text file and computes a CRC based on the uncompressed lines, and then a log-rotation library rotates the uncompressed text file into a compressed one. This compression is typically done via a rename of the original file followed by compression into a new file, so inodes cannot be used to track this case. I'll also look to update the documentation with this change.
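A minimal demonstration of why inode tracking fails for this rotation scheme (POSIX semantics assumed; the file names are hypothetical): the rename step keeps the inode, and the compression step then creates an entirely new file.

```python
# "Rotate by rename": the renamed file keeps its inode, while the compressed
# copy written afterwards is a brand-new file with a new inode. Neither the
# inode nor a checksum of the raw compressed bytes links the .gz file back
# to the original. File names are illustrative.
import os

with open("service.log", "w") as f:
    f.write("line 1\n")

before = os.stat("service.log").st_ino
os.rename("service.log", "service.log.1")  # rotation: same inode, new name
after = os.stat("service.log.1").st_ino

assert before == after          # rename preserved the inode
assert not os.path.exists("service.log")  # original name is now free
# A compressor would next write service.log.1.gz (new inode) and delete
# service.log.1, severing both inode- and raw-checksum-based identity.
```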
Problem

The `file` source component is not reading gzipped files correctly for the purpose of fingerprinting with the `checksum` strategy. In particular, it seems to count `lines` in the compressed data instead of the decompressed data. Therefore, when the compressed data contains no newline characters, or fewer newlines than requested in the `lines` configuration, Vector refuses to process the file with a "file too small for fingerprinting" error.

Consider the following Vector pipeline configuration:
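The configuration block itself did not survive the page capture. A minimal file-source pipeline along these lines would reproduce the setup described (this is a hypothetical reconstruction, not the author's exact file; option names follow Vector's `file` source documentation):

```toml
# Hypothetical reconstruction of the pipeline under discussion.
[sources.gz_input]
type = "file"
include = ["input.txt.gz"]

[sources.gz_input.fingerprint]
strategy = "checksum"
# lines = 1  # the default; the experiments below vary this value

[sinks.out]
type = "console"
inputs = ["gz_input"]
encoding.codec = "text"
```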
The sample `input.txt.gz` file attached down below contains 200 lines of text of the form `line X`. The `input.txt` file attached is just the decompressed version of the former, for further testing. I am also including an `input2.txt.gz` file with just 10 lines, and its corresponding decompressed version `input2.txt`, for a further demonstration down below.

If you run the above Vector pipeline using the `input.txt.gz` file and the default lines (`1`), you obtain:

Which is correct and demonstrates that Vector can indeed transparently read gzipped files.
If we ask the fingerprinter to skip `4` lines before performing a `checksum`:

It can be seen that Vector cannot fingerprint the file due to it being "too small".
However, if we configure the fingerprinter to skip `3` lines, then it works again:

Where is this magic number `3` coming from? If we examine the `hexdump` of the compressed `input.txt.gz` file (GitHub won't color here 😢):

It can be seen that the file happens to contain three newline characters (`\x0a`)
, which explains the behaviour.

We noted that if the gzipped file has no newline characters (no matter how big it is), then the fingerprinter always reports the file as "too small" and Vector never processes it. This is how we discovered this issue. We tried setting `lines: 0`, to no avail. The second `input2.txt.gz` file is an example without newlines that cannot be processed by Vector because of this.

In summary, we believe the `lines` configuration should operate on the decompressed data; otherwise it makes little sense, since compressed data is binary rather than text-based.

Configuration
No response
Version
0.22.2 and tested back as far as 0.17.3
Debug Output
No response
Example Data
input.txt.gz
input.txt
input2.txt.gz
input2.txt
Additional Context
No response
References
No response