Commit

Update documentation
roykim98 committed Dec 18, 2024
1 parent 5d11fd8 commit ae39cc0
Showing 4 changed files with 12 additions and 3 deletions.
6 changes: 6 additions & 0 deletions changelog.d/22050-fingerprint-uncompressed-file-content.fix.md
@@ -0,0 +1,6 @@
+Changes the fingerprinter for file sources to use uncompressed file content
+as a source of truth when fingerprinting lines and checking
+ignored_header_bytes. Previously this was using the compressed bytes. Only gzip
+supported for now.
+
+authors: roykim98
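The changelog entry says only gzip is supported for now. As a rough illustration of how a reader could tell whether a file's raw bytes are gzip-compressed before deciding which bytes to fingerprint, here is a minimal sketch. The function name `looks_like_gzip` is hypothetical and is not taken from this commit; the check relies only on the two-byte gzip magic number (0x1f 0x8b) defined in RFC 1952.

```rust
/// Hypothetical helper (not from the commit): returns true if `bytes`
/// begins with the gzip magic number, ID1 = 0x1f and ID2 = 0x8b (RFC 1952).
fn looks_like_gzip(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0x1f, 0x8b])
}
```

A caller would pass the first few bytes read from disk; a plain-text log such as `b"2024-12-18 INFO started\n"` does not match, while any gzip member does.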
3 changes: 2 additions & 1 deletion src/sources/file.rs
@@ -298,6 +298,7 @@ pub enum FingerprintConfig {
 bytes: Option<usize>,
 
 /// The number of bytes to skip ahead (or ignore) when reading the data used for generating the checksum.
+/// If the file is compressed, the number of bytes refer to the header in the uncompressed content.
 ///
 /// This can be helpful if all files share a common header that should be skipped.
 #[serde(default = "default_ignored_header_bytes")]
@@ -306,7 +307,7 @@ pub enum FingerprintConfig {
 
 /// The number of lines to read for generating the checksum.
 ///
-/// If your files share a common header that is not always a fixed size,
+/// The number of lines are determined from the uncompressed content if the file is compressed.
 ///
 /// If the file has less than this amount of lines, it won’t be read at all.
 #[serde(default = "default_lines")]
3 changes: 2 additions & 1 deletion website/cue/reference/components/sources/base/file.cue
@@ -98,6 +98,7 @@ base: components: sources: file: configuration: {
 ignored_header_bytes: {
 description: """
 The number of bytes to skip ahead (or ignore) when reading the data used for generating the checksum.
+If the file is compressed, the number of bytes refer to the header in the uncompressed content.
 This can be helpful if all files share a common header that should be skipped.
 """
@@ -112,7 +113,7 @@ base: components: sources: file: configuration: {
 description: """
 The number of lines to read for generating the checksum.
-If your files share a common header that is not always a fixed size,
+The number of lines are determined from the uncompressed content if the file is compressed.
 If the file has less than this amount of lines, it won’t be read at all.
 """
3 changes: 2 additions & 1 deletion website/cue/reference/components/sources/file.cue
@@ -219,7 +219,8 @@ components: sources: file: {
 check](\(urls.crc)) (CRC) on the first N lines of the file. This serves as a
 *fingerprint* that uniquely identifies the file. The number of lines, N, that are
 read can be set using the [`fingerprint.lines`](#fingerprint.lines) and
-[`fingerprint.ignored_header_bytes`](#fingerprint.ignored_header_bytes) options.
+[`fingerprint.ignored_header_bytes`](#fingerprint.ignored_header_bytes) options. Note
+that for compressed files, these lines and header bytes refer to the uncompressed content.
 This strategy avoids the common pitfalls associated with using device and inode
 names since inode names can be reused across files. This enables Vector to properly
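The strategy described in this hunk (skip `ignored_header_bytes`, then checksum the first N lines, all measured against the uncompressed content) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: the function names `crc32` and `fingerprint` are hypothetical, the input is assumed to be already decompressed by the caller, a plain bitwise CRC-32 stands in for whichever CRC variant Vector actually uses, and including the trailing newline of each line in the checksum is a choice made for this sketch only.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Stand-in checksum for this sketch; the real CRC variant may differ.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            if crc & 1 == 1 {
                crc = (crc >> 1) ^ 0xEDB8_8320;
            } else {
                crc >>= 1;
            }
        }
    }
    !crc
}

/// Hypothetical fingerprint over ALREADY-DECOMPRESSED content:
/// skip `ignored_header_bytes`, then checksum the first `lines` lines.
/// Returns None when the content is shorter than the header or has
/// fewer than `lines` newline-terminated lines.
fn fingerprint(uncompressed: &[u8], ignored_header_bytes: usize, lines: usize) -> Option<u32> {
    // `get` yields None if the header is longer than the content.
    let body = uncompressed.get(ignored_header_bytes..)?;
    let mut seen = 0;
    let mut end = 0;
    for (i, &b) in body.iter().enumerate() {
        if b == b'\n' {
            seen += 1;
            if seen == lines {
                end = i + 1; // include the newline that closes line N
                break;
            }
        }
    }
    if seen < lines {
        return None; // not enough lines yet to fingerprint
    }
    Some(crc32(&body[..end]))
}
```

Because the input is the decompressed byte stream, a gzip file and its plain-text equivalent would yield the same value in this sketch, and content appended after the first N lines does not change it.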
