in_tail: only rely on fstat() to detect file rotation #10280

Open: 7 commits to be merged into base: master

david-garcia-garcia

@david-garcia-garcia david-garcia-garcia commented Apr 30, 2025

When reading logs from NFS, the results returned by fstat() might be outdated. This leads to looped ingestion: the plugin detects truncation over and over again even though the file has not been truncated, because it compares the current stream offset with the file size reported in the metadata (which is stale).

This PR solves this by relying exclusively on the information provided by fstat() to decide whether or not a file has been rotated or truncated. Previously the check used the stream offset together with the file size reported by fstat(); the offset could be larger than that file size because the fstat() result comes from a cache that takes some time to update.
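
The difference between the two checks can be sketched as follows. This is a minimal illustration, not the plugin's code: the function names `truncated_by_offset` and `truncated_by_size` and their parameters are hypothetical, and the only assumption is the one described above, namely that the read offset can run ahead of a cached fstat() size.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only; these are not
 * Fluent Bit functions.
 */

/* Offset-based check: compares the reader's stream offset with the
 * size reported by fstat(). With a stale (cached) size, the offset
 * can be larger even though the file was never truncated. */
static int truncated_by_offset(int64_t offset, int64_t stat_size)
{
    return offset > stat_size;
}

/* fstat()-only check: compares the size from the previous fstat()
 * call with the size from the current one. Both values come from the
 * same source, so a stale cache yields "no change" rather than a
 * false truncation. */
static int truncated_by_size(int64_t prev_stat_size, int64_t stat_size)
{
    return stat_size < prev_stat_size;
}
```

For example, with a cached size of 1000 bytes on both polls and a reader that has already consumed 1500 bytes, `truncated_by_offset(1500, 1000)` reports a truncation while `truncated_by_size(1000, 1000)` does not.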

Fixes #10276

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@david-garcia-garcia david-garcia-garcia changed the title [WIP] New option for Tail input "truncate_min_threshold" [WIP] in_tail: new option for Tail input "truncate_min_threshold" Apr 30, 2025
@david-garcia-garcia
Author

david-garcia-garcia commented May 2, 2025

Example configuration file

[SERVICE]
    Flush                     6
    Log_Level                 debug
    Parsers_File              parsers.conf
    log_file                  /dev/stdout  

[INPUT]
    Name             tail
    DB               /data/flb_logs.db
    DB.locking       true
    Path             /processlogs/aks-dev/traefik/traefikapp/access.log
    Parser           json_traefik
    Tag              logs.traefik
    Refresh_Interval 120
    Mem_Buf_Limit    16MB
    Skip_Long_Lines  On
    Inotify_Watcher  false
    Read_from_Head   false
    Buffer_Max_Size  64k
    Offset_Key       offset
    Path_Key         path
    read_newly_discovered_files_from_head false

[FILTER]
    Name modify
    Match *
    Set cluster_name ${NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME}

[OUTPUT]
    Name            newrelic
    Match           *
    Alias           newrelic-logs-forwarder
    licenseKey      ${LICENSE_KEY}
    endpoint        ${ENDPOINT}
    SendMetrics     ${SEND_METRICS}

Debug logs file truncation

[2025/05/03 05:20:57] [debug] [input:tail:tail.1] adjust_counters: inode=9223444303974498304 file truncated /processlogs/aks-dev/traefik/traefikapp/access.log (diff: -881916 bytes)                                                                                                 

@david-garcia-garcia david-garcia-garcia changed the title [WIP] in_tail: new option for Tail input "truncate_min_threshold" [WIP] in_tail: only rely on fstat() to detect file rotation May 3, 2025
@david-garcia-garcia david-garcia-garcia changed the title [WIP] in_tail: only rely on fstat() to detect file rotation in_tail: only rely on fstat() to detect file rotation May 3, 2025
@edsiper
Member

edsiper commented May 8, 2025

thanks for contributing this PR.

This is a very sensitive change (actually this is one of the parts of the plugin I avoid to touch :D ), wondering how we can extend testing to avoid regressions, I remember there are a couple of corner cases.

adding @leonardo-albertovich as extra eyes for this one.

@edsiper edsiper added this to the Fluent Bit Next milestone May 8, 2025
@@ -119,15 +119,21 @@ static int tail_fs_check(struct flb_input_instance *ins,
            continue;
        }

        int64_t size_delta = st.st_size - file->size;
        if (size_delta != 0) {
            file->size = st.st_size;
Review comment by the author (david-garcia-garcia) on the diff above:

This is the only place in this PR where the change is not restricted to the method scope and might have any impact outside the method. This file->size assignment was not here before (and it could be removed as we only need size_delta to detect truncation). I introduced this only for consistency with the other implementations.
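
A self-contained sketch of the pattern in that hunk, under stated assumptions: `struct tracked_file`, `enum size_change`, and `check_size` are hypothetical names introduced here, and only the `size_delta` computation and the `file->size` assignment mirror the diff. Whether other parts of the plugin also update the stored size is not shown in this hunk.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-file state; only the field that
 * the hunk touches is modeled here. */
struct tracked_file {
    int64_t size; /* size seen at the previous fstat() poll */
};

enum size_change { SIZE_UNCHANGED, SIZE_GREW, SIZE_SHRANK };

/* Compare the freshly reported st_size with the stored size.
 * Storing the new size means a given change is reported once, on the
 * poll where fstat() first reflects it. */
static enum size_change check_size(struct tracked_file *file,
                                   int64_t st_size)
{
    int64_t size_delta = st_size - file->size;

    if (size_delta == 0) {
        return SIZE_UNCHANGED;
    }

    file->size = st_size;
    return size_delta < 0 ? SIZE_SHRANK : SIZE_GREW;
}
```

In this sketch, dropping the assignment would make every later poll compare against the original size again, so the function keeps the stored size in step with what fstat() last reported.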

@david-garcia-garcia
Author

@edsiper

this is a very sensitive change (actually this is one of the parts of the plugin I avoid to touch :D )

I had the feeling file_tail might be one of those things that was implemented early on and that everyone uses. The kind of thing you don't want to change if it's not broken. Reading logs from NFS might not be the most common use case, but as people move workloads to the cloud, shared remote storage will become more common. Plus, this bug only surfaces with some particular NFS configurations where file metadata is cached.

This PR is actually the second approach to solving the issue (I had a fix deployed that worked, but it was too complex and overthought). The current change proposal is very limited in scope, and its impact can be easily grasped by reading the code changes.

The only possible side effect of this change is that, on NFS with a metadata cache, detecting truncation might be delayed until the metadata cache is updated. But this is in any case much better than having false truncation/rotation detections that lead to repeated log ingestion.

I've been running a fork with this fix in production for some days now without issues.

wondering how we can extend testing to avoid regressions

Given what this PR tries to solve, I don't think there is a feasible way of testing it, as the bug relies on the stream offset and fstat() providing unsynchronized information, which is something you cannot easily produce artificially.

Successfully merging this pull request may close these issues.

Tail Input Incorrectly Detects rotation in NFS with Metadata Cache