fix(file source): high CPU usage after async file server migration#25064
fix(file source): high CPU usage after async file server migration#25064fcfangcc wants to merge 2 commits intovectordotdev:masterfrom
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
This comment was marked as spam.
This comment was marked as spam.
b01a0e6 to
1621953
Compare
1621953 to
2fc6117
Compare
|
I have read the CLA Document and I hereby sign the CLA |
2fc6117 to
250d6bd
Compare
The file regression tests only manages active files. I wonder if there would be a way to inject some static files into the test directory to replicate this. It would be nice to be able to demonstrate this is resolved in a test that is run regularly, as the regression tests are, rather than the benchmarks which aren't. |
|
@bruceg It’s somewhat challenging because it doesn’t actually slow down reading (assuming sufficient resources)—it merely consumes more resources. I don’t currently have a good idea for this. Regression tests aren’t designed to measure performance—they can, at best, verify whether the backoff mechanism is triggered. |
|
@codex review |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
| const EOF_READ_BACKOFF_MIN: Duration = Duration::from_millis(1); | ||
| const EOF_READ_BACKOFF_MAX: Duration = Duration::from_millis(250); |
There was a problem hiding this comment.
These look like sensible defaults, we can tweak them later if needed
1a75843 to
a3e9c20
Compare
|
@thomasqueirozb changelog added. |
Summary
This PR addresses a CPU regression introduced by the async file source changes, #24058 .
The async migration converted
FileWatcher's reader from synchronousstd::io::BufReadtotokio::io::AsyncBufRead. The critical behavioral difference is thattokio::io::BufReader::fill_buf().awaitreturns immediately with an empty buffer when the underlying file is at EOF — it is a non-blocking, zero-cost poll that completes in the same tick.Previously, the file server's main loop ran inside
spawn_blockingand relied on a global backoff (sleep up to 2048ms when no bytes were read globally). While the per-filefill_buf()was also non-blocking at EOF in the sync version, the overall loop cadence was naturally throttled by the blocking thread context and the global sleep.After the async conversion, the main loop runs as a normal async task. On each iteration it calls
should_read()→read_line()→fill_buf().awaitfor every watched file. For idle files at EOF, eachread_line()call completes almost instantly but still incurs the overhead of the async state machine, buffer checks, and the function call chain (read_line→read_until_with_max_size→fill_buf). With hundreds of idle files, this tight loop burns significant CPU doing no useful work.The global backoff (
backoff_cap, max 2048ms) only kicks in after iterating through all watchers, so it cannot prevent the per-file polling overhead within each loop iteration.What Changed
1. Add per-watcher EOF backoff
FileWatchernow backs off after repeated EOF reads instead of polling at the same rate while the file remains idle.This reduces unnecessary wakeups and polling work when a small number of files remain active and many others have already reached EOF.
2. Remove per-read boxing from the shared async buffer read path
The shared
read_until_with_max_sizehelper now takes a borrowed reader directly instead of wrapping the reader for each call.This keeps the outer reader abstraction intact, but removes extra work from the line-read hot path.
3. Add a benchmark for the regression scenario
A new benchmark was added for one active file together with many idle watched files. This is the workload shape that exposed the regression.
The benchmark lives in
benches/files.rsunder thefiles/idle_watchersgroup and measures:0idle watched files128idle watched files512idle watched filesVector configuration
How did you test this PR?
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.make fmtmake check-clippy(if there are failures it's possible some of them can be fixed withmake clippy-fix)make testgit merge origin masterandgit push.Cargo.lock), pleaserun
make build-licensesto regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.