
fix(file source): high CPU usage after async file server migration#25064

Open
fcfangcc wants to merge 2 commits into vectordotdev:master from fcfangcc:master-fcfangcc

Conversation

@fcfangcc

@fcfangcc fcfangcc commented Mar 30, 2026

Summary

This PR addresses a CPU regression introduced by the async file source changes, #24058 .

The async migration converted FileWatcher's reader from synchronous std::io::BufRead to tokio::io::AsyncBufRead. The critical behavioral difference is that tokio::io::BufReader::fill_buf().await returns immediately with an empty buffer when the underlying file is at EOF — it is a non-blocking, zero-cost poll that completes in the same tick.
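
A minimal sync sketch of that EOF contract (std's BufRead documents the same fill_buf behavior; tokio's async version simply completes the await in the same poll):

```rust
use std::io::BufRead;

fn main() -> std::io::Result<()> {
    // A reader whose underlying source is already exhausted.
    let mut reader = std::io::BufReader::new(std::io::Cursor::new(Vec::<u8>::new()));
    // fill_buf() at EOF returns an empty slice immediately; there is no
    // wait, which is why a loop polling many idle files spins at full speed.
    let buf = reader.fill_buf()?;
    assert!(buf.is_empty());
    Ok(())
}
```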

Previously, the file server's main loop ran inside spawn_blocking and relied on a global backoff (sleep up to 2048ms when no bytes were read globally). While the per-file fill_buf() was also non-blocking at EOF in the sync version, the overall loop cadence was naturally throttled by the blocking thread context and the global sleep.

After the async conversion, the main loop runs as a normal async task. On each iteration it calls should_read() → read_line() → fill_buf().await for every watched file. For idle files at EOF, each read_line() call completes almost instantly but still incurs the overhead of the async state machine, buffer checks, and the function call chain (read_line → read_until_with_max_size → fill_buf). With hundreds of idle files, this tight loop burns significant CPU doing no useful work.

The global backoff (backoff_cap, max 2048ms) only kicks in after iterating through all watchers, so it cannot prevent the per-file polling overhead within each loop iteration.

What Changed

1. Add per-watcher EOF backoff

FileWatcher now backs off after repeated EOF reads instead of polling at the same rate while the file remains idle.

  • The backoff grows for repeated EOF probes.
  • The backoff resets immediately after a successful read.
  • Active files keep their previous responsiveness.

This reduces unnecessary wakeups and polling work when a small number of files remain active and many others have already reached EOF.
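
The grow-and-reset logic can be sketched as follows. This is an illustrative EofBackoff type, not the actual FileWatcher field; the min/max constants mirror the values quoted later in the review thread:

```rust
use std::time::Duration;

// Bounds matching the constants discussed in the review below.
const EOF_READ_BACKOFF_MIN: Duration = Duration::from_millis(1);
const EOF_READ_BACKOFF_MAX: Duration = Duration::from_millis(250);

/// Tracks how long a single watcher should sleep after consecutive EOF reads.
struct EofBackoff {
    current: Option<Duration>,
}

impl EofBackoff {
    fn new() -> Self {
        Self { current: None }
    }

    /// A read hit EOF: double the delay, capped at the maximum.
    fn on_eof(&mut self) -> Duration {
        let next = match self.current {
            None => EOF_READ_BACKOFF_MIN,
            Some(d) => (d * 2).min(EOF_READ_BACKOFF_MAX),
        };
        self.current = Some(next);
        next
    }

    /// A read returned data: reset so active files stay responsive.
    fn on_read(&mut self) {
        self.current = None;
    }
}

fn main() {
    let mut b = EofBackoff::new();
    assert_eq!(b.on_eof(), Duration::from_millis(1));
    assert_eq!(b.on_eof(), Duration::from_millis(2));
    for _ in 0..10 {
        b.on_eof();
    }
    assert_eq!(b.on_eof(), Duration::from_millis(250)); // capped at the max
    b.on_read();
    assert_eq!(b.on_eof(), Duration::from_millis(1)); // reset after a successful read
}
```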

2. Remove per-read boxing from the shared async buffer read path

The shared read_until_with_max_size helper now takes a borrowed reader directly instead of wrapping the reader for each call.

This keeps the outer reader abstraction intact, but removes extra work from the line-read hot path.
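
A sync analogue of the signature change, with hypothetical function names (the real helper is async and takes an AsyncBufRead):

```rust
use std::io::BufRead;

// Before: the helper took a freshly boxed reader, so every call through
// this hot path paid a heap allocation for the wrapper.
fn read_line_boxed(mut reader: Box<dyn BufRead + '_>, out: &mut String) -> std::io::Result<usize> {
    reader.read_line(out)
}

// After: borrow the caller's reader directly; no per-call allocation.
fn read_line_borrowed<R: BufRead>(reader: &mut R, out: &mut String) -> std::io::Result<usize> {
    reader.read_line(out)
}

fn main() -> std::io::Result<()> {
    let mut reader = std::io::Cursor::new(b"hello\nworld\n".to_vec());
    let mut line = String::new();
    // The borrowed form can be called repeatedly without re-wrapping.
    read_line_borrowed(&mut reader, &mut line)?;
    assert_eq!(line, "hello\n");
    line.clear();
    read_line_boxed(Box::new(&mut reader), &mut line)?;
    assert_eq!(line, "world\n");
    Ok(())
}
```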

3. Add a benchmark for the regression scenario

A new benchmark was added for one active file together with many idle watched files. This is the workload shape that exposed the regression.

The benchmark lives in benches/files.rs under the files/idle_watchers group and measures:

  • 0 idle watched files
  • 128 idle watched files
  • 512 idle watched files

Vector configuration

How did you test this PR?

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
  • We recommend adding a pre-push hook; please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

@fcfangcc fcfangcc requested a review from a team as a code owner March 30, 2026 08:56
@github-actions
Contributor

github-actions bot commented Mar 30, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@fcfangcc fcfangcc changed the title fix(file source): back off idle EOF polling and remove per-read boxing fix(file source): high CPU usage after async file server migration Mar 31, 2026
@fcfangcc

This comment was marked as spam.

@github-actions github-actions bot added the domain: ci Anything related to Vector's CI environment label Mar 31, 2026
@github-actions github-actions bot removed the domain: ci Anything related to Vector's CI environment label Mar 31, 2026
@fcfangcc
Author

I have read the CLA Document and I hereby sign the CLA

@bruceg
Member

bruceg commented Mar 31, 2026

With hundreds of idle files, this tight loop burns significant CPU doing no useful work.

The file regression tests only manage active files. I wonder if there would be a way to inject some static files into the test directory to replicate this. It would be nice to be able to demonstrate this is resolved in a test that is run regularly, as the regression tests are, rather than in the benchmarks, which aren't.

@fcfangcc
Author

@bruceg It’s somewhat challenging because it doesn’t actually slow down reading (assuming sufficient resources); it merely consumes more resources. I don’t currently have a good idea for this.

Regression tests aren’t designed to measure performance—they can, at best, verify whether the backoff mechanism is triggered.

@pront
Member

pront commented Mar 31, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Contributor

@thomasqueirozb thomasqueirozb left a comment


Very nice fix, thanks!

Note that this is technically considered user facing since it affects Vector behavior, so we add a changelog for these types of fixes. Will merge once the changelog is added.

Comment on lines +22 to +23
const EOF_READ_BACKOFF_MIN: Duration = Duration::from_millis(1);
const EOF_READ_BACKOFF_MAX: Duration = Duration::from_millis(250);
Contributor


These look like sensible defaults, we can tweak them later if needed

@fcfangcc
Author

fcfangcc commented Apr 1, 2026

@thomasqueirozb changelog added.

