
Conversation

exupero
Contributor

@exupero exupero commented Sep 25, 2025

What changes are proposed in this pull request?

#6288 allows asynchronous background tasks to accumulate without limit, on the assumption that running predictions on a batch takes longer than saving that batch to the DB. In cases where saving to the DB takes longer, the number of async tasks can grow unboundedly and could cause the process to run out of memory.

This PR wraps the batch iterator in a new utility function that limits the number of outstanding background tasks. After each iteration of the source iterator, if there are more background tasks than the specified limit, the oldest tasks are waited on until the backlog shrinks to the limit, at which point the next item is taken from the iterator.
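The shape of such a wrapper might look like the sketch below. This is illustrative only, not the actual FiftyOne implementation; the name `limit_backlog` and the defaults are stand-ins:

```python
import concurrent.futures


def limit_backlog(iterator, limit=10, max_workers=1):
    """Yield ``(submit, item)`` pairs, blocking whenever more than
    ``limit`` background tasks are outstanding."""
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:

        def submit(fn, *args, **kwargs):
            future = executor.submit(fn, *args, **kwargs)
            futures.append(future)
            return future

        for item in iterator:
            yield submit, item

            # Backpressure: wait on the oldest tasks until the backlog
            # is back down to the limit
            while len(futures) > limit:
                futures.pop(0).result()

        # Drain any remaining tasks before finishing
        for future in futures:
            future.result()
```

Because the wait always targets the oldest future, at most `limit` tasks (plus their queued arguments) are alive at any time, which is what bounds memory.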

This approach is open to discussion. If writing to the DB is the bottleneck and creates backpressure on the prediction pipeline, then as soon as enough async tasks accumulate, throughput degenerates to the same as never using async tasks at all. In that case, we might prefer simplicity and revert #6288 to go back to a single thread.

How is this patch tested? If it is not, please explain why.

Ran the following script on a machine with a GPU, with some added logging that recorded how many tasks had accumulated. Logs showed the expected limits were observed.

I also tried using memory-profiler, but the profiles showed the same memory characteristics as the original code. I suspect memory usage is dominated by the downloaded samples, not by the data needed for the async tasks, so the effect of limiting the queue doesn't show:
[Figure_1: memory profile plot]

@jacobsela has some more sophisticated testing he might be able to apply.

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

@exupero exupero requested a review from jacobsela September 25, 2025 21:45
Contributor

coderabbitai bot commented Sep 25, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@exupero
Contributor Author

exupero commented Sep 25, 2025

@jacobsela are you able to run your analysis on this branch to see if it makes any difference?

```diff
 yield submit

-for future in _futures:
+for future in futures:
```
Contributor Author


These lines are an opportunistic cleanup in the original code.

With this branch this function is no longer used in fiftyone, so it could be removed, but the code has been released so there could be third-party uses of it. Suggestions on whether to keep it or get rid of it?

Comment on lines 500 to 501:

```python
limit=10,
max_workers=1,
```
Contributor Author


Some defaults. Ideally these would be configurable by the user.

Contributor


What is the reason for a limit of 10?

Contributor Author


Arbitrary. Didn't have any reason to choose something different. @jacobsela might be able to weigh in with an informed opinion.

```python
        return future

    for item in iterator:
        yield submit, item
```
Contributor


is this supposed to be submit(item)?

Contributor Author


No. This is yielding the submit function and the current item to the consumer loop, so the consumer can call submit where it wants, with whatever function it wants to run asynchronously.
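To illustrate the pattern, here is a hypothetical consumer loop. The wrapper below is a minimal stand-in for the PR's utility (repeated here so the example is self-contained), and `predict`/`save` are made-up placeholders for the prediction and DB-write steps, not FiftyOne APIs:

```python
import concurrent.futures


def bounded_iter(iterator, limit=10, max_workers=1):
    # Minimal stand-in for the PR's wrapper: yields (submit, item) pairs
    # and blocks when more than `limit` tasks are outstanding
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:

        def submit(fn, *args):
            futures.append(executor.submit(fn, *args))
            return futures[-1]

        for item in iterator:
            yield submit, item
            while len(futures) > limit:
                futures.pop(0).result()
        for future in futures:
            future.result()


saved = []


def predict(batch):  # stand-in for the foreground GPU prediction step
    return [x * 10 for x in batch]


def save(labels):  # stand-in for the slow background DB write
    saved.append(labels)


# The consumer decides what runs async: it calls submit(save, labels),
# not submit(item)
for submit, batch in bounded_iter([[1, 2], [3, 4]], limit=2, max_workers=1):
    labels = predict(batch)  # blocking, in the consumer's thread
    submit(save, labels)     # queued on the background executor
```

So `submit(item)` would be wrong: the item travels alongside `submit` precisely so the consumer can transform it first and then choose which function to hand off.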

@exupero
Contributor Author

exupero commented Oct 2, 2025

@jacobsela any word on whether this branch makes any difference in your benchmarks?

@exupero exupero force-pushed the bugfix/ess/limit-async-queue branch from 6e348e9 to b1596c7 on October 6, 2025 at 19:52
@exupero
Contributor Author

exupero commented Oct 7, 2025

Profiled memory on this branch with @jacobsela's testing script and found that limiting the number of background tasks doesn't seem to help. Even when using a limit of 0, which should be behaviorally equivalent to develop before #6288 (i.e. each batch's prediction is blocked on the previous batch being saved), memory usage climbs almost as high as without a limit. Increasing the number of workers speeds up execution but doesn't use that much less memory.
[visualization(14): memory usage comparison plot]
(Note: the light blue run on 2000 samples, "queue limit 2, 2 workers", was aborted before finishing.)

@exupero
Copy link
Contributor Author

exupero commented Oct 7, 2025

Considering the results found for #6389, the increasing memory usage appears to be the result of an interaction between async_executor and SaveContext (perhaps the SaveContext instance doesn't write batches on its usual schedule, and so doesn't flush data as soon?).

@exupero exupero closed this Oct 7, 2025