[DON'T MERGE] FOEPD-1977: Move label post-processing to background thread #6307
Conversation
Walkthrough

Refactors image model data loader prediction to a two-stage pipeline: compute raw outputs via `model.predict_all`, then apply a postprocessing hook (`output_processor`) using image sizes. Updates async submission and `save_batch` signatures to handle raw outputs plus `image_sizes`, and temporarily disables `model._output_processor` during the forward pass.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant DataLoader as DataLoader
    participant Model as Model
    participant OutputProc as output_processor
    participant Saver as save_batch / submit
    User->>DataLoader: Iterate batches
    DataLoader->>Model: imgs (+ samples if needed)<br/>fo_image_size (optional)
    rect rgba(200,220,255,0.3)
    note right of Model: Temporarily disable model._output_processor
    Model-->>DataLoader: output = predict_all(imgs[, samples])
    end
    DataLoader->>OutputProc: output, image_sizes,<br/>confidence_thresh
    OutputProc-->>DataLoader: labels_batch
    alt synchronous save
    DataLoader->>Saver: save_batch(sample_batch, output, image_sizes)
    Saver->>OutputProc: Postprocess to labels
    Saver-->>User: Saved labels
    else async submit
    DataLoader->>Saver: submit(save_batch, sample_batch,<br/>output, image_sizes)
    Saver-->>User: Scheduled
    end
    note over OutputProc,Saver: image_sizes defaults to [(None, None)] if missing
```
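For orientation, here is a minimal sketch of the loop the diagram describes. All names (`apply_image_model`, a `data_loader` yielding `(sample_batch, imgs)` dicts, the `label_field` default) are placeholders, not the actual `fiftyone/core/models.py` implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def apply_image_model(model, data_loader, label_field="predictions",
                      confidence_thresh=None):
    # grab the label converter up front; the forward pass returns raw output
    output_processor = getattr(
        model, "_output_processor", lambda output, *_, **__: output
    )
    # (the real code also temporarily sets model._output_processor = None via
    # fou.SetAttributes so that predict_all() skips label conversion)

    def save_batch(sample_batch, output, image_sizes):
        # postprocessing + DB writes happen on the background thread
        labels_batch = output_processor(
            output, image_sizes, confidence_thresh=confidence_thresh
        )
        for sample, labels in zip(sample_batch, labels_batch):
            sample.add_labels(labels, label_field=label_field)
            sample.save()

    with ThreadPoolExecutor(max_workers=1) as executor:
        for sample_batch, imgs in data_loader:
            image_sizes = imgs.pop("fo_image_size", [(None, None)])
            output = model.predict_all(imgs)  # raw output only
            executor.submit(save_batch, sample_batch, output, image_sizes)
```

The key point is that `predict_all()` returns raw output and the label conversion happens inside `save_batch` on the worker thread.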
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
🧹 Nitpick comments (3)
fiftyone/utils/torch.py (2)
868-874: Add parameter documentation for the new public methods.

The new public methods lack docstrings. Consider adding documentation to clarify their purpose and parameters, especially `image_sizes`, which appears to be a list of tuples. Add docstrings like:

```python
def preprocess_and_forward_pass(self, imgs):
    """Performs preprocessing and forward pass on the given images.

    Args:
        imgs: the batch of images to process

    Returns:
        the raw model output
    """
    return self._preprocess_and_forward_pass(imgs)
```
876-880: Add parameter documentation for postprocess method.

Similar to the above, the `postprocess` method would benefit from documentation. Add a docstring like:

```python
def postprocess(self, output, image_sizes):
    """Postprocesses the raw model output.

    Args:
        output: the raw model output from preprocess_and_forward_pass
        image_sizes: a list of (width, height) tuples for each image

    Returns:
        the processed predictions
    """
    return self._postprocess(output, image_sizes)
```

fiftyone/core/models.py (1)
511-511: Consider validating the extracted image_sizes.

While the fallback to `[(None, None)]` provides backward compatibility, consider adding validation to ensure the extracted `image_sizes` list has the expected length matching the batch size. Consider adding validation after line 511:

```python
image_sizes = imgs.pop("fo_image_size", [(None, None)])
if image_sizes and len(image_sizes) != len(sample_batch):
    logger.warning(
        "Image sizes count (%d) doesn't match batch size (%d)",
        len(image_sizes),
        len(sample_batch),
    )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- fiftyone/core/models.py (2 hunks)
- fiftyone/utils/torch.py (1 hunks)
- fiftyone/utils/transformers.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jacobsela
PR: voxel51/fiftyone#5761
File: fiftyone/utils/transformers.py:1288-1297
Timestamp: 2025-05-02T04:06:17.429Z
Learning: The `_get_image_size` function in transformers.py is intentionally kept consistent with the implementation in TorchImageModel, and should be refactored as part of a broader effort rather than in isolation.
🧬 Code graph analysis (3)
fiftyone/utils/torch.py (1)
fiftyone/utils/transformers.py (4)
- _preprocess_and_forward_pass (657-666)
- _preprocess_and_forward_pass (950-951)
- _postprocess (668-676)
- _postprocess (953-954)
fiftyone/core/models.py (2)
fiftyone/utils/torch.py (2)
- postprocess (876-877)
- preprocess_and_forward_pass (868-869)

fiftyone/core/utils.py (1)
- submit (3174-3177)
fiftyone/utils/transformers.py (2)
fiftyone/utils/torch.py (10)
- _preprocess_and_forward_pass (871-874)
- preprocess (800-802)
- preprocess (805-806)
- collate_fn (777-797)
- transforms (761-765)
- device (814-816)
- _forward_pass (951-952)
- _postprocess (879-880)
- _predict_all (390-400)
- _predict_all (902-949)

fiftyone/utils/ultralytics.py (3)
- collate_fn (444-453)
- _forward_pass (529-531)
- _predict_all (588-621)
🪛 Pylint (3.3.7)
fiftyone/utils/transformers.py
[error] 659-659: self.transforms is not callable
(E1102)
[refactor] 669-676: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: test / test-app
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (6)
fiftyone/utils/transformers.py (4)
657-666: LGTM! Clean refactoring of the prediction pipeline.

The new `_preprocess_and_forward_pass` method effectively consolidates preprocessing and forward pass logic into a single method, maintaining consistency with the two-stage pipeline architecture introduced across the codebase.
668-676: LGTM! Well-structured postprocessing method.

The `_postprocess` method cleanly handles both cases - with and without an output processor. The implementation is consistent with the interface defined in `fiftyone/utils/torch.py`.
678-681: LGTM! Proper orchestration of the three-stage pipeline.

The updated `_predict_all` method correctly implements the new pipeline: extract image sizes, run preprocessing/forward pass, then postprocess with sizes. The fallback to `[(None, None)]` ensures backward compatibility.
950-954: LGTM! Correct delegation pattern for zero-shot models.

The wrapper methods properly delegate to the parent class implementation, maintaining consistency across the transformer hierarchy.
fiftyone/core/models.py (2)
489-491: LGTM! Efficient asynchronous postprocessing implementation.

The refactored `save_batch` function correctly receives the raw output and image sizes, then performs postprocessing on the background thread as intended. This change successfully moves the label post-processing work off the main thread.
511-519: LGTM! Proper implementation of the two-stage pipeline.

The changes correctly extract image sizes, call `preprocess_and_forward_pass` to get raw output, and submit both to the async worker for postprocessing. The fallback to `[(None, None)]` ensures backward compatibility when `fo_image_size` is not provided.
@exupero Let's wait with this PR for a bit, I feel like the last one was rushed. This one is even more sensitive because of the model class changes. Additionally, if we are already changing the model interface, let's move it closer to the final goal of the generic
I was thinking about the numbers for a bit and I think most of the time saved in this PR comes from saving on I/O time between the CPU and GPU. When comparing with the mock PR I put up a few days ago, the diff in computation is as follows:
Those two points, combined with the fact that the GPU/CPU I/O in the mock PR was ~50s (roughly the amount of time saved in this PR), lead me to believe that the above assertion is correct. Now, I didn't offload the I/O in the other PR because I was getting a GPU memory leak (that I didn't want to deal with for the mockup) when doing the GPU-CPU I/O call in the other process; see the change in this line. However, what this leads me to believe is that we could get the combined effects of both PRs by having the "actual" post-processing run in another process while the I/O is threaded. Assuming we can stack the effects, we would be looking at ~100s for this code (going off the mock PR profiling).
This is pretty much what I had shown in my POC. However, to offload to a separate process and get any performance benefit, you must not be working with samples. You need to directly extract the updates from the model outputs and save them, both to see any benefit and to avoid unnecessary requests caused by samples and database connections not being serializable.
If we wanted to provide a new syntax where postprocessing is not performed in the main process, then we'd need to support it in a backwards-compatible way in the Model class that would allow apply_model() to continue working for all existing Model instances.
I believe what you want is a similar idea to the existing Model.preprocess == False, which tells Model.transforms to be applied by the data loader and not in predict()/predict_all().
For example, we could have Model.postprocess == None by default, but if a Model provides a non-None Model.postprocess, then predict()/predict_all() is assumed to provide a "raw" output that must be passed to Model.postprocess(). And then that can be called in a separate worker inside apply_model().
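A rough, self-contained sketch of that opt-in contract, using hypothetical names (`Model.postprocess` does not exist today; this only illustrates the shape of the idea):

```python
class Model:
    """Status quo: predict()/predict_all() return FiftyOne labels directly."""

    @property
    def postprocess(self):
        return None  # None => no separate postprocessing step

    def predict_all(self, imgs):
        return [f"label({x})" for x in imgs]


class RawOutputModel(Model):
    """Opt-in: predict_all() returns raw output; postprocess() converts it."""

    @property
    def postprocess(self):
        return self._convert_raw_output

    def predict_all(self, imgs):
        return [x * 2 for x in imgs]  # stand-in for raw model tensors

    def _convert_raw_output(self, raw_output, image_sizes=None):
        return [f"label({r})" for r in raw_output]


def apply_model(model, imgs):
    output = model.predict_all(imgs)
    if model.postprocess is not None:
        # this is the call apply_model() could hand to a background worker
        return model.postprocess(output)
    return output  # backwards-compatible path


print(apply_model(Model(), [1, 2]))           # ['label(1)', 'label(2)']
print(apply_model(RawOutputModel(), [1, 2]))  # ['label(2)', 'label(4)']
```

Existing models keep returning labels from `predict_all()` unchanged, while opted-in models expose a conversion step that can run off the main thread.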
```python
if isinstance(imgs, Exception):
    raise imgs

image_sizes = imgs.pop("fo_image_size", [(None, None)])
```
fo_image_size is unique to FiftyOneTransformer so it can't be referenced in fiftyone.core.models.
I should have noted that this is a proof of concept needing feedback on the proper way to apply it, so thanks for providing that feedback without my asking 💯.

@jacobsela can you provide benchmarks for reloading samples on a separate process? I spiked a branch that mimicked the serialization and deserialization logic in your #6293, but having to reload samples from the DB appeared to be significantly slower than sharing memory between threads (slower as in my test script appeared to hang, but was actually just going through background tasks at 15-20s per task). I might have missed something in your proof of concept, so I can share my spike code if you're interested.

My expectation is that the speedup in this branch is due to unblocking the GPU. Currently

Does Python cProfile data capture GPU-CPU I/O? If so, what should I look for? Or are you using a different tool to benchmark GPU-CPU I/O?
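For what it's worth, since CUDA calls are asynchronous, cProfile generally only attributes transfer cost to whichever Python call ends up synchronizing (e.g. `.cpu()` / `.numpy()`). One way to isolate the copy explicitly is with `torch.cuda.synchronize` or `torch.profiler`; a hedged sketch, assuming a CUDA device is available (not necessarily how the numbers in this thread were produced):

```python
import time
import torch

assert torch.cuda.is_available()

x = torch.randn(32, 3, 640, 640, device="cuda")
torch.cuda.synchronize()

start = time.perf_counter()
y = x.cpu()  # device-to-host copy
torch.cuda.synchronize()
print(f"GPU->CPU copy: {time.perf_counter() - start:.4f}s")

# torch.profiler breaks the same cost out per op (look for "Memcpy DtoH")
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    _ = x.cpu()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```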
@jacobsela I see |
Yes. For the code in this PR the CPU to GPU copy happens in the
Sure, I can share something basic, I'll ping you when it's ready. In the meantime note the profiling in the mock PR. This isn't a bottleneck as far as I can tell in our toy benchmark.
Were you serializing/deserializing every sample batch at every iteration? You'll see in the PR I sent that each worker gets a serialized

I believe the specific part of
bothers @kaixi-wang about this approach. As far as I understand the concern stems from the repeated
That may be happening to an extent, but I think if this was really the big part here we would see a sharper drop in total time. Let's talk about the profiling in the multiprocessing mock PR when you have a chance.
Actionable comments posted: 1
♻️ Duplicate comments (1)
fiftyone/core/models.py (1)
519-524: Do not assume dict batches or reference transformer-specific keys; also fix fallback length

- Calling `imgs.pop(...)` will raise on non-dict batches (e.g., tensors/lists).
- Referencing `"fo_image_size"` in core couples to a Transformers-only detail and has been flagged as undesirable.
- The fallback `[(None, None)]` length mismatches multi-sample batches.

Use a non-mutating, generic extraction with a robust fallback derived from sample metadata.
```diff
-        image_sizes = imgs.pop("fo_image_size", [(None, None)])
+        # Best-effort: use model-provided sizes if present; otherwise fall back to sample metadata
+        if isinstance(imgs, dict):
+            image_sizes = imgs.get("fo_image_size")
+        else:
+            image_sizes = None
+        if image_sizes is None:
+            image_sizes = [
+                (
+                    getattr(getattr(s, "metadata", None), "width", None),
+                    getattr(getattr(s, "metadata", None), "height", None),
+                )
+                for s in sample_batch
+            ]
```

Optional (preferred, future-proof): add a private hook on models that need special handling and remove any direct reference to `"fo_image_size"` from core:
"fo_image_size"from core:- if isinstance(imgs, dict): - image_sizes = imgs.get("fo_image_size") + sizes_extractor = getattr(model, "_get_image_sizes_from_batch", None) + image_sizes = sizes_extractor(imgs, sample_batch) if sizes_extractor else NoneThis aligns with prior guidance to keep image-size logic consistent with TorchImageModel and avoid hardcoding dataset keys.
🧹 Nitpick comments (2)
fiftyone/core/models.py (2)
493-499: Background-thread postprocessing looks good; verify thresholding isn't applied twice

`confidence_thresh` is applied inside `output_processor(...)` and again via `sample.add_labels(..., confidence_thresh=...)`. Ensure only one layer performs filtering to avoid unexpected drops near the boundary.

If postprocessing already filters, consider passing `confidence_thresh=None` to `add_labels`, or omitting the param there. I can draft a follow-up patch once you confirm the intended ownership of thresholding.
525-527: Avoid holding GPU tensors across the handoff to the background thread

If `output` contains GPU tensors, submitting them to the executor can keep GPU memory alive until postprocessing completes, increasing peak memory and contention. Consider moving tensors to CPU before submit or inside `output_processor` (non-blocking when pin_memory=True).

Example sketch (guarded, no hard dependency):

```python
try:
    import torch

    def _to_cpu(x):
        if isinstance(x, torch.Tensor):
            return x.detach().cpu()
        if isinstance(x, (list, tuple)):
            return type(x)(_to_cpu(e) for e in x)
        if isinstance(x, dict):
            return {k: _to_cpu(v) for k, v in x.items()}
        return x

    output = _to_cpu(output)
except Exception:
    pass
```

If you prefer, we can implement this inside the specific model's `_output_processor` to keep core generic.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- fiftyone/core/models.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-02T04:06:17.429Z
Learnt from: jacobsela
PR: voxel51/fiftyone#5761
File: fiftyone/utils/transformers.py:1288-1297
Timestamp: 2025-05-02T04:06:17.429Z
Learning: The `_get_image_size` function in transformers.py is intentionally kept consistent with the implementation in TorchImageModel, and should be refactored as part of a broader effort rather than in isolation.
Applied to files:
fiftyone/core/models.py
🧬 Code graph analysis (1)
fiftyone/core/models.py (2)
fiftyone/core/utils.py (2)
- SetAttributes (2718-2751)
- submit (3174-3177)

fiftyone/utils/torch.py (1)
- predict_all (868-886)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: e2e / test-e2e
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-app
- GitHub Check: lint / eslint
- GitHub Check: build / build
- GitHub Check: build
```python
output_processor = getattr(
    model, "_output_processor", lambda output, *_, **__: output
)
context.enter_context(fou.SetAttributes(model, _output_processor=None))
```
🛠️ Refactor suggestion
Avoid mutating models that don’t define _output_processor; guard and fall back cleanly
Setting a private attribute on all models can change behavior inadvertently. Only disable _output_processor when it exists; otherwise use a no-op postprocessor.
```diff
-    output_processor = getattr(
-        model, "_output_processor", lambda output, *_, **__: output
-    )
-    context.enter_context(fou.SetAttributes(model, _output_processor=None))
+    output_processor = getattr(model, "_output_processor", None)
+    if output_processor is None:
+        def output_processor(output, *_, **__):
+            return output
+    else:
+        context.enter_context(
+            fou.SetAttributes(model, _output_processor=None)
+        )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    output_processor = getattr(
-        model, "_output_processor", lambda output, *_, **__: output
-    )
-    context.enter_context(fou.SetAttributes(model, _output_processor=None))
+    output_processor = getattr(model, "_output_processor", None)
+    if output_processor is None:
+        def output_processor(output, *_, **__):
+            return output
+    else:
+        context.enter_context(
+            fou.SetAttributes(model, _output_processor=None)
+        )
```
🤖 Prompt for AI Agents
In fiftyone/core/models.py around lines 488 to 492, the code unconditionally
sets a private attribute _output_processor on every model which can mutate
models that don’t define it; change this so you only disable the processor when
it actually exists: check hasattr(model, "_output_processor") first, set
output_processor to model._output_processor if present else to a no-op lambda,
and only call context.enter_context(fou.SetAttributes(model,
_output_processor=None)) when the attribute exists so models without the
attribute are left untouched.
The issues are not just the COLLSCAN alone; it's the unnecessary database lookups to resolve samples when all you actually need is the filepath and a storage client to resolve media. It's overcomplicating everything in a way that impacts the ability to parallelize work.
Here's a less intrusive take that mirrors the technique in #6293 of manipulating the model's
Resolving filepaths and media is for loading and preprocessing. It's unrelated to this PR. Please limit discussion to the topic at hand.
Ok this is a way better starting point. To address the points:
Yeah, this isn't guaranteed across the board currently. It is a part of the
First, note that any change we would do (even if limited to

I strongly recommend implementing the generic

At the very least,
```diff
-    def save_batch(sample_batch, labels_batch):
+    def save_batch(sample_batch, output, image_sizes):
+        labels_batch = output_processor(
```
Is it actually more efficient to do this in a thread? How does the size of the output vectors compare to the final predicted objects?
It's not more efficient, it's more concurrent. Moving this to a background thread allows the output to be processed while the next batch's inputs are being processed and run. I.e.
(not to scale)
```mermaid
gantt
    title Async postprocessing
    dateFormat X
    axisFormat %s
    section Main thread
    Preprocess :active, preprocess, 0, 1
    Predict :active, predict, 1, 2
    Preprocess :active, preprocess, 2, 3
    Predict :active, predict, 3, 4
    section Background thread
    Postprocess :active, postprocess, 2, 3
    Write :active, write, 3, 4
    Postprocess :active, postprocess, 4, 5
    Write :active, write, 5, 6
```
With the test script, `output` is 4-5x larger (in bytes) than `labels_batch`, but since this is a thread, memory is shared, so it shouldn't incur any greater memory usage than the previous version.
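A self-contained toy version of that overlap, with `time.sleep` standing in for work that releases the GIL (GPU execution, DB I/O); the names are generic and not the actual fiftyone code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def preprocess_and_predict(batch):
    time.sleep(0.2)  # stands in for preprocessing + forward pass on the main thread
    return [x * 10 for x in batch]

def postprocess_and_write(output):
    time.sleep(0.2)  # stands in for label conversion + DB writes
    return [f"label({x})" for x in output]

batches = [[1, 2], [3, 4], [5, 6]]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as executor:
    futures = []
    for batch in batches:
        output = preprocess_and_predict(batch)
        # postprocessing of batch N overlaps with predict of batch N+1
        futures.append(executor.submit(postprocess_and_write, output))
    results = [f.result() for f in futures]
print(f"overlapped: {time.perf_counter() - start:.2f}s")  # ~0.8s vs ~1.2s serial
```

Serially this takes ~1.2s; with the single background worker it takes ~0.8s because batch N's postprocess/write overlaps batch N+1's preprocess/predict.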
@exupero apologies if you've already mentioned this somewhere, but do we have data on the performance benefit of offloading Torch -> FO conversion to a worker thread specifically? Post processing involves 3 steps:

```python
torch_tensor = torch_model.predict(...)

# step 1: convert torch output to FO format
fo_labels = some_conversion_fcn(torch_tensor)

# step 2: add labels to Sample
sample.add_labels(fo_labels, ...)

# step 3: save edits to DB
sample.save()
```

We can offload steps 2 and 3 to worker threads without any changes to the

Step 1 is inside

Don't be afraid of adding a new syntax to the
I've created FOEPD-2017 (Add post-processing hook to Model interface). Closing this until that work is complete.



What changes are proposed in this pull request?
This PR updates `_apply_image_model_data_loader` to perform label post-processing on the same background thread as the DB writes. To do so, it splits `_predict_all`'s logic into two sub-functions and makes them accessible via the functions `preprocess_and_forward_pass` and `postprocess` (better names welcome), then calls the latter on the background thread.
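Roughly, the split looks like the following sketch (placeholder internals; the real methods wrap the existing logic in `fiftyone/utils/torch.py` and `fiftyone/utils/transformers.py`):

```python
class ImageModel:
    """Toy stand-in for the torch/transformers image model classes."""

    def __init__(self, output_processor=None):
        self._output_processor = output_processor

    # public hooks added by this PR (names open to bikeshedding)
    def preprocess_and_forward_pass(self, imgs):
        return self._preprocess_and_forward_pass(imgs)

    def postprocess(self, output, image_sizes):
        return self._postprocess(output, image_sizes)

    # existing behavior, now expressed as the composition of the two stages
    def _predict_all(self, imgs, image_sizes=None):
        output = self._preprocess_and_forward_pass(imgs)
        return self._postprocess(output, image_sizes or [(None, None)])

    def _preprocess_and_forward_pass(self, imgs):
        return [x + 1 for x in imgs]  # placeholder for transforms + forward pass

    def _postprocess(self, output, image_sizes):
        if self._output_processor is None:
            return output
        return self._output_processor(output, image_sizes)
```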
How is this patch tested? If it is not, please explain why.

Ran the following script on a voxelcompute instance and profiled it with the built-in cProfile module.
With these changes, total running time of the script improved from 250s to 200s, almost 90% of which is spent in `_forward_pass`:

Release Notes
Is this a user-facing change that should be mentioned in the release notes?
notes for FiftyOne users.
What areas of FiftyOne does this PR affect?
`fiftyone` Python library changes

Summary by CodeRabbit
New Features
Refactor