
Conversation

@exupero
Contributor

@exupero exupero commented Aug 28, 2025

What changes are proposed in this pull request?

This PR updates _apply_image_model_data_loader to perform label post-processing on the same background thread as DB writes. To do so, it splits _predict_all's logic into two sub-functions and makes them accessible via the functions preprocess_and_forward_pass and postprocess (better names welcome), then calls the latter on the background thread.
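For orientation, here is a self-contained toy sketch of the control flow described above; the function bodies and names below are stand-ins for illustration, not actual FiftyOne code:

from concurrent.futures import ThreadPoolExecutor

# Stand-in for the GPU-bound work that stays on the main thread
def preprocess_and_forward_pass(imgs):
    return [img * 2 for img in imgs]  # pretend "raw" model output

# Stand-in for the CPU-bound label conversion
def postprocess(output, image_sizes):
    return [{"raw": o, "size": s} for o, s in zip(output, image_sizes)]

# Runs on the background thread: convert raw output to labels, then "write"
def save_batch(output, image_sizes):
    labels_batch = postprocess(output, image_sizes)
    print("saved", labels_batch)  # stand-in for the DB writes

batches = [
    ([1, 2], [(640, 480), (800, 600)]),
    ([3, 4], [(320, 240), (512, 512)]),
]

with ThreadPoolExecutor(max_workers=1) as executor:
    for imgs, image_sizes in batches:
        output = preprocess_and_forward_pass(imgs)        # main thread
        executor.submit(save_batch, output, image_sizes)  # background thread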

How is this patch tested? If it is not, please explain why.

Ran the following script on a voxelcompute instance and profiled it with the built-in cProfile module.

import sys

import fiftyone as fo
import fiftyone.zoo as foz

if __name__ == "__main__":
    suffix = sys.argv[1] if len(sys.argv) > 1 else "baseline"
    dataset = foz.load_zoo_dataset("coco-2017", split="validation")
    if not dataset.persistent:
        dataset.persistent = True
    classes = dataset.distinct("ground_truth.detections.label")
    model = foz.load_zoo_model("omdet-turbo-swin-tiny-torch", classes=classes)
    dataset.apply_model(
        model,
        num_workers=32,
        batch_size=16,
        confidence_thresh=0.1,
        label_field=f"profile_{suffix}",
        skip_failures=False,
    )

With these changes, the total running time of the script improved from 250s to 200s, almost 90% of which is spent in _forward_pass:
[image: postprocessing]

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features

    • Added support for post-processing model outputs into final labels, enabling more accurate and consistent results.
    • Post-processing now leverages image size metadata when available.
    • Confidence thresholds are applied during post-processing for cleaner predictions.
  • Refactor

    • Shifted to a two-stage prediction pipeline (raw output + post-processing) for both batch and asynchronous paths, improving reliability and consistency without changing public APIs.

@exupero exupero requested a review from a team as a code owner August 28, 2025 20:55
@coderabbitai
Contributor

coderabbitai bot commented Aug 28, 2025

Walkthrough

Refactors image model data loader prediction to a two-stage pipeline: compute raw outputs via model.predict_all, then apply a postprocessing hook (output_processor) using image sizes. Updates async submission and save_batch signatures to handle raw outputs plus image_sizes, and temporarily disables model._output_processor during the forward pass.

Changes

Cohort / File(s) Summary of Changes
Two-stage prediction and postprocessing pipeline
fiftyone/core/models.py
Replaces direct label prediction with: (1) forward pass to obtain raw output via predict_all; (2) postprocess via output_processor with image_sizes and confidence threshold. Updates async submit/save_batch signatures to pass output and image_sizes. Extracts fo_image_size from loader output with fallback. Temporarily unsets model._output_processor during inference.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant DataLoader as DataLoader
  participant Model as Model
  participant OutputProc as output_processor
  participant Saver as save_batch / submit

  User->>DataLoader: Iterate batches
  DataLoader->>Model: imgs (+ samples if needed)<br/>fo_image_size (optional)
  rect rgba(200,220,255,0.3)
    note right of Model: Temporarily disable model._output_processor
    Model-->>DataLoader: output = predict_all(imgs[, samples])
  end

  DataLoader->>OutputProc: output, image_sizes,<br/>confidence_thresh
  OutputProc-->>DataLoader: labels_batch

  alt synchronous save
    DataLoader->>Saver: save_batch(sample_batch, output, image_sizes)
    Saver->>OutputProc: Postprocess to labels
    Saver-->>User: Saved labels
  else async submit
    DataLoader->>Saver: submit(save_batch, sample_batch,<br/>output, image_sizes)
    Saver-->>User: Scheduled
  end

  note over OutputProc,Saver: image_sizes defaults to [(None, None)] if missing

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump with glee at the two-step hop,
First raw carrots, then a tidy chop—
Outputs gleaned, sizes in tow,
Postprocess whispers, labels now glow.
Async burrow queues the stash,
Batch by batch—we’re swift as a flash! 🥕🐇



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
fiftyone/utils/torch.py (2)

868-874: Add parameter documentation for the new public methods.

The new public methods lack docstrings. Consider adding documentation to clarify their purpose and parameters, especially image_sizes which appears to be a list of tuples.

Add docstrings like:

def preprocess_and_forward_pass(self, imgs):
    """Performs preprocessing and forward pass on the given images.
    
    Args:
        imgs: the batch of images to process
    
    Returns:
        the raw model output
    """
    return self._preprocess_and_forward_pass(imgs)

876-880: Add parameter documentation for postprocess method.

Similar to the above, the postprocess method would benefit from documentation.

Add docstring like:

def postprocess(self, output, image_sizes):
    """Postprocesses the raw model output.
    
    Args:
        output: the raw model output from preprocess_and_forward_pass
        image_sizes: a list of (width, height) tuples for each image
    
    Returns:
        the processed predictions
    """
    return self._postprocess(output, image_sizes)
fiftyone/core/models.py (1)

511-511: Consider validating the extracted image_sizes.

While the fallback to [(None, None)] provides backward compatibility, consider adding validation to ensure the extracted image_sizes list has the expected length matching the batch size.

Consider adding validation after line 511:

image_sizes = imgs.pop("fo_image_size", [(None, None)])
if image_sizes and len(image_sizes) != len(sample_batch):
    logger.warning(
        "Image sizes count (%d) doesn't match batch size (%d)",
        len(image_sizes), len(sample_batch)
    )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 2853b72 and adccc1e.

📒 Files selected for processing (3)
  • fiftyone/core/models.py (2 hunks)
  • fiftyone/utils/torch.py (1 hunks)
  • fiftyone/utils/transformers.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jacobsela
PR: voxel51/fiftyone#5761
File: fiftyone/utils/transformers.py:1288-1297
Timestamp: 2025-05-02T04:06:17.429Z
Learning: The `_get_image_size` function in transformers.py is intentionally kept consistent with the implementation in TorchImageModel, and should be refactored as part of a broader effort rather than in isolation.
🧬 Code graph analysis (3)
fiftyone/utils/torch.py (1)
fiftyone/utils/transformers.py (4)
  • _preprocess_and_forward_pass (657-666)
  • _preprocess_and_forward_pass (950-951)
  • _postprocess (668-676)
  • _postprocess (953-954)
fiftyone/core/models.py (2)
fiftyone/utils/torch.py (2)
  • postprocess (876-877)
  • preprocess_and_forward_pass (868-869)
fiftyone/core/utils.py (1)
  • submit (3174-3177)
fiftyone/utils/transformers.py (2)
fiftyone/utils/torch.py (10)
  • _preprocess_and_forward_pass (871-874)
  • preprocess (800-802)
  • preprocess (805-806)
  • collate_fn (777-797)
  • transforms (761-765)
  • device (814-816)
  • _forward_pass (951-952)
  • _postprocess (879-880)
  • _predict_all (390-400)
  • _predict_all (902-949)
fiftyone/utils/ultralytics.py (3)
  • collate_fn (444-453)
  • _forward_pass (529-531)
  • _predict_all (588-621)
🪛 Pylint (3.3.7)
fiftyone/utils/transformers.py

[error] 659-659: self.transforms is not callable

(E1102)


[refactor] 669-676: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

🔇 Additional comments (6)
fiftyone/utils/transformers.py (4)

657-666: LGTM! Clean refactoring of the prediction pipeline.

The new _preprocess_and_forward_pass method effectively consolidates preprocessing and forward pass logic into a single method, maintaining consistency with the two-stage pipeline architecture introduced across the codebase.


668-676: LGTM! Well-structured postprocessing method.

The _postprocess method cleanly handles both cases - with and without an output processor. The implementation is consistent with the interface defined in fiftyone/utils/torch.py.


678-681: LGTM! Proper orchestration of the three-stage pipeline.

The updated _predict_all method correctly implements the new pipeline: extract image sizes, run preprocessing/forward pass, then postprocess with sizes. The fallback to [(None, None)] ensures backward compatibility.


950-954: LGTM! Correct delegation pattern for zero-shot models.

The wrapper methods properly delegate to the parent class implementation, maintaining consistency across the transformer hierarchy.

fiftyone/core/models.py (2)

489-491: LGTM! Efficient asynchronous postprocessing implementation.

The refactored save_batch function correctly receives the raw output and image sizes, then performs postprocessing on the background thread as intended. This change successfully moves the label post-processing work off the main thread.


511-519: LGTM! Proper implementation of the two-stage pipeline.

The changes correctly extract image sizes, call preprocess_and_forward_pass to get raw output, and submit both to the async worker for postprocessing. The fallback to [(None, None)] ensures backward compatibility when fo_image_size is not provided.

@jacobsela
Contributor

@exupero Let's wait with this PR for a bit, I feel like the last one was rushed. This one is even more sensitive because of the model class changes.

Additionally, if we are already changing the model interface, let's move it closer to the final goal of the generic TorchModel. Doc for that is shared with you.

@jacobsela
Contributor

jacobsela commented Aug 29, 2025

I was thinking about the numbers for a bit and I think most of the time saved in this PR comes from savings on I/O time between the CPU and GPU.

When comparing with the mock PR I put up a few days ago, the diff in computation is as follows:

  1. post processing is threaded in this PR rather than on a different process.
  2. however the GPU-CPU I/O is on the main process in the other PR, while it's on a different thread in this one.

Those two points combined with the fact that the GPU/CPU I/O in the mock PR was ~50s, roughly the amount of time saved in this PR, lead me to believe that the above assertion is correct.

Now, I didn't offload the I/O in the other PR because I was getting a GPU memory leak (that I didn't want to deal with for the mockup) when doing the GPU-CPU I/O call in the other process, see the change in this line.

However, this leads me to believe that we could get the combined effects of both PRs by running the "actual" post processing on another process while keeping the I/O threaded. Assuming we can stack the effects, we would be looking at ~100s for this code (going off the mock PR profiling).

@kaixi-wang
Contributor

"actual" post processing be on another process while having the I/O be threaded

This is pretty much what I had shown in my POC. However, to offload to a separate process and get any performance benefit, you must not be working with samples. You need to directly extract the updates from the model outputs and save them, to avoid unnecessary requests caused by samples and database connections not being serializable.

@brimoor brimoor left a comment

If we wanted to provide a new syntax where postprocessing is not performed in the main process, then we'd need to support it in a backwards-compatible way in the Model class that would allow apply_model() to continue working for all existing Model instances.

I believe what you want is a similar idea to the existing Model.preprocess == False, which tells Model.transforms to be applied by the data loader and not in predict()/predict_all().

For example, we could have Model.postprocess == None by default, but if a Model provides a non-None Model.postprocess, then predict()/predict_all() is assumed to provide a "raw" output that must be passed to Model.postprocess(). And then that can be called in a separate worker inside apply_model().
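To make that idea concrete, a minimal sketch of how such a backwards-compatible hook could behave (the class names and wiring here are hypothetical illustrations, not the actual FiftyOne Model interface):

from concurrent.futures import ThreadPoolExecutor

class LegacyModel:
    postprocess = None  # default: predict_all() already returns final labels

    def predict_all(self, imgs):
        return [f"label-{i}" for i in imgs]

class RawOutputModel:
    def predict_all(self, imgs):
        return [i * 10 for i in imgs]  # "raw" output

    def postprocess(self, output):
        return [f"label-{o}" for o in output]  # raw output -> labels

def apply_model_sketch(model, imgs, executor):
    output = model.predict_all(imgs)
    if getattr(model, "postprocess", None) is not None:
        # raw output: convert to labels in a separate worker
        # (a real implementation would not block on the future immediately)
        return executor.submit(model.postprocess, output).result()
    return output  # existing behavior, unchanged for current models

with ThreadPoolExecutor(max_workers=1) as executor:
    print(apply_model_sketch(LegacyModel(), [1, 2], executor))
    print(apply_model_sketch(RawOutputModel(), [1, 2], executor))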

                if isinstance(imgs, Exception):
                    raise imgs

                image_sizes = imgs.pop("fo_image_size", [(None, None)])
Contributor

fo_image_size is unique to FiftyOneTransformer so it can't be referenced in fiftyone.core.models.

@exupero exupero changed the title FOEPD-1977: Move label post-processing to background thread [DON'T MERGE] FOEPD-1977: Move label post-processing to background thread Aug 29, 2025
@exupero
Contributor Author

exupero commented Aug 29, 2025

I should have noted that this is a proof of concept needing feedback for the proper way to apply it, so thanks for providing that feedback without my asking 💯.

@jacobsela can you provide benchmarks for reloading samples on a separate process? I spiked a branch that mimicked the serialization and deserialization logic in your #6293, but having to reload samples from the DB appeared to be significantly slower than sharing memory between threads (slower as in my test script appeared to hang but was actually just going through background tasks at 15-20s per task). I might have missed something in your proof of concept, so I can share my spike code if you're interested.

My expectation is that the speedup in this branch is due to unblocking the GPU. Currently predict_all blocks on GPU-bound execution, then blocks on CPU-bound execution, but those can be done concurrently, so moving CPU operations to a background thread allows the next batch of GPU operations to begin while the CPU is still transforming tensors into FO detection objects.

Does Python cProfile data capture GPU-CPU I/O? If so, what should I look for? Or are you using a different tool to benchmark GPU-CPU I/O?

@exupero
Contributor Author

exupero commented Aug 29, 2025

@jacobsela I see <method 'to' of 'torch._C.TensorBase' objects>, is that indicative of the GPU-CPU copying you're talking about?
[screenshot]

@jacobsela
Contributor

@exupero

I see <method 'to' of 'torch._C.TensorBase' objects>, is that indicative of the GPU-CPU copying you're talking about?

Yes. For the code in this PR, the CPU-to-GPU copy happens in preprocess_and_forward_pass (in the base code it's in _predict_all). That's what is shown in the figure you attached; the line is here. In this PR the CPU-to-GPU copy happens on the offloaded thread, while in the mock PR I sent it's on the main process. I made a mistake in my earlier assessment: not all of the ~50s saved in this PR comes from offloading this part, but rather roughly ~25s (just the GPU-to-CPU copy).

can you provide benchmarks for reloading samples on a separate process?

Sure, I can share something basic, I'll ping you when it's ready. In the meantime note the profiling in the mock PR. This isn't a bottleneck as far as I can tell in our toy benchmark.

I spiked a branch with that mimicked the serialization and deserialization logic in your #6293, but having to reload samples from the DB appeared to be significantly slower than sharing memory between threads (slower as in my test script appeared to hang but was actually just going through background tasks at 15-20s per task).

Were you serializing/deserializing every sample batch at every iteration? You'll see in the PR I sent that each worker gets a serialized SamplesCollection/DatasetView containing all of the samples of interest once at initialization. From that point on, the message each worker gets is of the form (batch_ids, labels). The worker is responsible for fetching the samples from the DB (by IDs), modifying them, and writing them back. (Also note that there are a ton more optimizations that can be added to the code I shared, it's just an example).
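For illustration, a rough sketch of that worker pattern (the label field name and initialization details below are simplified placeholders, not the code from the linked PR):

import fiftyone as fo

_dataset = None

def worker_init(dataset_name):
    # Each worker loads the dataset once at startup
    global _dataset
    _dataset = fo.load_dataset(dataset_name)

def worker_save(message):
    # Per-iteration messages are just (batch_ids, labels_batch)
    batch_ids, labels_batch = message
    id_to_labels = dict(zip(batch_ids, labels_batch))
    for sample in _dataset.select(batch_ids):
        sample["predictions"] = id_to_labels[sample.id]  # placeholder field
        sample.save()

# e.g. multiprocessing.Pool(initializer=worker_init, initargs=("my-dataset",))
# followed by pool.map(worker_save, messages)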

I believe the specific part of

fetching the samples from the DB (by IDs)

bothers @kaixi-wang about this approach. As far as I understand, the concern stems from the repeated COLLSCANs, which can be potentially slow, especially at the bottom of the collection. I'm not sure that in aggregate this is a problem just yet, but I have some ideas for things we can do to alleviate some of the problem if it turns out that it is. As a starting point, if either of you (or @brimoor) knows whether the default index on the id field is a hash index, that would be helpful information.
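One way to check is to inspect the collection's indexes directly with pymongo (a sketch; the connection settings and collection name below are placeholders):

from pymongo import MongoClient

client = MongoClient()  # placeholder connection
coll = client["fiftyone"]["samples.my-dataset"]  # placeholder collection name

# Each entry shows the index's key spec, e.g. [('_id', 1)] vs [('_id', 'hashed')]
for name, info in coll.index_information().items():
    print(name, info["key"])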

moving CPU operations to a background thread allows the next batch of GPU operations to begin while the CPU is still transforming tensors into FO detection objects.

That may be happening to an extent, but I think if this were really the big factor here, we would see a sharper drop in total time. Let's talk about the profiling in the multiprocessing mock PR when you have a chance.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
fiftyone/core/models.py (1)

519-524: Do not assume dict batches or reference transformer-specific keys; also fix fallback length

  • Calling imgs.pop(...) will raise on non-dict batches (e.g., tensors/lists).
  • Referencing "fo_image_size" in core couples to a Transformers-only detail and has been flagged as undesirable.
  • The fallback [(None, None)] length mismatches multi-sample batches.

Use a non-mutating, generic extraction with a robust fallback derived from sample metadata.

-                image_sizes = imgs.pop("fo_image_size", [(None, None)])
+                # Best-effort: use model-provided sizes if present; otherwise fall back to sample metadata
+                if isinstance(imgs, dict):
+                    image_sizes = imgs.get("fo_image_size")
+                else:
+                    image_sizes = None
+                if image_sizes is None:
+                    image_sizes = [
+                        (
+                            getattr(getattr(s, "metadata", None), "width", None),
+                            getattr(getattr(s, "metadata", None), "height", None),
+                        )
+                        for s in sample_batch
+                    ]

Optional (preferred, future-proof): add a private hook on models that need special handling and remove any direct reference to "fo_image_size" from core:

-                if isinstance(imgs, dict):
-                    image_sizes = imgs.get("fo_image_size")
+                sizes_extractor = getattr(model, "_get_image_sizes_from_batch", None)
+                image_sizes = sizes_extractor(imgs, sample_batch) if sizes_extractor else None

This aligns with prior guidance to keep image-size logic consistent with TorchImageModel and avoid hardcoding dataset keys.

🧹 Nitpick comments (2)
fiftyone/core/models.py (2)

493-499: Background-thread postprocessing looks good; verify thresholding isn’t applied twice

confidence_thresh is applied inside output_processor(...) and again via sample.add_labels(..., confidence_thresh=...). Ensure only one layer performs filtering to avoid unexpected drops near the boundary.

If postprocessing already filters, consider passing confidence_thresh=None to add_labels, or omitting the param there. I can draft a follow-up patch once you confirm the intended ownership of thresholding.


525-527: Avoid holding GPU tensors across the handoff to the background thread

If output contains GPU tensors, submitting them to the executor can keep GPU memory alive until postprocessing completes, increasing peak memory and contention. Consider moving tensors to CPU before submit or inside output_processor (non-blocking when pin_memory=True).

Example sketch (guarded, no hard dependency):

try:
    import torch
    def _to_cpu(x):
        if isinstance(x, torch.Tensor):
            return x.detach().cpu()
        if isinstance(x, (list, tuple)):
            return type(x)(_to_cpu(e) for e in x)
        if isinstance(x, dict):
            return {k: _to_cpu(v) for k, v in x.items()}
        return x
    output = _to_cpu(output)
except Exception:
    pass

If you prefer, we can implement this inside the specific model’s _output_processor to keep core generic.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between adccc1e and d0cf8c1.

📒 Files selected for processing (1)
  • fiftyone/core/models.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-02T04:06:17.429Z
Learnt from: jacobsela
PR: voxel51/fiftyone#5761
File: fiftyone/utils/transformers.py:1288-1297
Timestamp: 2025-05-02T04:06:17.429Z
Learning: The `_get_image_size` function in transformers.py is intentionally kept consistent with the implementation in TorchImageModel, and should be refactored as part of a broader effort rather than in isolation.

Applied to files:

  • fiftyone/core/models.py
🧬 Code graph analysis (1)
fiftyone/core/models.py (2)
fiftyone/core/utils.py (2)
  • SetAttributes (2718-2751)
  • submit (3174-3177)
fiftyone/utils/torch.py (1)
  • predict_all (868-886)

Comment on lines +488 to 492
        output_processor = getattr(
            model, "_output_processor", lambda output, *_, **__: output
        )
        context.enter_context(fou.SetAttributes(model, _output_processor=None))


🛠️ Refactor suggestion

Avoid mutating models that don’t define _output_processor; guard and fall back cleanly

Setting a private attribute on all models can change behavior inadvertently. Only disable _output_processor when it exists; otherwise use a no-op postprocessor.

-        output_processor = getattr(
-            model, "_output_processor", lambda output, *_, **__: output
-        )
-        context.enter_context(fou.SetAttributes(model, _output_processor=None))
+        output_processor = getattr(model, "_output_processor", None)
+        if output_processor is None:
+            def output_processor(output, *_, **__):
+                return output
+        else:
+            context.enter_context(
+                fou.SetAttributes(model, _output_processor=None)
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        output_processor = getattr(
-            model, "_output_processor", lambda output, *_, **__: output
-        )
-        context.enter_context(fou.SetAttributes(model, _output_processor=None))
+        output_processor = getattr(model, "_output_processor", None)
+        if output_processor is None:
+            def output_processor(output, *_, **__):
+                return output
+        else:
+            context.enter_context(
+                fou.SetAttributes(model, _output_processor=None)
+            )
🤖 Prompt for AI Agents
In fiftyone/core/models.py around lines 488 to 492, the code unconditionally
sets a private attribute _output_processor on every model which can mutate
models that don’t define it; change this so you only disable the processor when
it actually exists: check hasattr(model, "_output_processor") first, set
output_processor to model._output_processor if present else to a no-op lambda,
and only call context.enter_context(fou.SetAttributes(model,
_output_processor=None)) when the attribute exists so models without the
attribute are left untouched.

@kaixi-wang
Contributor

kaixi-wang commented Sep 2, 2025

bothers @kaixi-wang about this approach. As far as I understand the concern stems from the repeated COLLSCANs that can be potentially slow, especially at the bottom of the collection. I'm not sure that in aggregate this is a problem just yet, but I have some ideas for things we can do to alleviate some of the problem if it turns out that it is

The issues are not just the COLLSCAN alone. It's the unnecessary database lookups to resolve samples when all you actually need is the filepath and storage client to resolve media. It's overcomplicating everything in a way that impacts the ability to parallelize work.

@exupero
Contributor Author

exupero commented Sep 2, 2025

Here's a less intrusive take that mirrors the technique in #6293 of manipulating the model's _output_processor attribute to prevent post-prediction processing. It works, but I don't see that attribute as part of the model interface, so I'm assuming model classes aren't guaranteed to have it and this approach is a non-starter. Is there a more universal way to exclude post-processing from predict_all, or would doing that necessitate changing the model contract?

@jacobsela
Contributor

The issues are not just the COLLSCAN alone. it's the unnecessary database lookups to resolve samples when what you actually only need is filepath and storage client to resolve media. It's overcomplicating everything in a way that impacts the ability to parallelize work

Resolving filepaths and media is for loading and preprocessing. It's unrelated to this PR. Please limit discussion to the topic at hand.

@jacobsela
Contributor

jacobsela commented Sep 2, 2025

Here's a less intrusive take that mirrors the technique in #6293 of manipulating the model's _output_processor attribute to prevent post-prediction processing.

Ok this is a way better starting point. To address the points:

It works, but I don't see that attribute as part of the model interface, so I'm assuming model classes aren't guaranteed to have it and this approach is a non-starter.

Yeah, this isn't guaranteed across the board currently. It is a part of the TorchImageModel interface, see code here. Notice that the apply_model "sub function" you are working with (_apply_image_model_data_loader) only gets called for torch models or for SupportsGetItem (see the check here).

Is there a more universal way to exclude post-processing from predict_all, or would doing that necessitate changing the model contract?

First, note that any change we would do (even if limited to TorchImageModels and _apply_image_model_data_loader) would require some interface change (even if only to those classes), see @brimoor's point.

I strongly recommend implementing the generic TorchModel interface for this change (again see the "arbitrary model input" doc shared with you). This is because the alternative (written below) is not significantly less work.

At the very least, TorchImageModel's _predict_all needs to be changed so that if post processing happens outside of it, it delivers the correct payload. This would have cascading effects on TorchImageModel subclasses, and the change would likely also have to be reflected there.


-        def save_batch(sample_batch, labels_batch):
+        def save_batch(sample_batch, output, image_sizes):
+            labels_batch = output_processor(
Contributor

Is it actually more efficient to do this in a thread? How does the size of the output vectors compare to that of the final predicted objects?

Contributor Author

@exupero exupero Sep 3, 2025

It's not more efficient, it's more concurrent. Moving this to a background thread allows the output to be processed while the next batch's inputs are being processed and run. I.e.

(not to scale)

gantt
	title Async postprocessing
	dateFormat X
	axisFormat %s
	
	section Main thread
	Preprocess :active, preprocess, 0, 1
	Predict :active, predict, 1, 2
	Preprocess :active, preprocess, 2, 3
	Predict :active, predict, 3, 4
	
	section Background thread
	Postprocess :active, postprocess, 2, 3
	Write :active, write, 3, 4
	Postprocess :active, postprocess, 4, 5
	Write :active, write, 5, 6

With the test script, output is 4-5x larger (in bytes) than labels_batch, but since this is a thread, memory is shared, so it shouldn't incur any greater memory usage than the previous version.

@brimoor
Contributor

brimoor commented Sep 3, 2025

@exupero apologies if you've already mentioned this somewhere, but do we have data on the performance benefit of offloading Torch -> FO conversion to a worker thread specifically?

Post processing involves 3 steps:

torch_tensor = torch_model.predict(...)

# step 1: convert torch output to FO format
fo_labels = some_conversion_fcn(torch_tensor)

# step 2: add labels to Sample
sample.add_labels(fo_labels, ...)

# step 3: save edits to DB
sample.save()

We can offload steps 2 and 3 to worker threads without any changes to the Model interface. And it definitely makes sense to me that we'd get performance improvements for doing that.

Step 1 is inside Model.predict() though, so offloading that to worker threads would require a Model interface change. But my intuition is that step 1 ought to be the fastest of 1-3, and offloading only 2-3 should get us most of the performance improvement.
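For example, offloading only steps 2-3 could look roughly like this (a sketch that assumes samples, imgs, torch_model, and some_conversion_fcn are defined as in the snippet above, plus a hypothetical "predictions" label field):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def _save(sample, fo_labels):
    # steps 2-3, now on a worker thread; no Model interface change required
    sample.add_labels(fo_labels, label_field="predictions")
    sample.save()

for sample, img in zip(samples, imgs):
    torch_tensor = torch_model.predict(img)        # prediction on the main thread
    fo_labels = some_conversion_fcn(torch_tensor)  # step 1 also on the main thread
    executor.submit(_save, sample, fo_labels)      # steps 2-3 offloaded

executor.shutdown(wait=True)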

Don't be afraid of adding a new syntax to the Model interface! Something like what I described here is perfectly fine with me, so long as it indeed offers a significant performance boost. The only requirement is that the interface change is backwards compatible, which we can achieve here.

@exupero
Contributor Author

exupero commented Sep 3, 2025

Using the test script in the PR description, this flamegraph against develop shows more than half of _predict_all's time is spent in the output processor (121s out of 219s):
[screenshot]
However, the overall improvement of moving post-processing to the background thread only seems to be about 20% (note the green line, where run time reduced from 250s to 200s):
[image: postprocessing]
The flamegraph against this branch shows _forward_pass takes ~40% longer:
[screenshot]
I expect the longer execution time is due to GIL contention, so while moving post-processing to a background thread does make an overall improvement, further gains should be possible by moving to a background process (though that work is left for a separate PR).

@exupero
Contributor Author

exupero commented Sep 3, 2025

I've created FOEPD-2017 Add post-processing hook to Model interface. Closing this until that work is complete.

@exupero exupero closed this Sep 3, 2025
