
Conversation

@scotts (Contributor) commented Oct 27, 2025

Not ready for merging. I put up the PR to demonstrate and discuss what it will look like for TorchCodec to accept TorchVision transforms for decoder-native transforms.

In #526, I had initially proposed not using TorchVision transforms, and instead coming up with TorchCodec specific versions. @NicolasHug proposed that we accept TorchVision transforms, and that's what I followed up with in my design in #885.

Discussion points 1 and 2 (see inline comments) are awkward, but I actually think this is better than what I initially proposed. My reasons:

  1. We should be able to prevent TorchCodec from depending on TorchVision unless users want to specify decoder-native transforms. And in that case, it feels very natural to say we accept TorchVision transforms, so you have to install and import TorchVision to specify them. (Putting aside that just about everyone with TorchCodec installed very likely has TorchVision as well.)
  2. From the feedback we've gotten, I think folks are already thinking about these transforms in terms of what TorchVision provides.
  3. If we were to go with my original suggestion and create a TorchCodec-specific user specification API, we'd want to make sure that its semantics match those of TorchVision. That is, if we had torchcodec.transforms.Resize(height=x, width=y), we'd want to make sure its semantics matched torchvision.transforms.v2.Resize(size=(x,y)). In that specific example, we'd want to make sure that both default to bilinear interpolation. Extrapolating that specific example across all transforms we want to support, we'd basically be creating a mirror version of what TorchVision has. That seems silly, since it's more for users to understand and more for us to maintain.

Counter reasons that I think are outweighed by above:

  1. It is awkward for TorchCodec to have any dependence on TorchVision.
  2. We will not be able to maintain bit-for-bit compatibility with the TorchVision transforms in all (most?) cases. We can explain this in documentation, but there's still potential for some users to be surprised. Accepting the TorchVision transforms does potentially invite this confusion.
  3. We may have to introduce torchcodec.transforms for transforms that FFmpeg supports, TorchVision does not, and yet we still want to expose. In that case, users may be confused to look in that module and find only a subset of what is supported.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 27, 2025
  return "scale=" + std::to_string(outputDims_.width) + ":" +
         std::to_string(outputDims_.height) +
-        ":sws_flags=" + toFilterGraphInterpolation(interpolationMode_);
+        ":flags=" + toFilterGraphInterpolation(interpolationMode_);
@scotts (Contributor, Author) commented Oct 27, 2025

From the FFmpeg docs:

> Libavfilter will automatically insert scale filters where format conversion is required. It is possible to specify swscale flags for those automatically inserted scalers by prepending sws_flags=flags; to the filtergraph description.

Whereas `flags` is the specific option of the `scale` filter itself. The two end up being semantically equivalent, but it's clearer to use the `scale` option here.
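To make the distinction concrete, here's a minimal sketch (function names are illustrative, not TorchCodec's actual helpers) of the two equivalent ways to request an interpolation mode in an FFmpeg filtergraph string:

```python
# Sketch of the two semantically equivalent filtergraph spellings.
def scale_spec(width: int, height: int, interpolation: str = "bilinear") -> str:
    # "flags" is scale's own option, matching the C++ snippet above.
    return f"scale={width}:{height}:flags={interpolation}"

def graph_level_spec(width: int, height: int, interpolation: str = "bilinear") -> str:
    # "sws_flags=...;" prepended to the graph applies to the auto-inserted scalers.
    return f"sws_flags={interpolation};scale={width}:{height}"

print(scale_spec(640, 480))        # scale=640:480:flags=bilinear
print(graph_level_spec(640, 480))  # sws_flags=bilinear;scale=640:480
```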


assert frame_resize.shape == expected_shape
assert frame_ref.shape == expected_shape
assert_frames_equal(frame_resize, frame_ref)
@scotts (Contributor, Author) commented:

Currently fails. Still investigating.

dimension_order: Literal["NCHW", "NHWC"] = "NCHW",
num_ffmpeg_threads: int = 1,
device: Optional[Union[str, torch_device]] = "cpu",
transforms: List[Any] = [], # TRANSFORMS TODO: what is the user-facing type?
@scotts (Contributor, Author) commented:

Discussion point 1: If we accept TorchVision transforms, and we want to lazily load TorchVision, what type do we advertise here? We can easily explain that we accept a TorchVision transform in the docs, but what should we put in the type annotation?

@NicolasHug (Contributor) commented Oct 28, 2025

It should probably be either `Any` or `nn.Module`. The latter is the base class of all torchvision v2 transforms, and something users are familiar with, since it's the core building block of any PyTorch model.

@scotts (Contributor, Author) commented:

Oh, that solves the problem nicely: it can definitely be nn.Module.
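A possible shape for this (function and parameter names hypothetical, not the actual TorchCodec API): annotate with `nn.Module` as a string so torch is only needed by type checkers, and defer the torchvision import until a transform is actually supplied.

```python
from __future__ import annotations
from typing import TYPE_CHECKING, List, Optional

if TYPE_CHECKING:
    from torch import nn  # evaluated only by type checkers, not at runtime

def collect_decoder_transforms(
    transforms: Optional[List["nn.Module"]] = None,
) -> list:
    # Hypothetical helper: torchvision stays unimported until a transform
    # is actually passed in, keeping it an optional dependency.
    transforms = list(transforms or [])
    if transforms:
        import torchvision.transforms.v2  # heavy dependency, loaded lazily
    return transforms
```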

else:
raise ValueError(f"Unsupported transform {transform}.")
return ";".join(transform_specs)

@scotts (Contributor, Author) commented Oct 27, 2025

Discussion point 2: This is what we'll have to do with TorchVision transforms at the moment. We'll need special handling for each transform, looking into its internals to get what we need and enforce decoder-native limitations.

In the future, we can change TorchVision transforms to have an API so that we can get what we need in a generic way. But for now, we'll need to do something like this.
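A minimal sketch of that per-transform special handling (the helper name is hypothetical): dispatch on the transform's type, reach into its internals, and emit the matching FFmpeg filter spec.

```python
# Sketch only: one branch per supported TorchVision transform.
def transform_to_filter_spec(transform) -> str:
    name = type(transform).__name__
    if name == "Resize":
        # Relies on torchvision.transforms.v2.Resize storing (height, width)
        # in its .size attribute -- exactly the internal coupling noted above.
        height, width = transform.size
        return f"scale={width}:{height}"
    raise ValueError(f"Unsupported transform {transform}.")
```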

@NicolasHug (Contributor) commented:

I'm still undecided on whether we should accept TV transforms or not (ironic, I know), but I think this is totally OK.

And I think we'll need that level of coupling anyway, even if we were to write our own TC transforms. Echoing what you wrote:

> If we were to [...] create a TorchCodec-specific user specification API, we'd want to make sure that its semantics match those of TorchVision. That is, if we had torchcodec.transforms.Resize(height=x, width=y), we'd want to make sure its semantics matched torchvision.transforms.v2.Resize(size=(x,y)). In that specific example, we'd want to make sure that both default to bilinear interpolation. Extrapolating that specific example across all transforms we want to support, we'd basically be creating a mirror version of what TorchVision has. That seems silly, since it's more for users to understand and more for us to maintain.

Basically, that coupling between TC and TV will have to exist either in the code (as in this PR), or in our heads as API designers.


Side note, slightly related: if we're going to have our own TC transforms, I think we'll want their API to exactly match (or be a strict subset of) the TV transforms. E.g. we'd have torchcodec.transforms.Resize(size=...) instead of torchcodec.transforms.Resize(height=..., width=...) ?

@scotts (Contributor, Author) commented Oct 28, 2025

@NicolasHug, I came to the same conclusion as:

> Side note, slightly related: if we're going to have our own TC transforms, I think we'll want their API to exactly match (or be a strict subset of) the TV transforms. E.g. we'd have torchcodec.transforms.Resize(size=...) instead of torchcodec.transforms.Resize(height=..., width=...) ?

At which point, I don't think we've really gained anything by having them separate. And users will probably also start asking: hey, can you just accept the TorchVision ones? I also just realized a new counter-point, which I'll add to the summary as counter-point 3.
