
Conversation

@scotts (Contributor) commented Oct 27, 2025

Not ready for merging. I put up the PR to demonstrate and discuss what it will look like for TorchCodec to accept TorchVision transforms for decoder-native transforms.

In #526, I had initially proposed not using TorchVision transforms, and instead coming up with TorchCodec specific versions. @NicolasHug proposed that we accept TorchVision transforms, and that's what I followed up with in my design in #885.

Discussion points 1 and 2 (see inline comments) are awkward, but I actually think this is better than what I initially proposed. My reasons:

  1. We should be able to prevent TorchCodec from depending on TorchVision unless users want to specify decoder-native transforms. And in that case, it feels very natural to say we accept TorchVision transforms, so you have to install and import TorchVision to specify them. (Putting aside that just about everyone with TorchCodec installed very likely has TorchVision as well.)
  2. From the feedback we've gotten, I think folks are already thinking about these transforms in terms of what TorchVision provides.
  3. If we were to go with my original suggestion and create a TorchCodec-specific user specification API, we'd want to make sure that its semantics match those of TorchVision. That is, if we had torchcodec.transforms.Resize(height=x, width=y), we'd want to make sure its semantics matched torchvision.transforms.v2.Resize(size=(x,y)). In that specific example, we'd want to make sure that both default to bilinear interpolation. Extrapolating that specific example across all transforms we want to support, we'd basically be creating a mirror version of what TorchVision has. That seems silly, since it's more for users to understand and more for us to maintain.

Counter reasons that I think are outweighed by above:

  1. It is awkward for TorchCodec to have any dependence on TorchVision.
  2. We will not be able to maintain bit-for-bit compatibility with the TorchVision transforms in all (most?) cases. We can explain this in documentation, but there's still potential for some users to be surprised. Accepting the TorchVision transforms does potentially invite this confusion.
  3. We may have to introduce torchcodec.transforms for transforms that FFmpeg supports, TorchVision does not, and yet we still want to expose. In that case, users may be confused to look in that module and find only a subset of what is supported.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 27, 2025
  return "scale=" + std::to_string(outputDims_.width) + ":" +
         std::to_string(outputDims_.height) +
-        ":sws_flags=" + toFilterGraphInterpolation(interpolationMode_);
+        ":flags=" + toFilterGraphInterpolation(interpolationMode_);
@scotts (Contributor, Author) commented Oct 27, 2025

From the FFmpeg docs:

> Libavfilter will automatically insert scale filters where format conversion is required. It is possible to specify swscale flags for those automatically inserted scalers by prepending sws_flags=flags; to the filtergraph description.

Whereas `flags` is the specific option of the `scale` filter itself. The two end up being semantically equivalent, but it's clearer to use the `scale` option here.
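To make the distinction concrete, here's a minimal sketch (function names are illustrative, not TorchCodec's actual helpers) of the two equivalent ways to request an interpolation mode in an FFmpeg filtergraph string:

```python
# Sketch of the two semantically equivalent filtergraph spellings.
def scale_spec(width: int, height: int, interpolation: str = "bilinear") -> str:
    # "flags" is scale's own option, matching the C++ snippet above.
    return f"scale={width}:{height}:flags={interpolation}"

def graph_level_spec(width: int, height: int, interpolation: str = "bilinear") -> str:
    # "sws_flags=...;" prepended to the graph applies to the auto-inserted scalers.
    return f"sws_flags={interpolation};scale={width}:{height}"

print(scale_spec(640, 480))        # scale=640:480:flags=bilinear
print(graph_level_spec(640, 480))  # sws_flags=bilinear;scale=640:480
```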


assert frame_resize.shape == expected_shape
assert frame_ref.shape == expected_shape
assert_frames_equal(frame_resize, frame_ref)
@scotts (Contributor, Author) commented:

Currently fails. Still investigating.

dimension_order: Literal["NCHW", "NHWC"] = "NCHW",
num_ffmpeg_threads: int = 1,
device: Optional[Union[str, torch_device]] = "cpu",
transforms: List[Any] = [], # TRANSFORMS TODO: what is the user-facing type?
@scotts (Contributor, Author) commented:

Discussion point 1: If we accept TorchVision transforms, and we want to lazily load TorchVision, what type do we advertise here? We can easily explain that we accept a TorchVision transform in the docs, but what should we put in the type annotation?

@NicolasHug (Contributor) commented Oct 28, 2025

It should probably be either `Any` or `nn.Module`. The latter is the base class of all torchvision v2 transforms, and something users are familiar with, since it's the core building block of any PyTorch model.

@scotts (Contributor, Author) commented:

Oh, that solves the problem nicely: it can definitely be nn.Module.
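A possible shape for this (function and parameter names hypothetical, not the actual TorchCodec API): annotate with `nn.Module` as a string so torch is only needed by type checkers, and defer the torchvision import until a transform is actually supplied.

```python
from __future__ import annotations
from typing import TYPE_CHECKING, List, Optional

if TYPE_CHECKING:
    from torch import nn  # evaluated only by type checkers, not at runtime

def collect_decoder_transforms(
    transforms: Optional[List["nn.Module"]] = None,
) -> list:
    # Hypothetical helper: torchvision stays unimported until a transform
    # is actually passed in, keeping it an optional dependency.
    transforms = list(transforms or [])
    if transforms:
        import torchvision.transforms.v2  # heavy dependency, loaded lazily
    return transforms
```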

else:
raise ValueError(f"Unsupported transform {transform}.")
return ";".join(transform_specs)

@scotts (Contributor, Author) commented Oct 27, 2025

Discussion point 2: This is what we'll have to do with TorchVision transforms at the moment. We'll need special handling for each transform, looking into its internals to get what we need and enforce decoder-native limitations.

In the future, we can change TorchVision transforms to have an API so that we can get what we need in a generic way. But for now, we'll need to do something like this.
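A minimal sketch of that per-transform special handling (the helper name is hypothetical): dispatch on the transform's type, reach into its internals, and emit the matching FFmpeg filter spec.

```python
# Sketch only: one branch per supported TorchVision transform.
def transform_to_filter_spec(transform) -> str:
    name = type(transform).__name__
    if name == "Resize":
        # Relies on torchvision.transforms.v2.Resize storing (height, width)
        # in its .size attribute -- exactly the internal coupling noted above.
        height, width = transform.size
        return f"scale={width}:{height}"
    raise ValueError(f"Unsupported transform {transform}.")
```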

@NicolasHug (Contributor) commented:

I'm still undecided on whether we should accept TV transforms or not (ironic, I know), but I think this is totally OK.

And I think we'll need that level of coupling anyway, even if we were to write our own TC transforms. Echoing what you wrote:

> If we were to [...] create a TorchCodec-specific user specification API, we'd want to make sure that its semantics match those of TorchVision. That is, if we had torchcodec.transforms.Resize(height=x, width=y), we'd want to make sure its semantics matched torchvision.transforms.v2.Resize(size=(x,y)). In that specific example, we'd want to make sure that both default to bilinear interpolation. Extrapolating that specific example across all transforms we want to support, we'd basically be creating a mirror version of what TorchVision has. That seems silly, since it's more for users to understand and more for us to maintain.

Basically, that coupling between TC and TV will have to exist either in the code (as in this PR), or in our heads as API designers.


Side note, slightly related: if we're going to have our own TC transforms, I think we'll want their API to exactly match (or be a strict subset of) the TV transforms. E.g. we'd have torchcodec.transforms.Resize(size=...) instead of torchcodec.transforms.Resize(height=..., width=...) ?

@scotts (Contributor, Author) commented Oct 28, 2025

@NicolasHug, I came to the same conclusion as:

> Side note, slightly related: if we're going to have our own TC transforms, I think we'll want their API to exactly match (or be a strict subset of) the TV transforms. E.g. we'd have torchcodec.transforms.Resize(size=...) instead of torchcodec.transforms.Resize(height=..., width=...) ?

At which point, I don't think we've really gained anything by having them separate. And users will probably also start asking: hey, can you just accept the TorchVision ones? I also just realized a new counter-point, which I'll add to the summary as counter-point 3.
