Decoder-native resize public implementation #1003
Conversation
| return "scale=" + std::to_string(outputDims_.width) + ":" + | ||
| std::to_string(outputDims_.height) + | ||
| ":sws_flags=" + toFilterGraphInterpolation(interpolationMode_); | ||
| ":flags=" + toFilterGraphInterpolation(interpolationMode_); |
From the FFmpeg docs:

> Libavfilter will automatically insert `scale` filters where format conversion is required. It is possible to specify swscale flags for those automatically inserted scalers by prepending `sws_flags=flags;` to the filtergraph description.

Whereas `flags` is the option specific to the `scale` filter itself. They end up being semantically equivalent, but it's clearer to use the `scale` option here.
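To make the two spellings concrete, here's an illustrative sketch (filter values are placeholders) of the graph-wide `sws_flags` prefix versus the per-filter `flags` option used in this PR:

```python
# Illustrative only: two semantically equivalent ways to request
# bilinear scaling, per the FFmpeg docs quoted above.
width, height = 640, 480

# Option used in this PR: `flags` is an option of the scale filter itself.
per_filter = f"scale={width}:{height}:flags=bilinear"

# Alternative: `sws_flags=...;` prepended to the whole filtergraph sets
# swscale flags for automatically inserted scalers.
graph_wide = f"sws_flags=bilinear;scale={width}:{height}"
```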
```python
assert frame_resize.shape == expected_shape
assert frame_ref.shape == expected_shape
assert_frames_equal(frame_resize, frame_ref)
```
Currently fails. Still investigating.
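For context, a hedged reconstruction of the comparison being made; the file name, shape, and the `transforms` parameter are assumptions based on this PR, and `assert_frames_equal` is the repo's test helper:

```python
from torchvision.transforms import v2
from torchcodec.decoders import VideoDecoder

expected_shape = (3, 240, 320)  # (C, H, W); placeholder values

# Decoder-native resize, via the transforms parameter proposed in this PR.
frame_resize = VideoDecoder(
    "video.mp4", transforms=[v2.Resize(size=(240, 320))]
)[0]

# Reference: decode at full resolution, then resize with torchvision.
frame_ref = v2.Resize(size=(240, 320))(VideoDecoder("video.mp4")[0])

assert frame_resize.shape == expected_shape
assert frame_ref.shape == expected_shape
```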
```python
dimension_order: Literal["NCHW", "NHWC"] = "NCHW",
num_ffmpeg_threads: int = 1,
device: Optional[Union[str, torch_device]] = "cpu",
transforms: List[Any] = [],  # TRANSFORMS TODO: what is the user-facing type?
```
Discussion point 1: If we accept TorchVision transforms, and we want to lazily load TorchVision, what type do we advertise here? We can easily explain that we accept a TorchVision transform in the docs, but what should we put in the type annotation?
It should probably be either `Any` or `nn.Module`, which is the base class of all torchvision v2 transforms, and something users are familiar with since it is the core building block of any PyTorch model.
Oh, that solves the problem nicely: it can definitely be `nn.Module`.
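A sketch of where that lands, based on the signature from the diff above; the `Optional[List[nn.Module]]` with a `None` default (instead of a mutable `[]`) is my own adjustment:

```python
from typing import List, Literal, Optional, Union

from torch import device as torch_device
from torch import nn  # base class of all torchvision v2 transforms

class VideoDecoder:
    def __init__(
        self,
        *,
        dimension_order: Literal["NCHW", "NHWC"] = "NCHW",
        num_ffmpeg_threads: int = 1,
        device: Optional[Union[str, torch_device]] = "cpu",
        # No torchvision import needed just for the annotation:
        transforms: Optional[List[nn.Module]] = None,
    ):
        self._transforms = transforms if transforms is not None else []
```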
```python
else:
    raise ValueError(f"Unsupported transform {transform}.")
return ";".join(transform_specs)
```
Discussion point 2: This is what we'll have to do with TorchVision transforms at the moment. We'll need special handling for each transform, looking into its internals to get what we need and enforce decoder-native limitations.
In the future, we can change TorchVision transforms to have an API so that we can get what we need in a generic way. But for now, we'll need to do something like this.
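A minimal sketch of that per-transform special handling, assuming only `Resize` is supported; peeking at `transform.size` relies on torchvision v2 internals, and the bilinear default mirrors torchvision's:

```python
from typing import List

from torch import nn
from torchvision.transforms import v2

def _make_transform_specs(transforms: List[nn.Module]) -> str:
    transform_specs = []
    for transform in transforms:
        if isinstance(transform, v2.Resize):
            # Enforce a decoder-native limitation: we need an explicit
            # (height, width), not a single-int "shorter side" size.
            if len(transform.size) != 2:
                raise ValueError("Resize must have an explicit (height, width).")
            height, width = transform.size
            transform_specs.append(f"scale={width}:{height}:flags=bilinear")
        else:
            raise ValueError(f"Unsupported transform {transform}.")
    return ";".join(transform_specs)
```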
I'm still undecided on whether we should accept TV transforms or not (ironic, I know), but I think this is totally OK.
And I think we'll need that level of coupling anyway, even if we were to write our own TC transforms. Echoing what you wrote:
> If we were to [...] create a TorchCodec-specific user specification API, we'd want to make sure that its semantics match that of TorchVision. That is, if we had `torchcodec.transforms.Resize(height=x, width=y)`, we'd want to make sure its semantics matched `torchvision.transforms.v2.Resize(size=(x, y))`. In that specific example, we'd want to make sure that both default to bilinear interpolation. Extrapolating that specific example across all transforms we want to support, we'd basically be creating a mirror version of what TorchVision has. That seems silly, since it's more for users to understand and more for us to maintain.
Basically, that coupling between TC and TV will have to exist either in the code (as in this PR), or in our heads as API designers.
Side note, slightly related: if we're going to have our own TC transforms, I think we'll want their API to exactly match (or be a strict subset of) the TV transforms. E.g. we'd have `torchcodec.transforms.Resize(size=...)` instead of `torchcodec.transforms.Resize(height=..., width=...)`?
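For instance, the bilinear default mentioned above is observable directly on the torchvision transform, and is exactly what any mirrored TorchCodec transform would have to reproduce:

```python
from torchvision.transforms import v2

# torchvision's Resize defaults to bilinear interpolation.
resize = v2.Resize(size=(480, 640))
print(resize.interpolation)  # InterpolationMode.BILINEAR
```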
@NicolasHug, I came to the same conclusion as:

> Side note, slightly related: if we're going to have our own TC transforms, I think we'll want their API to exactly match (or be a strict subset of) the TV transforms. E.g. we'd have `torchcodec.transforms.Resize(size=...)` instead of `torchcodec.transforms.Resize(height=..., width=...)`?
At which point, I don't think we've really gained anything by having them separate. And users will probably also start asking, hey, can you just accept the TorchVision ones? I also just realized a new counter-point, which I'll put up in the summary as counter point 3.
Not ready for merging. I put up the PR to demonstrate and discuss what it will look like for TorchCodec to accept TorchVision transforms for decoder-native transforms.
In #526, I had initially proposed not using TorchVision transforms, and instead coming up with TorchCodec specific versions. @NicolasHug proposed that we accept TorchVision transforms, and that's what I followed up with in my design in #885.
Discussion points 1 and 2 (see inline comments) are awkward, but I actually think this is better than what I initially proposed. My reasons:
- If we were to create a TorchCodec-specific user specification API, we'd want to make sure that its semantics match that of TorchVision. That is, if we had `torchcodec.transforms.Resize(height=x, width=y)`, we'd want to make sure its semantics matched `torchvision.transforms.v2.Resize(size=(x, y))`. In that specific example, we'd want to make sure that both default to bilinear interpolation. Extrapolating that specific example across all transforms we want to support, we'd basically be creating a mirror version of what TorchVision has. That seems silly, since it's more for users to understand and more for us to maintain.

Counter reasons that I think are outweighed by the above:

- We may still want `torchcodec.transforms` for transforms that FFmpeg supports, TorchVision does not, and yet we still want to expose. In that case, users may be confused to look in that module and find only a subset of what is supported.
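For reference, a hedged sketch of the user-facing flow this PR demonstrates; the `transforms` parameter is the new, still-under-discussion piece, and the file name is a placeholder:

```python
from torchvision.transforms import v2
from torchcodec.decoders import VideoDecoder

# FFmpeg applies the resize during decoding, so frames come out at the
# target resolution without a separate transform pass.
decoder = VideoDecoder("video.mp4", transforms=[v2.Resize(size=(480, 640))])
frame = decoder[0]
print(frame.shape)  # torch.Size([3, 480, 640]) for a single (C, H, W) frame
```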