2 changes: 1 addition & 1 deletion .github/workflows/lint.yaml
@@ -62,7 +62,7 @@ jobs:
run: python -m pip install --upgrade pip
- name: Install dependencies and FFmpeg
run: |
-python -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
+python -m pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
conda install "ffmpeg=7.0.1" pkg-config pybind11 -c conda-forge
ffmpeg -version
- name: Build and install torchcodec
17 changes: 17 additions & 0 deletions docs/source/api_ref_transforms.rst
@@ -0,0 +1,17 @@
.. _samplers:
Contributor:

Suggested change:
-.. _samplers:
+.. _transforms:


===================
torchcodec.transforms
===================
Comment on lines +3 to +5

Contributor:

We're getting warnings when the lengths don't match.

Suggested change:
-===================
-torchcodec.transforms
-===================
+=====================
+torchcodec.transforms
+=====================


.. currentmodule:: torchcodec.transforms

For a tutorial, see: TODO_DECODER_TRANSFORMS_TUTORIAL.

.. autosummary::
:toctree: generated/
:nosignatures:
:template: dataclass.rst

DecoderTransform
Resize
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -125,3 +125,4 @@ Encoding
api_ref_decoders
api_ref_encoders
api_ref_samplers
api_ref_transforms
1 change: 1 addition & 0 deletions mypy.ini
@@ -4,3 +4,4 @@ files = src/torchcodec
show_error_codes = True
pretty = True
allow_redefinition = True
follow_untyped_imports = True

2 changes: 1 addition & 1 deletion src/torchcodec/__init__.py
@@ -7,7 +7,7 @@
# Note: usort wants to put Frame and FrameBatch after decoders and samplers,
# but that results in circular import.
from ._frame import AudioSamples, Frame, FrameBatch # usort:skip # noqa
-from . import decoders, encoders, samplers # noqa
+from . import decoders, encoders, samplers, transforms # noqa

try:
# Note that version.py is generated during install.
88 changes: 86 additions & 2 deletions src/torchcodec/decoders/_video_decoder.py
@@ -8,17 +8,18 @@
import json
import numbers
from pathlib import Path
-from typing import Literal, Optional, Tuple, Union
+from typing import List, Literal, Optional, Sequence, Tuple, Union

import torch
-from torch import device as torch_device, Tensor
+from torch import device as torch_device, nn, Tensor

from torchcodec import _core as core, Frame, FrameBatch
from torchcodec.decoders._decoder_utils import (
_get_cuda_backend,
create_decoder,
ERROR_REPORTING_INSTRUCTIONS,
)
from torchcodec.transforms import DecoderTransform, Resize


class VideoDecoder:
@@ -66,6 +67,13 @@ class VideoDecoder:
probably is. Default: "exact".
Read more about this parameter in:
:ref:`sphx_glr_generated_examples_decoding_approximate_mode.py`
transforms (sequence of transform objects, optional): Sequence of transforms to be
applied to the decoded frames by the decoder itself, in order. Accepts both
:class:`~torchcodec.transforms.DecoderTransform` and
`torchvision.transforms.v2.Transform <https://docs.pytorch.org/vision/stable/transforms.html#v2-api-reference-recommended>`_
objects. All transforms are applied
Contributor Author:

This and all other references to TorchVision transforms use hard links. I don't think we can get proper Sphinx references when it's in a different project.

Contributor:

We can, we'll just need to add a torchvision entry here:

intersphinx_mapping = {
"python": ("https://docs.python.org/3/", None),
"torch": ("https://pytorch.org/docs/stable/", None),
"numpy": ("https://numpy.org/doc/stable/", None),
"PIL": ("https://pillow.readthedocs.io/en/stable/", None),
"matplotlib": ("https://matplotlib.org/stable/", None),
}

Feel free to leave that as follow-up / open an issue.
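A sketch of that follow-up (the "torchvision" entry is an assumption, reusing the stable docs URL this PR already hard-links to):

intersphinx_mapping = {
    "python": ("https://docs.python.org/3/", None),
    "torch": ("https://pytorch.org/docs/stable/", None),
    "numpy": ("https://numpy.org/doc/stable/", None),
    "PIL": ("https://pillow.readthedocs.io/en/stable/", None),
    "matplotlib": ("https://matplotlib.org/stable/", None),
    # Assumed new entry; cross-references like
    # :class:`torchvision.transforms.v2.Resize` would then resolve.
    "torchvision": ("https://docs.pytorch.org/vision/stable/", None),
}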

in the output pixel format and colorspace. Read more about this parameter in:
Contributor:

Following up on #1003 (comment)

> Do we want to document that we consider passing untransformed frames to TorchVision transforms as our reference? I think we do, because I think that's implied by accepting the TorchVision transforms, and it's an easy way to explain the feature to users.

Agreed, we should document and claim that TV is our ref. I think we have slightly different understandings of what we mean by "TV is our ref"; your definition is slightly stricter than mine (see below).

> Is when the transform is applied useful to users? I thought it was, but if it's of little value, we could potentially just not talk about it.

I don't think it adds a lot of value to document, as I don't know if that's a question users are even asking themselves. But I could be wrong and I don't feel strongly about it. What concerns me slightly more about the comment is that it reads like a contract, and I suspect we may want to relax that behavior in the future. E.g. for crop, we might want to apply it in YUV space instead of RGB if it's faster and if models can't notice the difference.

To me, when we say "TV is our ref", it means "this transform has the same behavior as the TV transform as far as models are concerned". It's not strictly about bitwise equality (we'll never have that). It's only about whether the models can tell the difference. We know they can tell the difference for resize's interpolation mode. But if they can't tell the difference for (e.g.) crop being applied before or after color-conversion, I think we could allow ourselves to make that change of behavior. That allows us more freedom to potentially enable higher perf gains in the future.

None of my comments above are blocking. We can go ahead as-is. I'm happy that for once, I am not the one insisting on strictness :D

TODO_DECODER_TRANSFORMS_TUTORIAL.
custom_frame_mappings (str, bytes, or file-like object, optional):
Mapping of frames to their metadata, typically generated via ffprobe.
This enables accurate frame seeking without requiring a full video scan.
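A minimal usage sketch of the new transforms parameter (assuming a local "video.mp4" and TorchVision installed; both transform flavors are taken from this diff):

from torchvision.transforms import v2

from torchcodec.decoders import VideoDecoder
from torchcodec.transforms import Resize

# Native DecoderTransform objects and TorchVision v2 transforms both work;
# the resize happens inside the decoder, before frames are returned.
decoder = VideoDecoder("video.mp4", transforms=[Resize(size=(135, 240))])
decoder = VideoDecoder("video.mp4", transforms=[v2.Resize(size=(135, 240))])
frame = decoder.get_frame_at(0)  # frame.data is already 135x240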
@@ -104,6 +112,7 @@ def __init__(
num_ffmpeg_threads: int = 1,
device: Optional[Union[str, torch_device]] = "cpu",
seek_mode: Literal["exact", "approximate"] = "exact",
transforms: Optional[Sequence[Union[DecoderTransform, nn.Module]]] = None,
custom_frame_mappings: Optional[
Union[str, bytes, io.RawIOBase, io.BufferedReader]
] = None,
@@ -148,13 +157,16 @@ def __init__(

device_variant = _get_cuda_backend()

transform_specs = _make_transform_specs(transforms)

core.add_video_stream(
self._decoder,
stream_index=stream_index,
dimension_order=dimension_order,
num_threads=num_ffmpeg_threads,
device=device,
device_variant=device_variant,
transform_specs=transform_specs,
custom_frame_mappings=custom_frame_mappings_data,
)

@@ -432,6 +444,78 @@ def _get_and_validate_stream_metadata(
)


def _convert_to_decoder_transforms(
transforms: Sequence[Union[DecoderTransform, nn.Module]],
) -> List[DecoderTransform]:
"""Convert a sequence of transforms that may contain TorchVision transform
objects into a list of only TorchCodec transform objects.

Args:
transforms: Sequence of transform objects. The objects can be one of two
types:
1. torchcodec.transforms.DecoderTransform
2. torchvision.transforms.v2.Transform, but our type annotation
only mentions its base, nn.Module. We don't want to take a
hard dependency on TorchVision.

Returns:
List of DecoderTransform objects.
"""
try:
from torchvision.transforms import v2

tv_available = True
except ImportError:
tv_available = False

converted_transforms: list[DecoderTransform] = []
for transform in transforms:
if not isinstance(transform, DecoderTransform):
if not tv_available:
raise ValueError(
f"The supplied transform, {transform}, is not a TorchCodec "
" DecoderTransform. TorchCodec also accept TorchVision "
"v2 transforms, but TorchVision is not installed."
)
if isinstance(transform, v2.Resize):
Contributor (NicolasHug, Nov 10, 2025):

I think this fails if tv_available is False? Because v2 wouldn't exist

EDIT ah no that's probably fine because of the if not tv_available: check above.

Contributor:

Makes me think we should have a dummy job where we don't install TV, ensuring TC still works fine...

Contributor Author:

On a job which doesn't have TorchVision installed: I agree we need to do something here, but I'd like to punt on this for now. The current testing file imports TorchVision unconditionally. I think we'll want to separate out the tests that require TorchVision from those that don't so that we can test both behaviors, but that will require different .py files. I'd like to deal with that in its own PR.

I had actually started to add a step to the current Linux wheel test that did not install TorchVision when I realized this.

Contributor:

Yes, we can punt on this. I'm hoping we can do something very simple regarding testing: keep all but one test job using torchvision, and just have one small CI job that doesn't install TV and just runs a few tests, basically just ensuring TV is an optional dependency. I'd like to avoid separating tests into different files just for that - we may have more than one optional dependency and that quickly becomes intractable.

Contributor:

Nit: I think I would have been less surprised by v2 being actually optional if this were elif.

Suggested change:
-if isinstance(transform, v2.Resize):
+elif isinstance(transform, v2.Resize):

converted_transforms.append(Resize._from_torchvision(transform))
else:
raise ValueError(
f"Unsupported transform: {transform}. Transforms must be "
"either a TorchCodec DecoderTransform or a TorchVision "
"v2 transform."
)
else:
converted_transforms.append(transform)

return converted_transforms


def _make_transform_specs(
transforms: Optional[Sequence[Union[DecoderTransform, nn.Module]]],
) -> str:
"""Given a sequence of transforms, turn those into the specification string
the core API expects.

Args:
transforms: Optional sequence of transform objects. The objects can be
one of two types:
1. torchcodec.transforms.DecoderTransform
2. torchvision.transforms.v2.Transform, but our type annotation
only mentions its base, nn.Module. We don't want to take a
hard dependency on TorchVision.

Returns:
String of transforms in the format the core API expects: transform
specifications separated by semicolons.
"""
if transforms is None:
return ""

transforms = _convert_to_decoder_transforms(transforms)
return ";".join([t._make_params() for t in transforms])


def _read_custom_frame_mappings(
custom_frame_mappings: Union[str, bytes, io.RawIOBase, io.BufferedReader]
) -> tuple[Tensor, Tensor, Tensor]:
7 changes: 7 additions & 0 deletions src/torchcodec/transforms/__init__.py
@@ -0,0 +1,7 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from ._decoder_transforms import DecoderTransform, Resize # noqa
85 changes: 85 additions & 0 deletions src/torchcodec/transforms/_decoder_transforms.py
@@ -0,0 +1,85 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Sequence

from torch import nn


@dataclass
class DecoderTransform(ABC):
"""Base class for all decoder transforms.
A *decoder transform* is a transform that is applied by the decoder before
returning the decoded frame. Applying decoder transforms to frames
should be both faster and more memory efficient than receiving normally
decoded frames and applying the same kind of transform.
Most `DecoderTransform` objects have a complementary transform in TorchVision,
Contributor:

Annoyingly, single backticks in rst mean italics. I think you wanted those to be code, like in markdown? For that we need double backticks (there are other instances of single backticks below and maybe in other files?).

Suggested change:
-Most `DecoderTransform` objects have a complementary transform in TorchVision,
+Most ``DecoderTransform`` objects have a complementary transform in TorchVision,

specifically in
`torchvision.transforms.v2 <https://docs.pytorch.org/vision/stable/transforms.html#v2-api-reference-recommended>`_.
For such transforms, we ensure that:
1. The names are the same.
2. Default behaviors are the same.
3. The parameters for the `DecoderTransform` object are a subset of the
TorchVision transform object.
4. Parameters with the same name control the same behavior and accept a
subset of the same types.
5. The differences between the frames returned by a decoder transform and
the complementary TorchVision transform are small.
All decoder transforms are applied in the output pixel format and colorspace.
"""

@abstractmethod
def _make_params(self) -> str:
pass
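A minimal sketch of guarantee (5) above (assuming a local "video.mp4", TorchVision installed, and an illustrative tolerance; the two paths are close but never bitwise-equal):

import torch
from torchvision.transforms import v2

from torchcodec.decoders import VideoDecoder
from torchcodec.transforms import Resize

# Decoder-side resize vs. the complementary TorchVision resize post-decode.
got = VideoDecoder("video.mp4", transforms=[Resize(size=(135, 240))]).get_frame_at(0).data
want = v2.Resize(size=(135, 240))(VideoDecoder("video.mp4").get_frame_at(0).data)

# Compare in float to avoid uint8 wrap-around; the atol here is an assumption.
torch.testing.assert_close(got.float(), want.float(), atol=3, rtol=0)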


@dataclass
class Resize(DecoderTransform):
"""Resize the decoded frame to a given size.
Complementary TorchVision transform:
`torchvision.transforms.v2.Resize <https://docs.pytorch.org/vision/stable/generated/torchvision.transforms.v2.Resize.html#torchvision.transforms.v2.Resize>`_.
Interpolation is always bilinear. Anti-aliasing is always on.
Args:
size (sequence of int): Desired output size. Must be a sequence of
the form (height, width).
"""

size: Sequence[int]

def _make_params(self) -> str:
assert len(self.size) == 2
return f"resize, {self.size[0]}, {self.size[1]}"
Comment on lines +60 to +62
Contributor:

Can we call it something other than _make_params?

make_params exists for the v2 transforms, but it does something quite different.


Contributor Author:

Note this class method below is new. Because I'm trying to exhaustively catch all of the v2.Resize options we don't support, the code for turning a v2.Resize into a torchcodec.transforms.Resize got more involved. Extrapolated across more transforms, this kind of logic would end up dominating the code in _video_decoder.py. By making this a private class method, we can put all logic related to which v2.Resize options we support, and how to turn a v2.Resize into a torchcodec.transforms.Resize, in one place.

Also, to state it explicitly: _from_torchvision() and _make_params() are private methods, so they're not publicly documented. Users shouldn't need to know about them.

@classmethod
def _from_torchvision(cls, resize_tv: nn.Module):
from torchvision.transforms import v2
Contributor:

I'd suggest the following:

try:
    from torchvision.transforms import v2
except ImportError as e:
    raise RuntimeError("Couldn't find TorchVision - this should never happen, please report a bug") from e

This should probably be in a helper function, reused across classes. My goal here is mainly to help the reader (us, the devs) understand that this code-path is only expected to be run when TV is already available. Otherwise, the plain import makes it look like v2 could be a hard dep.
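A sketch of that helper (the name _import_torchvision_v2 is hypothetical, not part of this PR), wrapping the suggestion above for reuse across classes:

def _import_torchvision_v2():
    # Only reached on code paths where TorchVision availability was already
    # checked, so an ImportError here indicates a TorchCodec bug.
    try:
        from torchvision.transforms import v2
    except ImportError as e:
        raise RuntimeError(
            "Couldn't find TorchVision - this should never happen, "
            "please report a bug"
        ) from e
    return v2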


assert isinstance(resize_tv, v2.Resize)

if resize_tv.interpolation is not v2.InterpolationMode.BILINEAR:
raise ValueError(
"TorchVision Resize transform must use bilinear interpolation."
)
if resize_tv.antialias is False:
raise ValueError(
"TorchVision Resize transform must have antialias enabled."
)
if resize_tv.size is None:
raise ValueError("TorchVision Resize transform must have a size specified.")
if len(resize_tv.size) != 2:
raise ValueError(
"TorchVision Resize transform must have a (height, width) "
f"pair for the size, got {resize_tv.size}."
)
return cls(size=resize_tv.size)
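A sketch of the conversion round-trip through the private helpers above (assuming TorchVision is installed; the size values are illustrative):

from torchvision.transforms import v2
from torchcodec.transforms import Resize

# Defaults (bilinear interpolation, antialias on) satisfy the checks above:
tc_resize = Resize._from_torchvision(v2.Resize(size=(135, 240)))
assert tc_resize._make_params() == "resize, 135, 240"

# Unsupported options are rejected, e.g. nearest-neighbor interpolation:
# Resize._from_torchvision(
#     v2.Resize(size=(135, 240), interpolation=v2.InterpolationMode.NEAREST)
# )  # raises ValueError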