.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated_examples/decoding/transforms.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_examples_decoding_transforms.py:

.. meta::
    :description: Learn how to apply transforms during video decoding for improved memory efficiency and performance.

========================================================
Decoder Transforms: Applying transforms during decoding
========================================================

In this example, we will demonstrate how to use the ``transforms`` parameter of
the :class:`~torchcodec.decoders.VideoDecoder` class. This parameter allows us
to specify a list of :class:`torchcodec.transforms.DecoderTransform` or
:class:`torchvision.transforms.v2.Transform` objects. These objects serve as
transform specifications that the :class:`~torchcodec.decoders.VideoDecoder`
will apply during the decoding process.

.. GENERATED FROM PYTHON SOURCE LINES 24-26

First, a bit of boilerplate: definitions that we will use later. You can skip
ahead to our :ref:`example_video` or :ref:`applying_transforms`.

.. GENERATED FROM PYTHON SOURCE LINES 26-64

.. code-block:: Python

    import torch
    import requests
    import tempfile
    from pathlib import Path
    import shutil
    from time import perf_counter_ns


    def store_video_to(url: str, local_video_path: Path):
        response = requests.get(url, headers={"User-Agent": ""})
        if response.status_code != 200:
            raise RuntimeError(f"Failed to download video. {response.status_code = }.")

        with open(local_video_path, "wb") as f:
            # Stream the download in reasonably sized chunks rather than
            # byte-by-byte (iter_content() defaults to chunk_size=1).
            for chunk in response.iter_content(chunk_size=64 * 1024):
                f.write(chunk)


    def plot(frames: torch.Tensor, title: str | None = None):
        try:
            from torchvision.utils import make_grid
            from torchvision.transforms.v2.functional import to_pil_image
            import matplotlib.pyplot as plt
        except ImportError:
            print("Cannot plot, please run `pip install torchvision matplotlib`")
            return

        plt.rcParams["savefig.bbox"] = "tight"
        dpi = 300
        fig, ax = plt.subplots(figsize=(800 / dpi, 600 / dpi), dpi=dpi)
        ax.imshow(to_pil_image(make_grid(frames)))
        ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])
        if title is not None:
            ax.set_title(title, fontsize=6)
        plt.tight_layout()

.. GENERATED FROM PYTHON SOURCE LINES 65-73

.. _example_video:

Our example video
-----------------

We'll download a video from the internet and store it locally. We're
purposefully retrieving a high resolution video to demonstrate using
transforms to reduce its dimensions.

.. GENERATED FROM PYTHON SOURCE LINES 73-86

.. code-block:: Python

    # Video source: https://www.pexels.com/video/an-african-penguin-at-the-beach-9140346/
    # Author: Taryn Elliott.
    url = "https://videos.pexels.com/video-files/9140346/9140346-uhd_3840_2160_25fps.mp4"

    temp_dir = tempfile.mkdtemp()
    penguin_video_path = Path(temp_dir) / "penguin.mp4"
    store_video_to(url, penguin_video_path)

    from torchcodec.decoders import VideoDecoder

    print(f"Penguin video metadata: {VideoDecoder(penguin_video_path).metadata}")

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    Penguin video metadata: VideoStreamMetadata:
      duration_seconds_from_header: 37.24
      begin_stream_seconds_from_header: 0.0
      bit_rate: 24879454.0
      codec: h264
      stream_index: 0
      duration_seconds: 37.24
      begin_stream_seconds: 0.0
      begin_stream_seconds_from_content: 0.0
      end_stream_seconds_from_content: 37.24
      width: 3840
      height: 2160
      num_frames_from_header: 931
      num_frames_from_content: 931
      average_fps_from_header: 25.0
      pixel_aspect_ratio: 0
      end_stream_seconds: 37.24
      num_frames: 931
      average_fps: 25.0

.. GENERATED FROM PYTHON SOURCE LINES 87-97

As shown above, the video is 37 seconds long, with a height of 2160 pixels and
a width of 3840 pixels.

.. note::

   The colloquial way to report the dimensions of this video would be as
   3840x2160; that is, (`width`, `height`). In the PyTorch ecosystem, image
   dimensions are typically expressed as (`height`, `width`). The remainder of
   this tutorial uses the PyTorch convention of (`height`, `width`) to specify
   image dimensions.

.. GENERATED FROM PYTHON SOURCE LINES 99-108

.. _applying_transforms:

Applying transforms during pre-processing
-----------------------------------------

A pre-processing pipeline for videos during training will typically apply a
set of transforms for a variety of reasons. Below is a simple example of
applying TorchVision's :class:`~torchvision.transforms.v2.Resize` transform to
a single frame **after** the decoder returns it:

.. GENERATED FROM PYTHON SOURCE LINES 108-117

.. code-block:: Python

    from torchvision.transforms import v2

    full_decoder = VideoDecoder(penguin_video_path)
    frame = full_decoder[5]
    resized_after = v2.Resize(size=(480, 640))(frame)

    plot(resized_after, title="Resized to 480x640 after decoding")

.. image-sg:: /generated_examples/decoding/images/sphx_glr_transforms_001.png
   :alt: Resized to 480x640 after decoding
   :srcset: /generated_examples/decoding/images/sphx_glr_transforms_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 118-122

In the example above, ``full_decoder`` returns a video frame with dimensions
(2160, 3840), which is then resized down to (480, 640). But with the
``transforms`` parameter of :class:`~torchcodec.decoders.VideoDecoder`, we can
specify that the resize should happen **during** decoding!

.. GENERATED FROM PYTHON SOURCE LINES 122-131

.. code-block:: Python

    resize_decoder = VideoDecoder(
        penguin_video_path,
        transforms=[v2.Resize(size=(480, 640))]
    )
    resized_during = resize_decoder[5]

    plot(resized_during, title="Resized to 480x640 during decoding")

.. image-sg:: /generated_examples/decoding/images/sphx_glr_transforms_002.png
   :alt: Resized to 480x640 during decoding
   :srcset: /generated_examples/decoding/images/sphx_glr_transforms_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 132-150

TorchCodec's relationship to TorchVision transforms
---------------------------------------------------

Notably, in our examples we are passing TorchVision
:class:`~torchvision.transforms.v2.Transform` objects as our transforms.
However, TorchVision is **not required** to use decoder transforms. Every
TorchVision transform that :class:`~torchcodec.decoders.VideoDecoder` accepts
has a complementary transform defined in :mod:`torchcodec.transforms`. We
would have gotten equivalent behavior if we had passed in the
:class:`torchcodec.transforms.Resize` object that is a part of TorchCodec.
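For example, here is a minimal sketch of a TorchCodec-native equivalent of
``resize_decoder`` above. It assumes that :class:`torchcodec.transforms.Resize`
accepts the same ``size`` argument as its TorchVision counterpart, which the
parameter guarantees listed below imply:

.. code-block:: Python

    from torchcodec.transforms import Resize as CodecResize

    # Sketch: same resize-during-decoding behavior as the TorchVision-based
    # resize_decoder above, with no dependency on TorchVision. We assume the
    # `size` parameter mirrors torchvision.transforms.v2.Resize.
    codec_resize_decoder = VideoDecoder(
        penguin_video_path,
        transforms=[CodecResize(size=(480, 640))],
    )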
:class:`~torchcodec.decoders.VideoDecoder` accepts both kinds of objects as a
matter of convenience, and to clarify the relationship between the transforms
that TorchCodec applies and the transforms that TorchVision offers.

Importantly, the two frames are not identical, even though we can see they
*look* very similar:

.. GENERATED FROM PYTHON SOURCE LINES 150-154

.. code-block:: Python

    abs_diff = (resized_after.float() - resized_during.float()).abs()
    (abs_diff == 0).all()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    tensor(False)

.. GENERATED FROM PYTHON SOURCE LINES 155-156

But they're close enough that models won't be able to tell the difference:

.. GENERATED FROM PYTHON SOURCE LINES 156-158

.. code-block:: Python

    assert (abs_diff <= 1).float().mean() >= 0.998

.. GENERATED FROM PYTHON SOURCE LINES 159-198

While :class:`~torchcodec.decoders.VideoDecoder` accepts TorchVision
transforms as *specifications*, it does not actually use the TorchVision
implementation of these transforms. Instead, it maps them to equivalent
`FFmpeg filters <https://ffmpeg.org/ffmpeg-filters.html>`_. That is,
:class:`torchvision.transforms.v2.Resize` and
:class:`torchcodec.transforms.Resize` are mapped to
`scale <https://ffmpeg.org/ffmpeg-filters.html#scale-1>`_; and
:class:`torchvision.transforms.v2.CenterCrop` and
:class:`torchcodec.transforms.CenterCrop` are mapped to
`crop <https://ffmpeg.org/ffmpeg-filters.html#crop>`_.

The relationships we ensure between TorchCodec
:class:`~torchcodec.transforms.DecoderTransform` objects and TorchVision
:class:`~torchvision.transforms.v2.Transform` objects are:

1. The names are the same.
2. Default behaviors are the same.
3. The parameters for the :class:`~torchcodec.transforms.DecoderTransform`
   object are a subset of those of the TorchVision
   :class:`~torchvision.transforms.v2.Transform` object.
4. Parameters with the same name control the same behavior and accept a
   subset of the same types.
5. The difference between the frames returned by a decoder transform and the
   complementary TorchVision transform is small enough that a model should
   not be able to tell the difference.

.. note::

   Applying the exact same transforms during training and inference is
   important for model performance. For example, if you use decoder
   transforms to resize frames during training, you should also use decoder
   transforms to resize frames during inference. We provide the similarity
   guarantees to mitigate the harm when the two techniques are
   *unintentionally* mixed. That is, if you use decoder transforms to resize
   frames during training, but use TorchVision's
   :class:`~torchvision.transforms.v2.Resize` during inference, our
   guarantees mitigate the harm to model performance. But we **recommend
   against** this kind of mixing. It is appropriate and expected to use some
   decoder transforms and some TorchVision transforms, as long as the exact
   same pre-processing operations are performed during training and
   inference.

.. GENERATED FROM PYTHON SOURCE LINES 200-210

Decoder transform pipelines
---------------------------

So far, we've only provided a single transform to the ``transforms`` parameter
of :class:`~torchcodec.decoders.VideoDecoder`. But it actually accepts a list
of transforms, which becomes a pipeline of transforms. The order of the list
matters: the first transform in the list receives the originally decoded
frame, the output of that transform becomes the input to the next transform in
the list, and so on, as the sketch below illustrates.
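Conceptually, ``transforms=[t1, t2]`` applies ``t2(t1(frame))`` to each
decoded frame, except that the work happens inside FFmpeg rather than in
Python. A minimal sketch of that mental model, using TorchVision transforms on
an already-decoded full-resolution frame (illustration only; this is not how
the decoder executes transforms):

.. code-block:: Python

    # Mental model only: the decoder does NOT call the TorchVision
    # implementations; it runs equivalent FFmpeg filters during decoding.
    t1 = v2.CenterCrop(size=(1280, 1664))
    t2 = v2.Resize(size=(480, 640))

    frame = full_decoder[5]      # original (2160, 3840) frame
    pipelined = t2(t1(frame))    # same ordering as transforms=[t1, t2]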
A simple example:

.. GENERATED FROM PYTHON SOURCE LINES 210-221

.. code-block:: Python

    crop_resize_decoder = VideoDecoder(
        penguin_video_path,
        transforms=[
            v2.CenterCrop(size=(1280, 1664)),
            v2.Resize(size=(480, 640)),
        ]
    )
    crop_resized_during = crop_resize_decoder[5]

    plot(crop_resized_during, title="Center cropped then resized to 480x640")

.. image-sg:: /generated_examples/decoding/images/sphx_glr_transforms_003.png
   :alt: Center cropped then resized to 480x640
   :srcset: /generated_examples/decoding/images/sphx_glr_transforms_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 222-241

Performance: memory efficiency and speed
----------------------------------------

The main motivation for decoder transforms is *memory efficiency*,
particularly when applying transforms that reduce the size of a frame, such as
resize and crop. Because the FFmpeg layer knows all of the transforms it needs
to apply during decoding, it's able to efficiently reuse memory. Further, full
resolution frames are never returned to the Python layer. As a result, there
is significantly less total memory needed and less pressure on the Python
garbage collector. In benchmarks reducing frames from (1080, 1920) down to
(135, 240), we have observed a reduction in peak resident set size from 4.3 GB
to 0.4 GB.

There is sometimes a runtime benefit as well, but it depends on the number of
threads that the :class:`~torchcodec.decoders.VideoDecoder` tells FFmpeg to
use. We define the following benchmark function, as well as the functions to
benchmark:

.. GENERATED FROM PYTHON SOURCE LINES 241-303

.. code-block:: Python

    def bench(f, average_over=3, warmup=1, **f_kwargs):
        # Warm up caches and lazy initialization before timing.
        for _ in range(warmup):
            f(**f_kwargs)

        times = []
        for _ in range(average_over):
            start_time = perf_counter_ns()
            f(**f_kwargs)
            end_time = perf_counter_ns()
            times.append(end_time - start_time)

        times = torch.tensor(times) * 1e-6  # ns to ms
        times_std = times.std().item()
        times_med = times.median().item()
        return f"{times_med = :.2f}ms +- {times_std:.2f}"


    from torchcodec import samplers


    def sample_decoder_transforms(num_threads: int):
        # Transforms are applied by FFmpeg during decoding.
        decoder = VideoDecoder(
            penguin_video_path,
            transforms=[
                v2.CenterCrop(size=(1280, 1664)),
                v2.Resize(size=(480, 640)),
            ],
            seek_mode="approximate",
            num_ffmpeg_threads=num_threads,
        )
        transformed_frames = samplers.clips_at_regular_indices(
            decoder,
            num_clips=1,
            num_frames_per_clip=200,
        )
        assert len(transformed_frames.data[0]) == 200


    def sample_torchvision_transforms(num_threads: int):
        # Full-resolution frames are decoded first; TorchVision transforms
        # are applied afterwards, in Python.
        if num_threads > 0:
            torch.set_num_threads(num_threads)
        decoder = VideoDecoder(
            penguin_video_path,
            seek_mode="approximate",
            num_ffmpeg_threads=num_threads,
        )
        frames = samplers.clips_at_regular_indices(
            decoder,
            num_clips=1,
            num_frames_per_clip=200,
        )
        transforms = v2.Compose(
            [
                v2.CenterCrop(size=(1280, 1664)),
                v2.Resize(size=(480, 640)),
            ]
        )
        transformed_frames = transforms(frames.data)
        assert transformed_frames.shape[1] == 200
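The ``bench`` helper measures wall-clock time. If you want to verify the peak
memory claim above on your own setup, one rough approach is to check the
process's peak resident set size. Below is a minimal sketch using the
standard-library ``resource`` module (Unix-only); since the reported peak
never decreases over the life of a process, each variant should be measured in
its own fresh process:

.. code-block:: Python

    import resource
    import sys

    def peak_rss_gb() -> float:
        # ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return peak * (1 if sys.platform == "darwin" else 1024) / 1e9

    # Run one variant per process, e.g.:
    #     sample_decoder_transforms(num_threads=0)
    #     print(f"peak RSS: {peak_rss_gb():.2f} GB")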
.. GENERATED FROM PYTHON SOURCE LINES 304-309

When the :class:`~torchcodec.decoders.VideoDecoder` object sets the number of
FFmpeg threads to 0, FFmpeg decides how many threads to use based on what is
available on the current system. In such cases, decoder transforms tend to
outperform getting back a full frame and applying TorchVision transforms
sequentially:

.. GENERATED FROM PYTHON SOURCE LINES 309-314

.. code-block:: Python

    print(f"decoder transforms: {bench(sample_decoder_transforms, num_threads=0)}")
    print(f"torchvision transform: {bench(sample_torchvision_transforms, num_threads=0)}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    decoder transforms: times_med = 2324.79ms +- 6.86
    torchvision transform: times_med = 5226.43ms +- 36.37

.. GENERATED FROM PYTHON SOURCE LINES 315-319

The reason is that FFmpeg applies the decoder transforms in parallel. However,
if the number of threads is 1 (the default), there is often less benefit to
using decoder transforms. Using the TorchVision transforms may even be faster!

.. GENERATED FROM PYTHON SOURCE LINES 319-323

.. code-block:: Python

    print(f"decoder transforms: {bench(sample_decoder_transforms, num_threads=1)}")
    print(f"torchvision transform: {bench(sample_torchvision_transforms, num_threads=1)}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    decoder transforms: times_med = 15140.43ms +- 6.01
    torchvision transform: times_med = 17282.33ms +- 40.54

.. GENERATED FROM PYTHON SOURCE LINES 324-333

In brief, our performance guidance is:

1. If you are applying a transform pipeline that significantly reduces the
   dimensions of your input frames and memory efficiency matters, use decoder
   transforms.
2. If you are using multiple FFmpeg threads, decoder transforms may be faster.
   Experiment with your setup to verify.
3. If you are using a single FFmpeg thread, decoder transforms may be slower.
   Experiment with your setup to verify.

.. GENERATED FROM PYTHON SOURCE LINES 333-335

.. code-block:: Python

    shutil.rmtree(temp_dir)

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (3 minutes 19.378 seconds)

.. _sphx_glr_download_generated_examples_decoding_transforms.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: transforms.ipynb <transforms.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: transforms.py <transforms.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: transforms.zip <transforms.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_