.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_examples_apps_lightning_classy_vision_component.py>` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_apps_lightning_classy_vision_component.py:

Trainer Component Examples
==========================

Component definitions that run the example lightning_classy_vision app
in a single or distributed manner.

Before executing the examples, install torchx and the dependencies
necessary to run them:

.. code:: bash

    pip install torchx
    git clone https://github.com/pytorch/torchx.git
    cd torchx
    pip install -r dev-requirements.txt

.. note:: The working dir should be ``torchx`` to run the components.

.. code-block:: default

    # Importing torchx api specifications
    from typing import Optional, Dict, List

    import torchx.specs.api as torchx
    from torchx.specs import macros, named_resources, Resource
    from torchx.version import TORCHX_IMAGE

Single Trainer Component
########################

Defines a single trainer component.

Use the following cmd to try it out:

.. code:: bash

    torchx run --scheduler local_cwd \
        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer \
        --output_path /tmp/$USER

Single trainer component code:

.. code-block:: default

    def trainer(
        output_path: str,
        image: str = TORCHX_IMAGE,
        data_path: Optional[str] = None,
        load_path: str = "",
        log_path: str = "/tmp/logs",
        resource: Optional[str] = None,
        env: Optional[Dict[str, str]] = None,
        skip_export: bool = False,
        epochs: int = 1,
        layers: Optional[List[int]] = None,
        learning_rate: Optional[float] = None,
        num_samples: int = 200,
    ) -> torchx.AppDef:
        """Runs the example lightning_classy_vision app.

        Args:
            output_path: output path for model checkpoints (e.g. file:///foo/bar)
            image: image to run (e.g. foobar:latest)
            load_path: path to load a pretrained model from
            data_path: path to the data to load; if not provided,
                auto-generated test data will be used
            log_path: path to save tensorboard logs to
            resource: the resources to use
            env: env variables for the app
            skip_export: disable model export
            epochs: number of epochs to run
            layers: the number of convolutional layers and their number of channels
            learning_rate: the learning rate to use for training
            num_samples: the number of images to run per batch; use 0 to run on all
        """
        env = env or {}
        args = [
            "-m",
            "torchx.examples.apps.lightning_classy_vision.train",
            "--output_path",
            output_path,
            "--load_path",
            load_path,
            "--log_path",
            log_path,
            "--epochs",
            str(epochs),
        ]
        if layers:
            args += ["--layers"] + [str(layer) for layer in layers]
        if learning_rate:
            args += ["--lr", str(learning_rate)]
        if num_samples and num_samples > 0:
            args += ["--num_samples", str(num_samples)]
        if data_path:
            args += ["--data_path", data_path]
        if skip_export:
            args.append("--skip_export")
        return torchx.AppDef(
            name="cv-trainer",
            roles=[
                torchx.Role(
                    name="worker",
                    entrypoint="python",
                    args=args,
                    env=env,
                    image=image,
                    resource=named_resources[resource]
                    if resource
                    else torchx.Resource(cpu=1, gpu=0, memMB=3000),
                )
            ],
        )
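A component is an ordinary Python function that returns a ``torchx.AppDef``,
so it can be sanity-checked in-process before submitting it to a scheduler.
The following is a minimal sketch (not part of the original example) that
builds the single trainer ``AppDef`` and inspects the role it defines; the
output comments follow from the defaults above:

.. code-block:: python

    # Minimal sketch: build the trainer AppDef in-process and inspect it.
    # Run from the `torchx` repo root so the example package is importable.
    from torchx.examples.apps.lightning_classy_vision.component import trainer

    app = trainer(output_path="/tmp/out", epochs=2, layers=[16, 32])

    print(app.name)         # cv-trainer
    role = app.roles[0]
    print(role.entrypoint)  # python
    print(role.args[:6])    # ['-m', 'torchx.examples.apps.lightning_classy_vision.train',
                            #  '--output_path', '/tmp/out', '--load_path', '']
    print(role.resource)    # the default Resource(cpu=1, gpu=0, memMB=3000)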
Distributed Trainer Component
#############################

Defines a distributed trainer component.

Use the following cmd to try it out:

.. code:: bash

    torchx run --scheduler local_cwd \
        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
        --output_path /tmp --rdzv_backend c10d --rdzv_endpoint localhost:29500

Executing distributed trainer job
#################################

TorchX supports the Kubernetes scheduler, which allows users to execute
distributed jobs on a Kubernetes cluster. It uses
`Volcano <https://volcano.sh>`_ to schedule jobs.

The distributed trainer uses
`torch.distributed.run <https://pytorch.org/docs/stable/elastic/run.html>`_
to start the user processes. There are two rendezvous types that can be used
to execute distributed jobs: ``c10d`` and ``etcd``.

Prerequisites to execute distributed jobs on a Kubernetes cluster:

* Install Volcano 1.4.0:

  .. code:: bash

      kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.4.0/installer/volcano-development.yaml

* Install an etcd server on your Kubernetes cluster:

  .. code:: bash

      kubectl apply -f https://raw.githubusercontent.com/pytorch/torchx/main/resources/etcd.yaml

After that, the job can be executed on Kubernetes:

.. code:: bash

    torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=default \
        ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
        --nnodes 2 --epochs 1 --output_path /tmp

Distributed trainer component code:

.. code-block:: default

    def trainer_dist(
        output_path: str,
        image: str = TORCHX_IMAGE,
        data_path: Optional[str] = None,
        load_path: str = "",
        log_path: str = "/tmp/logs",
        resource: Optional[str] = None,
        env: Optional[Dict[str, str]] = None,
        skip_export: bool = False,
        epochs: int = 1,
        nnodes: int = 1,
        nproc_per_node: int = 1,
        rdzv_backend: str = "etcd",
        rdzv_endpoint: str = "etcd-server:2379",
    ) -> torchx.AppDef:
        """Runs the example lightning_classy_vision app in a distributed manner.

        Args:
            output_path: output path for model checkpoints (e.g. file:///foo/bar)
            image: image to run (e.g. foobar:latest)
            load_path: path to load a pretrained model from
            data_path: path to the data to load; if not provided,
                auto-generated test data will be used
            log_path: path to save tensorboard logs to
            resource: the resources to use
            env: env variables for the app
            skip_export: disable model export
            epochs: number of epochs to run
            nnodes: number of nodes to run the trainer on, default 1
            nproc_per_node: number of processes per node. Each process
                is assumed to use a separate GPU, default 1
            rdzv_backend: rendezvous backend to use; allowed values can be
                found in the torch.distributed.elastic rendezvous registry.
                The default backend is ``etcd``
            rdzv_endpoint: controller endpoint. If rdzv_backend is ``etcd``,
                this is an etcd endpoint; for ``c10d``, this is the endpoint
                of one of the hosts. The default endpoint is ``etcd-server:2379``
        """
        env = env or {}
        args = [
            "-m",
            "torch.distributed.run",
            "--rdzv_backend",
            rdzv_backend,
            "--rdzv_endpoint",
            rdzv_endpoint,
            "--rdzv_id",
            f"{macros.app_id}",
            "--nnodes",
            str(nnodes),
            "--nproc_per_node",
            str(nproc_per_node),
            "-m",
            "torchx.examples.apps.lightning_classy_vision.train",
            "--output_path",
            output_path,
            "--load_path",
            load_path,
            "--log_path",
            log_path,
            "--epochs",
            str(epochs),
        ]
        if data_path:
            args += ["--data_path", data_path]
        if skip_export:
            args.append("--skip_export")
        resource_def = (
            named_resources[resource]
            if resource
            else Resource(cpu=nnodes, gpu=0, memMB=3000)
        )
        return torchx.AppDef(
            name="cv-trainer",
            roles=[
                torchx.Role(
                    name="worker",
                    entrypoint="python",
                    args=args,
                    env=env,
                    image=image,
                    resource=resource_def,
                    num_replicas=nnodes,
                )
            ],
        )
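The rendezvous parameters are baked into the role's command line rather than
handled by TorchX itself. As a quick check, the sketch below (again, not part
of the original example) builds the distributed ``AppDef`` and shows how
``nnodes`` and the ``rdzv_*`` arguments surface in the generated
``torch.distributed.run`` invocation:

.. code-block:: python

    # Minimal sketch: inspect how trainer_dist wires up torch.distributed.run.
    from torchx.examples.apps.lightning_classy_vision.component import trainer_dist

    app = trainer_dist(
        output_path="/tmp/out",
        nnodes=2,
        nproc_per_node=1,
        rdzv_backend="c10d",
        rdzv_endpoint="localhost:29500",
    )

    role = app.roles[0]
    print(role.num_replicas)  # 2 -- one replica per node
    print(role.args[:6])      # ['-m', 'torch.distributed.run', '--rdzv_backend', 'c10d',
                              #  '--rdzv_endpoint', 'localhost:29500']
    # --rdzv_id is filled with the macros.app_id placeholder, which is
    # substituted with the real app id when the job is submitted.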
Interpreting the Model
######################

Defines a component that interprets the model.

First train a model with the single trainer component
(:ref:`examples_apps/lightning_classy_vision/component:Single Trainer Component`),
then use the following cmd to try out the interpret component:

.. code:: bash

    torchx run --scheduler local_cwd \
        ./torchx/examples/apps/lightning_classy_vision/component.py:interpret \
        --output_path /tmp/$USER/interpret --load_path /tmp/$USER/last.ckpt

.. code-block:: default

    def interpret(
        load_path: str,
        output_path: str,
        data_path: Optional[str] = None,
        image: str = TORCHX_IMAGE,
        resource: Optional[str] = None,
    ) -> torchx.AppDef:
        """Runs the model interpretability app on the model outputted by the
        training component.

        Args:
            load_path: path to load the pretrained model from
            output_path: output path for model checkpoints (e.g. file:///foo/bar)
            data_path: path to the data to load
            image: image to run (e.g. foobar:latest)
            resource: the resources to use
        """
        args = [
            "-m",
            "torchx.examples.apps.lightning_classy_vision.interpret",
            "--load_path",
            load_path,
            "--output_path",
            output_path,
        ]
        if data_path:
            args += [
                "--data_path",
                data_path,
            ]

        return torchx.AppDef(
            name="cv-interpret",
            roles=[
                torchx.Role(
                    name="worker",
                    entrypoint="python",
                    args=args,
                    image=image,
                    resource=named_resources[resource]
                    if resource
                    else torchx.Resource(cpu=1, gpu=0, memMB=1024),
                )
            ],
        )

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)

.. _sphx_glr_download_examples_apps_lightning_classy_vision_component.py:

.. only:: html

    .. container:: sphx-glr-footer
        :class: sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: component.py <component.py>`

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: component.ipynb <component.ipynb>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_