Shortcuts

Quickstart

Tip

Install TorchX, write a simple app, and launch it locally and remotely – including distributed jobs. Estimated time: 10–15 minutes.

Installation

Install TorchX (provides the torchx CLI and the Runner Python API):

$ pip install "torchx[dev]"

Verify the installation:

$ torchx --help

Hello World

Create a simple my_app.py:

import sys

print(f"Hello, {sys.argv[1]}!")

Launching

Launch the app with torchx run. The scheduler is the backend that runs the job – local_cwd runs it in your current directory. You’ll use the utils.python component (a reusable job template):

$ torchx run --scheduler local_cwd utils.python --help

The component takes a script name; extra arguments are passed through to the script.

$ torchx run --scheduler local_cwd utils.python --script my_app.py "your name"

Using the Python API

The same operations are available via get_runner():

from torchx.runner import get_runner

with get_runner() as runner:
    app_handle = runner.run_component(
        "utils.python",
        ["--script", "my_app.py", "your name"],
        scheduler="local_cwd",
    )
    # Wait for the job to complete and print its final status
    final_status = runner.wait(app_handle, wait_interval=1)
    print(final_status)

You can also construct an AppDef directly and pass it to run():

import torchx.specs as specs
from torchx.runner import get_runner

app = specs.AppDef(
    name="hello",
    roles=[
        specs.Role(
            name="worker",
            entrypoint="python",
            # "image" is the base runtime environment. For local schedulers
            # it's a filesystem path; for container schedulers it's a Docker
            # image name (e.g. "my_image:latest").
            image="/tmp",
            args=["my_app.py", "your name"],
        )
    ],
)

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="local_cwd")

The local_docker scheduler packages your local workspace as a layer on top of the specified image – a close approximation of remote container environments.

Note

This requires Docker installed and won’t work in environments such as Google Colab. See the Docker install instructions: https://docs.docker.com/get-docker/

$ torchx run --scheduler local_docker utils.python --script my_app.py "your name"

TorchX defaults to using the ghcr.io/pytorch/torchx Docker container image which contains the PyTorch libraries, TorchX and related dependencies.

Distributed

The dist.ddp component (DDP = Distributed Data Parallel) uses TorchElastic to manage workers, enabling multi-node jobs on all supported schedulers.

$ torchx run --scheduler local_docker dist.ddp --help

Create dist_app.py:

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")

a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")

Launch with 2 nodes and 2 workers per node (-j 2x2 = <nodes>x<workers_per_node>):

$ torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py

Workspaces / Patching

TorchX uses workspaces to automatically overlay your local code onto the job’s base image, so you don’t need to rebuild and push a Docker image after every code change. See torchx.workspace for details.

.torchxconfig

Configure scheduler defaults in a .torchxconfig file instead of passing -cfg flags every time:

[kubernetes]
queue=torchx
image_repo=<your docker image repository>

[slurm]
partition=torchx

Remote Schedulers

The same torchx run command works on remote schedulers – only the --scheduler flag changes.

$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py

List all scheduler-specific options:

$ torchx runopts

Custom Images

Docker-based Schedulers

Provide a custom Dockerfile to add libraries beyond the standard PyTorch set.

Create timm_app.py:

import timm

print(timm.models.resnet18())

Create Dockerfile.torchx:

FROM pytorch/pytorch:2.6.0-cuda12.6-cudnn9-runtime

RUN pip install timm

COPY . .

TorchX uses this Dockerfile automatically:

$ torchx run --scheduler local_docker utils.python --script timm_app.py

Slurm

The slurm and local_cwd schedulers use the current environment, so pip and conda work as usual.

Next Steps

  1. Explore the API Quick Reference for copy-pasteable recipes

  2. Explore the torchx CLI and the Runner Python API

  3. Review supported schedulers

  4. Browse builtin components

See also

Basic Concepts

Core concepts behind AppDef, Component, Runner, and Scheduler.

.torchxconfig

Configuring scheduler options via .torchxconfig.

Custom Components

Writing and registering your own components.

Docs

Access comprehensive developer documentation for PyTorch

View Docs

Tutorials

Get in-depth tutorials for beginners and advanced developers

View Tutorials

Resources

Find development resources and get your questions answered

View Resources