Quickstart¶
Tip
Install TorchX, write a simple app, and launch it locally and remotely – including distributed jobs. Estimated time: 10–15 minutes.
Installation¶
Install TorchX (provides the torchx CLI and the
Runner Python API):
$ pip install "torchx[dev]"
Verify the installation:
$ torchx --help
Hello World¶
Create a simple my_app.py:
import sys
print(f"Hello, {sys.argv[1]}!")
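The app reads its name from `sys.argv[1]`, which raises an `IndexError` when no argument is passed. A slightly more defensive variant (the `greet` helper and the `"world"` fallback are illustrative additions, not part of the quickstart app):

```python
import sys

def greet(argv):
    # Fall back to a default name when no argument is given.
    name = argv[1] if len(argv) > 1 else "world"
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet(sys.argv))
```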
Launching¶
Launch the app with torchx run. The scheduler is the
backend that runs the job – local_cwd runs it in your current directory.
You’ll use the utils.python component (a reusable job
template):
$ torchx run --scheduler local_cwd utils.python --help
The component takes a script name; extra arguments are passed through to the script.
$ torchx run --scheduler local_cwd utils.python --script my_app.py "your name"
Using the Python API¶
The same operations are available via get_runner():
from torchx.runner import get_runner

with get_runner() as runner:
    app_handle = runner.run_component(
        "utils.python",
        ["--script", "my_app.py", "your name"],
        scheduler="local_cwd",
    )
    # Wait for the job to complete and print its final status
    final_status = runner.wait(app_handle, wait_interval=1)
    print(final_status)
You can also construct an AppDef directly and pass
it to run():
import torchx.specs as specs
from torchx.runner import get_runner

app = specs.AppDef(
    name="hello",
    roles=[
        specs.Role(
            name="worker",
            entrypoint="python",
            # "image" is the base runtime environment. For local schedulers
            # it's a filesystem path; for container schedulers it's a Docker
            # image name (e.g. "my_image:latest").
            image="/tmp",
            args=["my_app.py", "your name"],
        )
    ],
)

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="local_cwd")
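Both `run()` and `run_component()` return an app handle string of the form `<scheduler>://<session_name>/<app_id>`, which you pass to runner methods like `wait()`. A quick sketch of splitting such a handle apart (the handle value here is made up for illustration):

```python
def parse_app_handle(handle):
    # An app handle looks like "<scheduler>://<session_name>/<app_id>".
    scheduler, rest = handle.split("://", 1)
    session_name, _, app_id = rest.partition("/")
    return scheduler, session_name, app_id

# Hypothetical handle, for illustration only:
print(parse_app_handle("local_cwd://torchx/hello-abc123"))
```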
The local_docker scheduler packages your local workspace as a layer on top
of the specified image – a close approximation of remote container environments.
Note
This requires Docker to be installed and won’t work in environments such as Google Colab. See the Docker install instructions: https://docs.docker.com/get-docker/
$ torchx run --scheduler local_docker utils.python --script my_app.py "your name"
TorchX defaults to the ghcr.io/pytorch/torchx Docker container image, which contains the PyTorch libraries, TorchX, and related dependencies.
Distributed¶
The dist.ddp component (DDP = Distributed Data Parallel) uses
TorchElastic
to manage workers, enabling multi-node jobs on all supported schedulers.
$ torchx run --scheduler local_docker dist.ddp --help
Create dist_app.py:
import torch
import torch.distributed as dist
dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")
a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")
Launch with 2 nodes and 2 workers per node (-j 2x2 = <nodes>x<workers_per_node>):
$ torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py
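With `-j 2x2` there are 4 workers holding tensors `[0]`, `[1]`, `[2]`, and `[3]`, and `all_reduce` defaults to a sum, so every worker should end up with `tensor([6])`. A plain-Python sanity check of that arithmetic (world size hard-coded to match the 2x2 example):

```python
# Each of the 4 workers contributes a tensor containing its rank;
# all_reduce with the default SUM op leaves every worker with the total.
world_size = 2 * 2          # 2 nodes x 2 workers per node
ranks = list(range(world_size))
expected = sum(ranks)       # 0 + 1 + 2 + 3
print(expected)             # -> 6, the value inside every worker's output tensor
```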
Workspaces / Patching¶
TorchX uses workspaces to automatically overlay your local code onto the job’s base image, so you don’t need to rebuild and push a Docker image after every code change. See torchx.workspace for details.
.torchxconfig¶
Configure scheduler defaults in a .torchxconfig file instead of passing
-cfg flags every time:
[kubernetes]
queue=torchx
image_repo=<your docker image repository>
[slurm]
partition=torchx
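`.torchxconfig` uses INI syntax, so you can sanity-check a config with Python’s standard `configparser` before running jobs. The content below mirrors the example above; the `image_repo` value is a placeholder:

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[kubernetes]
queue=torchx
image_repo=example.com/my-repo
[slurm]
partition=torchx
""")
print(config["kubernetes"]["queue"])   # -> torchx
print(config["slurm"]["partition"])    # -> torchx
```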
Remote Schedulers¶
The same torchx run command works on remote schedulers – only the
--scheduler flag changes.
$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
List all scheduler-specific options:
$ torchx runopts
Custom Images¶
Docker-based Schedulers¶
Provide a custom Dockerfile to add libraries beyond the standard PyTorch set.
Create timm_app.py:
import timm
print(timm.models.resnet18())
Create Dockerfile.torchx:
FROM pytorch/pytorch:2.6.0-cuda12.6-cudnn9-runtime
RUN pip install timm
COPY . .
TorchX uses this Dockerfile automatically:
$ torchx run --scheduler local_docker utils.python --script timm_app.py
Slurm¶
The slurm and local_cwd schedulers use the current environment, so
pip and conda work as usual.
Next Steps¶
- Explore the API Quick Reference for copy-pasteable recipes
- Explore the torchx CLI and the Runner Python API
- Review supported schedulers
- Browse builtin components
See also
- Basic Concepts
Core concepts behind AppDef, Component, Runner, and Scheduler.
- .torchxconfig
Configuring scheduler options via .torchxconfig.
- Custom Components
Writing and registering your own components.