TorchX¶
TorchX is a universal job launcher for PyTorch applications. Define your job once and run it on any supported backend – from your laptop to Kubernetes to Slurm clusters – without rewriting configuration for each environment.
Why TorchX?
Write once, run anywhere. The same torchx run command (or Runner call) works across all schedulers. Switch from local development to production clusters by changing a single flag.
No vendor lock-in. TorchX abstracts the scheduler, so your job definitions stay portable across Kubernetes, Slurm, AWS Batch, and more.
Batteries included. A built-in components library provides ready-made launchers for distributed training, inference, and common utilities – so you don’t start from scratch.
Zero runtime dependency. Your application has no dependency on TorchX at runtime. TorchX is only needed at launch time.
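For instance, the my_app.py launched in the quickstart below can be plain Python with no TorchX import at all (a minimal sketch; the real script can be anything):

```python
# my_app.py -- launched by TorchX, but with zero TorchX imports.
import sys


def main(args):
    # Print the greeting passed on the command line by the launcher.
    print(args[0] if args else "Hello")


if __name__ == "__main__":
    main(sys.argv[1:])
```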
Tip
New to TorchX?
Follow the recommended reading order:
Quickstart – install, write a simple app, and launch it (10 min)
Quick Reference – the Python API at a glance (imports, types, recipes)
Basic Concepts – core concepts: AppDef, Component, Runner, Scheduler
Custom Components – write your own reusable component
Advanced Usage – register plugins and extend TorchX
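To preview how the core concepts fit together, here is a TorchX-free sketch (plain dataclasses standing in for torchx.specs; names mirror the real API but this is illustrative, not the library itself): an AppDef bundles one or more Roles, and a component is simply a function that returns an AppDef.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Role:
    # One replicated process group within the job.
    name: str
    entrypoint: str
    image: str
    args: List[str] = field(default_factory=list)


@dataclass
class AppDef:
    # The full job definition: a name plus its roles.
    name: str
    roles: List[Role]


# A "component" is just a callable that builds an AppDef.
Component = Callable[..., AppDef]


def python_script(script: str, *script_args: str) -> AppDef:
    # Hypothetical component: wrap a Python script as a one-role job.
    return AppDef(
        name=script,
        roles=[
            Role(
                name="worker",
                entrypoint="python",
                image="/tmp",
                args=[script, *script_args],
            )
        ],
    )
```

A Runner then submits the AppDef to a Scheduler backend; the real types live in torchx.specs and torchx.runner.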
In 1-2-3¶
Step 1. Install
pip install torchx[dev]
Step 2. Run Locally
import torchx.specs as specs
from torchx.runner import get_runner

app = specs.AppDef(
    name="hello",
    roles=[
        specs.Role(
            name="worker",
            entrypoint="python",
            image="/tmp",
            args=["my_app.py", "Hello, localhost!"],
        )
    ],
)

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="local_cwd")
    print(runner.status(app_handle))
Or from the CLI:
torchx run --scheduler local_cwd utils.python --script my_app.py "Hello, localhost!"
Step 3. Run Remotely – only the scheduler argument changes:
with get_runner() as runner:
    app_handle = runner.run(app, scheduler="kubernetes")
torchx run --scheduler kubernetes utils.python --script my_app.py "Hello, Kubernetes!"
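The handle returned by runner.run is a URI-like string identifying the submitted job, of the form scheduler://session/app_id. A minimal parser sketch (pure Python, under that format assumption):

```python
from typing import NamedTuple


class ParsedHandle(NamedTuple):
    scheduler: str
    session: str
    app_id: str


def parse_handle(handle: str) -> ParsedHandle:
    # Split "<scheduler>://<session>/<app_id>" into its three parts.
    scheduler, rest = handle.split("://", 1)
    session, app_id = rest.split("/", 1)
    return ParsedHandle(scheduler, session, app_id)
```

Because the scheduler backend is encoded in the handle, the same status/describe/log calls work regardless of where the job ran.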
Ecosystem¶
TorchX is part of the PyTorch ecosystem. It complements – rather than replaces – other PyTorch projects:
TorchElastic handles fault-tolerant distributed training within a job. TorchX launches the job itself, and the built-in dist.ddp component uses TorchElastic under the hood.
TorchServe serves models in production. TorchX can launch TorchServe instances on remote clusters.
TorchRec / TorchVision / TorchAudio provide domain-specific libraries. TorchX launches training and inference jobs that use them.
TorchX does not prescribe a training framework, model architecture, or data pipeline. It operates at the job-launching layer and works with any Python application. For workflow orchestration (DAGs of jobs), integrate TorchX with Airflow or Kubeflow Pipelines. For hyperparameter tuning, use Ax with TorchX as the trial launcher. See When to Use TorchX (and When Not To) for a detailed comparison with alternatives.
Documentation¶
API Reference
Works With¶
Schedulers
Workspaces
Runtime Library¶
Best Practices