TorchX¶
TorchX is a universal job launcher for PyTorch applications. Define your job once and run it on any supported backend – from your laptop to Kubernetes to Slurm clusters – without rewriting configuration for each environment.
Why TorchX?
Write once, run anywhere. The same torchx run command (or Runner call) works across all schedulers. Switch from local development to production clusters by changing a single flag.
No vendor lock-in. TorchX abstracts the scheduler, so your job definitions stay portable across Kubernetes, Slurm, AWS Batch, and more.
Batteries included. A built-in components library provides ready-made launchers for distributed training, inference, and common utilities – so you don’t start from scratch.
Zero runtime dependency. Your application has no dependency on TorchX at runtime. TorchX is only needed at launch time.
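For instance, the my_app.py launched in the quickstart below can be plain Python with no TorchX import at all (a minimal sketch; the real script can be anything):

```python
# my_app.py -- launched by TorchX, but with zero TorchX imports.
import sys


def main(args):
    # Print the greeting passed on the command line by the launcher.
    print(args[0] if args else "Hello")


if __name__ == "__main__":
    main(sys.argv[1:])
```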
Tip
New to TorchX?
Follow the recommended reading order:
Quickstart – install, write a simple app, and launch it (10 min)
Quick Reference – the Python API at a glance (imports, types, recipes)
Basic Concepts – core concepts: AppDef, Component, Runner, Scheduler
Custom Components – write your own reusable component
Advanced Usage – register plugins and extend TorchX
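To preview how the core concepts fit together, here is a TorchX-free sketch (plain dataclasses standing in for torchx.specs; names mirror the real API but this is illustrative, not the library itself): an AppDef bundles one or more Roles, and a component is simply a function that returns an AppDef.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Role:
    # One replicated process group within the job.
    name: str
    entrypoint: str
    image: str
    args: List[str] = field(default_factory=list)


@dataclass
class AppDef:
    # The full job definition: a name plus its roles.
    name: str
    roles: List[Role]


# A "component" is just a callable that builds an AppDef.
Component = Callable[..., AppDef]


def python_script(script: str, *script_args: str) -> AppDef:
    # Hypothetical component: wrap a Python script as a one-role job.
    return AppDef(
        name=script,
        roles=[
            Role(
                name="worker",
                entrypoint="python",
                image="/tmp",
                args=[script, *script_args],
            )
        ],
    )
```

A Runner then submits the AppDef to a Scheduler backend; the real types live in torchx.specs and torchx.runner.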
In 1-2-3¶
Step 1. Install
pip install torchx[dev]
Step 2. Run Locally
import torchx.specs as specs
from torchx.runner import get_runner

app = specs.AppDef(
    name="hello",
    roles=[
        specs.Role(
            name="worker",
            entrypoint="python",
            image="/tmp",
            args=["my_app.py", "Hello, localhost!"],
        )
    ],
)

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="local_cwd")
    print(runner.status(app_handle))
Or from the CLI:
torchx run --scheduler local_cwd utils.python --script my_app.py "Hello, localhost!"
Step 3. Run Remotely – only the scheduler argument changes:
with get_runner() as runner:
    app_handle = runner.run(app, scheduler="kubernetes")
torchx run --scheduler kubernetes utils.python --script my_app.py "Hello, Kubernetes!"
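The handle returned by runner.run is a URI-like string identifying the submitted job, of the form scheduler://session/app_id. A minimal parser sketch (pure Python, under that format assumption):

```python
from typing import NamedTuple


class ParsedHandle(NamedTuple):
    scheduler: str
    session: str
    app_id: str


def parse_handle(handle: str) -> ParsedHandle:
    # Split "<scheduler>://<session>/<app_id>" into its three parts.
    scheduler, rest = handle.split("://", 1)
    session, app_id = rest.split("/", 1)
    return ParsedHandle(scheduler, session, app_id)
```

Because the scheduler backend is encoded in the handle, the same status/describe/log calls work regardless of where the job ran.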
Ecosystem¶
TorchX is part of the PyTorch ecosystem. It complements – rather than replaces – other PyTorch projects:
TorchElastic handles fault-tolerant distributed training within a job. TorchX launches the job itself, and the built-in dist.ddp component uses TorchElastic under the hood.
TorchServe serves models in production. TorchX can launch TorchServe instances on remote clusters.
TorchRec / TorchVision / TorchAudio provide domain-specific libraries. TorchX launches training and inference jobs that use them.
TorchX does not prescribe a training framework, model architecture, or data pipeline. It operates at the job-launching layer and works with any Python application. For workflow orchestration (DAGs of jobs), integrate TorchX with Airflow or Kubeflow Pipelines. For hyperparameter tuning, use Ax with TorchX as the trial launcher. See When to Use TorchX (and When Not To) for a detailed comparison with alternatives.
Documentation¶
API Reference
Works With¶
Schedulers
Workspaces
Runtime Library¶
Best Practices