CLI

Tip

The torchx CLI provides subcommands for launching jobs, querying schedulers, and managing running applications. It can be extended with custom subcommands via entry points.

Prerequisites: Quickstart (installation and first launch).

The torchx CLI is the primary way most users interact with TorchX. The Runner Python API provides the same capabilities programmatically (see torchx.runner).
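For example, checking a job's status programmatically goes through the same runner that backs the CLI (a minimal sketch; the app handle shown is hypothetical):

from torchx.runner import get_runner

# the runner behind the CLI; status/describe/log take an app handle
# of the form $scheduler_name://torchx/$job_id
runner = get_runner()
print(runner.status("local_cwd://torchx/echo-crls3hcpwjmhc"))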

Built-in Commands

Command    Purpose
---------  ------------------------------------------------------
run        Launch a component on a scheduler.
builtins   List all registered components.
runopts    Show scheduler-specific config options.
status     Check the status of a submitted job.
describe   Describe a submitted job (reconstruct its AppDef).
log        Fetch log lines for a running or completed job.
list       List jobs on a scheduler.
cancel     Cancel a running job.
delete     Delete a job definition from the scheduler.
configure  Manage .torchxconfig settings.
tracker    Query tracker backends for artifacts and metadata.

torchx run Key Flags

$ torchx run [--scheduler SCHED] [-cfg KEY=VAL,...] [--workspace PATH] COMPONENT [ARGS...]

Flag                     Purpose
-----------------------  ------------------------------------------------------------
-s / --scheduler         Scheduler backend name (e.g. local_cwd, kubernetes, slurm).
-cfg / --scheduler_args  Comma-separated scheduler config key-value pairs
                         (e.g. -cfg namespace=default,queue=gpu). Run torchx runopts
                         to see available options per scheduler.
--workspace              Path to the local workspace directory. Overrides
                         Role.workspace for role[0].
--dryrun                 Print the scheduler request without submitting.
--wait                   Block until the job reaches a terminal state.
--log                    Tail logs after submission (implies --wait).
--tee_logs               Prefix log lines with replica identity (e.g. trainer/0) so
                         you can distinguish which replica produced each line.
--parent_run_id          Optional parent run ID for experiment tracking. Groups
                         related runs under a common identifier (propagated as
                         TORCHX_PARENT_RUN_ID).
--stdin                  Read run arguments as JSON from stdin instead of CLI flags.
                         When set, most other CLI flags are disallowed.

Usage examples:

$ torchx run --scheduler local_cwd utils.python --script my_app.py
$ torchx run --scheduler kubernetes -cfg namespace=default dist.ddp -j 2x2 --script train.py
$ torchx runopts kubernetes
$ torchx status local_cwd://torchx/my_job_id
$ torchx log local_cwd://torchx/my_job_id trainer/0

Extending the CLI

Subclass SubCommand and register it via the torchx.cli.cmds entry-point group. Implement two methods: add_arguments(), which declares the subcommand's flags, and run(), which executes it (see the API reference at the end of this page).

See Registering Custom CLI Commands in the Advanced Usage guide for a complete walkthrough with code examples.
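For orientation, here is a minimal sketch of such a subcommand (CmdMyTool and its flags are hypothetical; only the SubCommand base class and its two abstract methods come from torchx):

import argparse

from torchx.cli.cmd_base import SubCommand

class CmdMyTool(SubCommand):
    def add_arguments(self, subparser: argparse.ArgumentParser) -> None:
        # declare this subcommand's flags and positionals
        subparser.add_argument("--config", type=str, help="path to a config file")
        subparser.add_argument("app_id", type=str, help="app to operate on")

    def run(self, args: argparse.Namespace) -> None:
        # the subcommand's actual behavior goes here
        print(f"mytool: app_id={args.app_id}, config={args.config}")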

Testing Your SubCommand

Build an argparse.Namespace through the command's own parser and call run() directly:

import argparse
import unittest

# CmdMyTool is the hypothetical custom subcommand from the sketch above
from my_pkg.cli import CmdMyTool

class CmdMyToolTest(unittest.TestCase):
    def test_run(self) -> None:
        cmd = CmdMyTool()
        # build the args exactly the way the CLI would
        parser = argparse.ArgumentParser()
        cmd.add_arguments(parser)
        args = parser.parse_args(["--config", "test.yaml", "app-123"])
        # cmd.run(args)  # call and assert side effects

See torchx/cli/test/ for tests of the built-in subcommands.
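Such a test runs under the standard unittest runner (the module path here is hypothetical):

$ python -m unittest my_pkg.cli.test.cmd_my_tool_test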

Common Pitfalls

  • Entry-point key becomes the subcommand name: Choose a short, descriptive name; it is the exact string users type after torchx.

  • Entry point targets a class, not a factory: Unlike schedulers and trackers, CLI entry points reference the SubCommand class itself.

  • Overriding built-in commands: If your key matches a built-in (e.g. run), your command replaces it entirely.
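For reference, a registration sketch using setuptools (the package and module names are hypothetical; the key mytool becomes the subcommand name):

# setup.py
from setuptools import setup

setup(
    name="my-torchx-ext",
    version="0.1.0",
    packages=["my_pkg"],
    # register CmdMyTool under the torchx.cli.cmds entry-point group
    entry_points={
        "torchx.cli.cmds": [
            "mytool = my_pkg.cli:CmdMyTool",
        ],
    },
)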

Components Library

Components are reusable job templates that the CLI discovers via entry points. Run torchx builtins to list all registered components, or torchx run COMPONENT --help to see the arguments for a specific component.

API Reference

The torchx CLI is a command-line tool built around torchx.runner.Runner. It lets you launch a torchx.specs.AppDef directly onto one of the supported schedulers without authoring a pipeline (aka workflow). This is convenient for quickly iterating on application logic without the technical and cognitive overhead of learning, writing, and maintaining pipelines.

Note

When in doubt use torchx --help.

Listing the builtin components

Most of the components under the torchx.components module are what the CLI considers “built-in” apps. Before you write your own component you should browse through the builtins to see if there is one that fits your needs already. If so, no need to even author an app spec!

$ torchx builtins
Found <n> builtin configs:
 1. metrics.tensorboard
 2. serve.torchserve
 3. utils.binary
 ... <omitted for brevity>

Listing the supported schedulers and arguments

To get a list of supported schedulers that you can launch your job into run:

$ torchx runopts
local_docker:
{ 'log_dir': { 'default': 'None',
               'help': 'dir to write stdout/stderr log files of replicas',
               'type': 'str'}}
local_cwd:
...
slurm:
...
kubernetes:
...

Running a component as a job

The run subcommand takes one of:

  1. name of the builtin

    $ torchx run --scheduler <sched_name> utils.echo
    
  2. full python module path of the component function

    $ torchx run --scheduler <sched_name> torchx.components.utils.echo
    
  3. file path of the *.py file that defines the component, along with the component function name in that file.

    $ cat ~/my_trainer_spec.py
    import torchx.specs as specs
    
    def my_trainer(foo: int, bar: str) -> specs.AppDef:
      <...spec file details omitted for brevity...>
    
    $ torchx run --scheduler <sched_name> ~/my_trainer_spec.py:my_trainer
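    For reference, a fleshed-out (and hypothetical) version of such a spec file might look like the following; see torchx.specs for the full set of AppDef and Role fields:

    # ~/my_trainer_spec.py
    import torchx.specs as specs

    def my_trainer(foo: int, bar: str) -> specs.AppDef:
        return specs.AppDef(
            name="my_trainer",
            roles=[
                specs.Role(
                    name="trainer",
                    image="/tmp",  # image semantics depend on the scheduler
                    entrypoint="python",
                    args=["-m", "my_module.train", "--foo", str(foo), "--bar", bar],
                    num_replicas=1,
                )
            ],
        )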
    

Now that you know how to choose which app to launch, it is time to see what parameters need to be passed. There are three sets of parameters:

  1. arguments to the run subcommand, see a list of them by running:

    $ torchx run --help
    
  2. arguments to the scheduler (--scheduler_args, also known as run_options or run_configs). Each scheduler takes different args; to find them for a specific scheduler, run torchx runopts <scheduler_name> (shown below for the local_cwd scheduler):

    $ torchx runopts local_cwd
    { 'log_dir': { 'default': 'None',
               'help': 'dir to write stdout/stderr log files of replicas',
               'type': 'str'}}
    
    # pass run options as comma-delimited k=v pairs
    $ torchx run --scheduler local_cwd --scheduler_args log_dir=/tmp ...
    
  3. arguments to the component (the app args are included here). These also depend on the component and can be seen by passing --help to the component:

    $ torchx run --scheduler local_cwd utils.echo --help
    usage: torchx run echo.torchx [-h] [--msg MSG]
    
    Echos a message
    
    optional arguments:
      -h, --help  show this help message and exit
      --msg MSG   Message to echo
    

Putting everything together, running echo with the local_cwd scheduler:

$ torchx run --scheduler local_cwd --scheduler_args log_dir=/tmp utils.echo --msg "hello $USER"
=== RUN RESULT ===
torchx 2022-06-15 16:08:57 INFO     Log files located in: /tmp/torchx/echo-crls3hcpwjmhc/echo/0
local_cwd://torchx/echo-crls3hcpwjmhc

By default the run subcommand does not block for the job to finish; instead it simply schedules the job on the specified scheduler and prints an app handle, which is a URL of the form: $scheduler_name://torchx/$job_id. Take note of this handle since it is what you'll need to provide to other subcommands to identify your job.

Note

If the --scheduler option is not provided, it falls back to the default scheduler backend, which points to local_cwd. To change the default scheduler, see: Registering Custom Schedulers.
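Scheduler run options can also be pinned in .torchxconfig so they need not be repeated on every invocation. A minimal sketch, assuming INI sections named after the schedulers listed by torchx runopts:

# .torchxconfig
[local_cwd]
log_dir = /tmp/torchx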

Inspecting what will run (dryrun)

When you are authoring or debugging a component, you may want to find out and inspect both the scheduler request object that the runner submits as well as the AppDef object that is created by the component function. To do this, use the --dryrun option to the run subcommand:

$ torchx run --dryrun utils.echo --msg hello_world
=== APPLICATION ===
{ 'metadata': {},
  'name': 'echo',
  'roles': [ { 'args': ['hello_world'],
               'entrypoint': '/bin/echo',
               'env': {},
               'image': '/tmp',
               'max_retries': 0,
               'name': 'echo',
               'num_replicas': 1,
               'port_map': {},
               'resource': { 'capabilities': {},
                             'cpu': -1,
                             'gpu': -1,
                             'memMB': -1},
               'retry_policy': <RetryPolicy.APPLICATION: 'APPLICATION'>}]}
=== SCHEDULER REQUEST ===
PopenRequest(
    app_id='echo_c944ffb2',
    log_dir='/tmp/torchx_asmtmyqj/torchx_kiuk/echo_c944ffb2',
    role_params={
        'echo': [
            ReplicaParam(
                args=['/bin/echo', 'hello_world'],
                env={'TORCHELASTIC_ERROR_FILE': '/tmp/torchx_asmtmyqj/torchx_kiuk/echo_c944ffb2/echo/0/error.json'},
                stdout=None,
                stderr=None)
            ]
        },
    role_log_dirs={'echo': ['/tmp/torchx_asmtmyqj/torchx_kiuk/echo_c944ffb2/echo/0']})

Note

The scheduler request printout will look different depending on the scheduler type. The example above is a faux request since the scheduler is a local scheduler, which simply uses subprocess.Popen to simulate replicas as POSIX processes. Nevertheless, the scheduler request offers valuable insight into what the runner translates the AppDef into for a particular scheduler backend.

Describing and querying the status of a job

The describe subcommand is essentially the inverse of the run command. That is, it prints the app spec given an app_handle.

$ torchx describe <app handle>

Important

The describe command attempts to recreate an app spec by querying the scheduler for the job description. What you see printed is therefore not always 100% identical to the app spec that was given to run. How faithfully the runner can recreate the app spec depends on numerous factors, such as how descriptive the scheduler's describe_job API is and whether fields in the app spec were dropped at submission because the scheduler does not support that parameter or functionality. NEVER rely on the describe API as storage for your app spec; it is simply there to help you spot-check things.

To get the status of a running job:

$ torchx status <app_handle> # prints status for all replicas and roles
$ torchx status --role trainer <app_handle> # filters it down to the trainer role

Filtering by --role is useful for large jobs that have multiple roles.

Viewing Logs

Note

This functionality depends on how long your scheduler setup retains logs. TorchX DOES NOT archive logs on your behalf, but rather relies on the scheduler's get_log API to obtain them. Refer to your scheduler's user manual to set up log retention properly.

The log subcommand is a simple wrapper around the scheduler's get_log API that lets you pull/print the logs for all replicas and roles from one place. It also lets you pull logs for a specific replica or role. Below are a few useful log access patterns:

# print all logs from all replicas and roles
# (each log line is prefixed with role name and replica id)
$ torchx log <app_handle>

# if the job is still running, tail the logs
$ torchx log --tail <app_handle>

# regex filter for exceptions
$ torchx log --regex ".*Exception.*" <app_handle>

# pull all logs for the role trainer
$ torchx log <app_handle>/<role_name>
$ torchx log <app_handle>/trainer

# pull only trainer 0 and trainer 1 (node id, not rank) logs
$ torchx log <app_handle>/<role_name>/<replica_id>
$ torchx log <app_handle>/trainer/0,1

Note

Some schedulers do not support server-side regex filters. In this case the regex filter is applied on the client-side, meaning the full logs will have to be passed through the client. This may be very taxing to the local host. Please use your best judgment when using the logs API.

Listing Jobs

The list subcommand lets you list app handles and statuses of apps launched on a scheduler. You can then use the app handles to further describe the app, view the logs, etc.

$ torchx list -s <scheduler_name>

$ torchx list -s kubernetes
APP HANDLE                                          APP STATUS
--------------------------------------------------  ------------
kubernetes://torchx/default:trainer-a5qvfhe1hyq2w   SUCCEEDED
kubernetes://torchx/default:trainer-d796ei2tdtest   SUCCEEDED
kubernetes://torchx/default:trainer-em0iao2m90000   FAILED
kubernetes://torchx/default:trainer-ew33oxmdg0123   SUCCEEDED
class torchx.cli.cmd_base.SubCommand

Base subcommand class; all subcommands should implement it.

abstract add_arguments(subparser: ArgumentParser) -> None

Adds this subcommand's arguments to the given subparser.

abstract run(args: Namespace) -> None

Runs the subcommand. Parsed arguments are available as args.

See also

Advanced Usage

Entry-point registration for custom CLI commands, trackers, schedulers, and components.
