torchx.runner

Submits AppDef jobs to schedulers.

The runner takes an AppDef (the result of evaluating a component function) along with a scheduler name and run config, and submits it as a job.

from torchx.runner import get_runner

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="kubernetes", cfg=cfg)
    status = runner.status(app_handle)
    print(status)

The Runner submits, monitors, and manages jobs. Use get_runner() to create one with all registered schedulers.

Key methods:

run() / run_component() – submit a job
status() – poll current state
wait() – block until terminal state
cancel() – request cancellation
delete() – remove a job definition from the scheduler
log_lines() – stream log output
list() – list jobs on a scheduler
dryrun() – preview what would be submitted without submitting
schedule() – submit a previously dry-run request (allows request mutation)

Scheduler instances are created lazily on first use. Use the Runner as a context manager for automatic cleanup.
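The lifecycle described above can be sketched as a small helper. This is a minimal illustration, not part of TorchX: `submit_and_wait` and its `make_runner` parameter are hypothetical names, and the runner is duck-typed so that only the methods listed above (`run()`, `wait()`, `cancel()`) and the context-manager protocol are assumed.

```python
from typing import Any, Callable, Mapping, Optional


def submit_and_wait(
    make_runner: Callable[[], Any],
    app: Any,
    scheduler: str,
    cfg: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Submit `app`, block until it finishes, and return the final status.

    `make_runner` is a hypothetical zero-argument factory, for example
    torchx.runner.get_runner.
    """
    # Leaving the with-block closes the runner (and its schedulers),
    # even if run() or wait() raises.
    with make_runner() as runner:
        app_handle = runner.run(app, scheduler=scheduler, cfg=cfg)
        try:
            return runner.wait(app_handle)
        except KeyboardInterrupt:
            # Request cancellation before propagating Ctrl-C.
            runner.cancel(app_handle)
            raise
```

With TorchX installed, a call might look like `submit_and_wait(get_runner, app, "local_cwd")`, where `app` is an AppDef built elsewhere.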

See Quick Reference for copy-pasteable recipes.

torchx.runner.get_runner(name: str | None = None, component_defaults: dict[str, dict[str, str]] | None = None, **scheduler_params: Any) → Runner

Creates a Runner with all registered schedulers.

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="kubernetes", cfg=cfg)
    print(runner.status(app_handle))

Parameters: scheduler_params – extra kwargs passed to all scheduler constructors.

class torchx.runner.Runner(name: str = '', scheduler_factories: dict[str, torchx.schedulers.SchedulerFactory] | None = None, component_defaults: dict[str, dict[str, str]] | None = None, scheduler_params: dict[str, object] | None = None)

Submits, monitors, and manages AppDef jobs.

Use get_runner() to create an instance with all registered schedulers.

>>> from torchx.runner import get_runner
>>> runner = get_runner()
>>> runner.scheduler_backends()  
['local_cwd', 'local_docker', 'slurm', 'kubernetes', ...]

cancel(app_handle: str) → None: Requests cancellation. The app transitions to CANCELLED asynchronously.

cfg_from_str() – convenience function around the scheduler’s runopts.cfg_from_str() method.

Usage:

from torchx.runner import get_runner

runner = get_runner()
cfg = runner.cfg_from_str("local_cwd", "log_dir=/tmp/foobar", "prepend_cwd=True")
assert cfg == {"log_dir": "/tmp/foobar", "prepend_cwd": True, "auto_set_cuda_visible_devices": False}

close() → None: Closes the runner and all scheduler instances. Safe to call multiple times.

delete(app_handle: str) → None: Deletes the app from the scheduler.

describe(app_handle: str) → torchx.specs.api.AppDef | None

Reconstructs the AppDef from the scheduler.

Completeness is scheduler-dependent. Returns None if the app no longer exists.

dryrun() – returns what would be submitted without actually submitting.

The returned AppDryRunInfo can be print()-ed for inspection or passed to schedule().

dryrun_component(component: str, component_args: list[str] | dict[str, Any], scheduler: str, cfg: Optional[Mapping[str, str | int | float | bool | list[str] | dict[str, str] | None]] = None, workspace: torchx.specs.api.Workspace | str | None = None, parent_run_id: str | None = None) → AppDryRunInfo: Like run_component() but returns the request without submitting.

list() – lists jobs on the scheduler.

Parameters: cfg – scheduler config, used by some schedulers for backend routing.

log_lines(app_handle: str, role_name: str, k: int = 0, regex: str | None = None, since: datetime.datetime | None = None, until: datetime.datetime | None = None, should_tail: bool = False, streams: torchx.schedulers.api.Stream | None = None) → Iterable[str]

Returns an iterator over log lines for the k-th replica of a role.

Important

k is the node (host) id, NOT the worker rank.

Warning

Completeness is scheduler-dependent. Lines may be partial or missing if logs have been purged. Do not use this for programmatic output parsing.

Each line keeps its trailing newline (\n). Use print(line, end="") to avoid double newlines.

Parameters:

k – replica (node) index
regex – optional filter pattern
since – start cursor (scheduler-dependent)
until – end cursor (scheduler-dependent)
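The notes on log_lines() can be put together in a short helper. This is a sketch under stated assumptions: `dump_logs` is a hypothetical name, the runner is duck-typed, and only the `k` and `regex` parameters from the signature above are forwarded.

```python
import sys
from typing import Any, Optional, TextIO


def dump_logs(
    runner: Any,
    app_handle: str,
    role_name: str,
    k: int = 0,
    regex: Optional[str] = None,
    out: Optional[TextIO] = None,
) -> int:
    """Write the logs of replica `k` of `role_name` to `out`; return the line count."""
    out = sys.stdout if out is None else out
    count = 0
    # k selects the node (host) of the role, not the worker rank.
    for line in runner.log_lines(app_handle, role_name, k=k, regex=regex):
        # Each line already ends in "\n", so suppress print's own newline.
        print(line, end="", file=out)
        count += 1
    return count
```

Because completeness is scheduler-dependent, the returned count is best treated as informational rather than as a guarantee that every line was seen.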

run() – submits an AppDef or returns its dry-run info.

# Submit
handle = runner.run(app, "mkube", cfg=cfg)

# Dryrun — inspect without submitting
info = runner.run(app, "mkube", cfg=cfg, dryrun=True)
print(info)

Parameters: dryrun – If True, only validate and render the request without submitting. Returns AppDryRunInfo. If False (default), submit and return the AppHandle.

run_component() – resolves and runs a named component.

Component resolution order (high → low):

User-registered torchx.components entry points
Builtins relative to torchx.components (e.g. "dist.ddp")
File-based path/to/file.py:function_name
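The priority order above can be modelled as a simple lookup chain. This is a simplified illustration, not TorchX's actual resolver: `resolve_component` and the two mapping parameters are hypothetical stand-ins for the entry-point and builtin registries.

```python
from typing import Callable, Mapping, Tuple, Union

Resolved = Union[Callable[..., object], Tuple[str, str]]


def resolve_component(
    name: str,
    user_registered: Mapping[str, Callable[..., object]],
    builtins: Mapping[str, Callable[..., object]],
) -> Tuple[str, Resolved]:
    """Return (source, component) for `name`, honoring the priority order."""
    # 1. User-registered entry points take precedence.
    if name in user_registered:
        return "entry_point", user_registered[name]
    # 2. Builtins such as "dist.ddp" come next.
    if name in builtins:
        return "builtin", builtins[name]
    # 3. Finally, treat "path/to/file.py:function_name" as a file reference.
    path, sep, function_name = name.rpartition(":")
    if sep and path.endswith(".py") and function_name:
        return "file", (path, function_name)
    raise LookupError(f"cannot resolve component {name!r}")
```

The point of the sketch is that a user-registered component shadows a builtin of the same name, and a file path is consulted only when neither registry matches.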

schedule(dryrun_info: AppDryRunInfo) → str

Submits a previously dry-run request, allowing request mutation.

dryrun_info = runner.dryrun(app, scheduler="kubernetes", cfg=cfg)
dryrun_info.request.foo = "bar"  # mutate the raw request
app_handle = runner.schedule(dryrun_info)

Warning

Use sparingly. Overwriting many raw scheduler fields may cause your usage to diverge from TorchX’s supported API.

scheduler_backends() → list[str]: Returns all registered scheduler backend names.

scheduler_run_opts(scheduler: str) → runopts: Returns the runopts for the given scheduler.
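One use of scheduler_backends() is to validate a scheduler name before submitting. The helper below is a minimal sketch with a hypothetical name (`require_backend`) and a duck-typed runner; it assumes nothing beyond the method documented above.

```python
from typing import Any


def require_backend(runner: Any, scheduler: str) -> str:
    """Return `scheduler` if the runner has it registered, else raise ValueError."""
    available = runner.scheduler_backends()
    if scheduler not in available:
        raise ValueError(
            f"unknown scheduler {scheduler!r}; registered backends: {sorted(available)}"
        )
    return scheduler
```

Once the name is validated, `runner.scheduler_run_opts(scheduler)` can be used to inspect which run config options that backend accepts.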

status(app_handle: str) → torchx.specs.api.AppStatus | None: Returns app status, or None if the app no longer exists.

stop(app_handle: str) → None: Deprecated. Use cancel() instead.

wait(app_handle: str, wait_interval: float = 10) → torchx.specs.api.AppStatus | None

Blocks until the app reaches a terminal state.

Parameters: wait_interval – seconds between status polls
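The wait() signature above has no timeout parameter, so a caller that needs an upper bound can poll status() directly. The sketch below is one way to do that; `wait_with_timeout` is a hypothetical helper, and it assumes the returned status object exposes an `is_terminal()` method.

```python
import time
from typing import Any, Optional


def wait_with_timeout(
    runner: Any,
    app_handle: str,
    timeout: float,
    poll_interval: float = 10.0,
) -> Optional[Any]:
    """Poll status() until the app is terminal, gone, or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = runner.status(app_handle)
        # None means the app no longer exists on the scheduler.
        if status is None or status.is_terminal():
            return status
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"{app_handle} not terminal after {timeout} seconds")
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

On timeout the job is left running; the caller decides whether to follow up with `runner.cancel(app_handle)`.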
