Host & agents (control plane & mux)#

In the runtime, a host is the thing that owns “all the procs on this machine” and gives them a single front door. The Rust type looks like this:

// from hyperactor/src/host.rs

pub struct Host<M> {
    procs: HashSet<String>,
    frontend_addr: ChannelAddr,
    backend_addr: ChannelAddr,
    router: DialMailboxRouter,
    manager: M,
    service_proc: Proc,
    local_proc: Proc,
    frontend_rx: Option<ChannelRx<MessageEnvelope>>,
}

Visually, you can think of it like this:

                      ┌────────────┐
                  ┌───▶  proc *,1  │
                  │ #1└────────────┘
                  │
  ┌──────────┐    │   ┌────────────┐
  │   Host   │◀───┼───▶  proc *,2  │
 *└──────────┘#   │ #2└────────────┘
                  │
                  │   ┌────────────┐
                  └───▶  proc *,3  │
                    #3└────────────┘
  • * is the host’s frontend address (frontend_addr). This is the address other mesh participants know.

  • # is the host’s backend address (backend_addr). Procs talk to the host here.

  • #1, #2, #3 are the per-proc backend channels that the host records in its router (router: DialMailboxRouter).

  • Each box proc *,N is a proc that is direct-addressed via the host — its id is essentially “proc at * named N”.

What the fields mean#

  • frontend_addr: the single, public entry point (* in the diagram). Messages from the rest of the mesh arrive here.

  • backend_addr: the address procs use to talk to the host (# in the diagram).

  • procs: a set of proc names managed by this host.

  • router: DialMailboxRouter: the machinery that actually multiplexes/demultiplexes between * and the per-proc channels.

  • manager: M: the thing that can create and destroy procs on this host. In the real bootstrapped case this is a BootstrapProcManager; in tests it can be a local manager.

  • service_proc: the host’s system proc handle, so the host can participate in the same message world it is hosting.

  • local_proc: an additional local proc for in-process operations.

  • frontend_rx: the optional channel receiver for the frontend address (consumed when serving starts).
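To make the mux idea concrete, here is a minimal sketch built only on std channels. It is not the hyperactor implementation: `ToyHost`, `ToyEnvelope`, `add_proc` and `pump` are hypothetical names invented for illustration, and a plain `HashMap` stands in for DialMailboxRouter.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical envelope: which proc the message is for, plus a payload.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyEnvelope {
    pub dest_proc: String,
    pub payload: String,
}

/// Toy host: one frontend channel ("*") and one backend sender per proc ("#1", "#2", ...).
pub struct ToyHost {
    frontend_tx: Sender<ToyEnvelope>,
    frontend_rx: Receiver<ToyEnvelope>,
    routes: HashMap<String, Sender<ToyEnvelope>>, // stand-in for the router
}

impl ToyHost {
    pub fn new() -> Self {
        let (frontend_tx, frontend_rx) = channel();
        ToyHost { frontend_tx, frontend_rx, routes: HashMap::new() }
    }

    /// The single public entry point: any holder can address any proc on this host.
    pub fn frontend(&self) -> Sender<ToyEnvelope> {
        self.frontend_tx.clone()
    }

    /// Register a proc by name and return the receiving end of its backend channel.
    pub fn add_proc(&mut self, name: &str) -> Receiver<ToyEnvelope> {
        let (tx, rx) = channel();
        self.routes.insert(name.to_string(), tx);
        rx
    }

    /// Forward everything queued on the frontend to the matching proc.
    /// Returns how many envelopes had no deliverable destination.
    pub fn pump(&self) -> usize {
        let mut undeliverable = 0;
        while let Ok(env) = self.frontend_rx.try_recv() {
            let delivered = match self.routes.get(&env.dest_proc) {
                Some(tx) => tx.send(env).is_ok(),
                None => false,
            };
            if !delivered {
                undeliverable += 1;
            }
        }
        undeliverable
    }
}
```

Senders only ever see the frontend; the name inside the envelope is what selects the backend channel, which is the sense in which a proc's id is "proc at * named N".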

Why this matters for bootstrapping#

When, in chapter 4, we say “now ask each host to spawn a proc,” this is the piece that makes it possible. The host already has:

  1. a public address (*),

  2. a routing table for its existing procs,

  3. and a manager capable of creating new ones.

So the host-mesh agent can receive a “create proc” request over the mesh protocol and hand it to the host, and the host will add another box to the diagram above.


Code-level view#

At the control plane we have the mesh-facing actor:

// hyperactor_mesh/src/v1/host_mesh/mesh_agent.rs (simplified: only the two fields discussed here)
pub struct HostAgent {
  host: Option<HostAgentMode>,
  created: HashMap<Name, ProcCreationState>,
}

The host field is an Option because the agent can exist before (or after) the host is actually running.

The host field is one of two shapes:

// "How are we running this host?"
pub enum HostAgentMode {
    // Real OS process, uses BootstrapProcManager underneath.
    Process(Host<BootstrapProcManager>),

    // In-process/testing host, uses a local proc manager.
    Local(Host<LocalProcManager<ProcManagerSpawnFn>>),
}

Both variants wrap a Host<…>, and that Host is the thing we drew earlier as the mux:

// hyperactor/src/host.rs (simplified)
pub struct Host<M> {
    procs: HashSet<String>,
    frontend_addr: ChannelAddr,
    backend_addr: ChannelAddr,
    router: DialMailboxRouter,
    manager: M,             // e.g. BootstrapProcManager
    service_proc: Proc,
    local_proc: Proc,
    frontend_rx: Option<ChannelRx<MessageEnvelope>>,
}

So the layering from the code’s point of view is:

  1. HostAgent (actor you message over v1)

  2. → maybe a HostAgentMode

  3. → definitely a Host<...> once materialized

  4. → which spawns procs through its manager (e.g. BootstrapProcManager) and does the */#n routing through its router.

HostAgent message handling#

The agent is exported with exactly these handlers:

#[hyperactor::export(
    handlers = [
        resource::CreateOrUpdate<ProcSpec>,
        resource::Stop,
        resource::GetState<ProcState>,
        resource::GetRankStatus { cast = true },
        resource::List,
        ShutdownHost,
    ]
)]
pub struct HostAgent {
    host: Option<HostAgentMode>,
    created: HashMap<Name, ProcCreationState>,
    local_mesh_agent: OnceCell<anyhow::Result<ActorHandle<ProcAgent>>>,
}

So everything it does is a response to one of those six messages.

1. CreateOrUpdate#

  • If we already have created[name], do nothing (idempotent).

  • Otherwise call host.spawn(name, ...) — process-backed hosts get a BootstrapProcConfig, local hosts get ().

  • Store { rank, created_result, stopped: false } in created.
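The idempotence rule can be sketched with a plain map keyed by proc name. This is an illustrative model, not the real handler: `ToyAgent`, `ToyCreationState`, the `spawn_calls` counter and the `proc@host/...` id format are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical per-proc record, mirroring the `{ rank, created_result, stopped }` shape.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyCreationState {
    pub rank: usize,
    pub created: Result<String, String>, // Ok(proc id) or Err(reason)
    pub stopped: bool,
}

pub struct ToyAgent {
    pub created: HashMap<String, ToyCreationState>,
    pub spawn_calls: usize, // counts how often the (toy) host was asked to spawn
}

impl ToyAgent {
    pub fn new() -> Self {
        ToyAgent { created: HashMap::new(), spawn_calls: 0 }
    }

    /// Stand-in for `host.spawn(name, ...)`; the id format is made up.
    fn spawn(&mut self, name: &str) -> Result<String, String> {
        self.spawn_calls += 1;
        if name.is_empty() {
            Err("empty proc name".to_string())
        } else {
            Ok(format!("proc@host/{}", name))
        }
    }

    /// A repeated request for a known name is a no-op; otherwise spawn and record the outcome.
    pub fn create_or_update(&mut self, name: &str, rank: usize) {
        if self.created.contains_key(name) {
            return;
        }
        let created = self.spawn(name);
        self.created
            .insert(name.to_string(), ToyCreationState { rank, created, stopped: false });
    }
}
```

Because the spawn result is stored whether it succeeded or failed, a retry with the same name does not spawn again in this model; the first outcome is what later status queries see.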

2. Stop#

  • Look up created[name].

  • If it was successfully created, call host.terminate_proc(..., timeout) and mark it stopped = true.

  • Reply with a StatusOverlay for that rank (or empty if we never had it).

3. GetRankStatus#

  • Look up created[name].

  • If present, return that rank with Running / Stopped / Failed (depending on what we know).

  • Otherwise return NotExist.
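Stop and GetRankStatus both reduce to a lookup in the created map followed by a small state decision. The sketch below models that decision under assumptions: `ToyEntry`, `ToyStatus`, `stop` and `rank_status` are hypothetical names, the `Option<(rank, status)>` return value is only a stand-in for a StatusOverlay, and reporting a failed creation as `Failed` on stop is a guess at reasonable behaviour rather than something the source states.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum ToyStatus {
    NotExist,
    Running,
    Stopped,
    Failed(String),
}

/// Hypothetical record for one created proc.
pub struct ToyEntry {
    pub rank: usize,
    pub created: Result<(), String>,
    pub stopped: bool,
}

fn status_of(entry: &ToyEntry) -> ToyStatus {
    match (&entry.created, entry.stopped) {
        (Err(reason), _) => ToyStatus::Failed(reason.clone()),
        (Ok(()), true) => ToyStatus::Stopped,
        (Ok(()), false) => ToyStatus::Running,
    }
}

/// Stop: mark a successfully created proc as stopped and report its rank and status.
/// Returns None (an "empty overlay") if the name was never created here.
pub fn stop(created: &mut HashMap<String, ToyEntry>, name: &str) -> Option<(usize, ToyStatus)> {
    let entry = created.get_mut(name)?;
    if entry.created.is_ok() {
        // The real handler would call host.terminate_proc(..., timeout) at this point.
        entry.stopped = true;
    }
    Some((entry.rank, status_of(entry)))
}

/// GetRankStatus: same lookup, read-only.
pub fn rank_status(created: &HashMap<String, ToyEntry>, name: &str) -> ToyStatus {
    created.get(name).map(status_of).unwrap_or(ToyStatus::NotExist)
}
```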

4. GetState#

  • Same lookup, but return the richer state:

    • the proc’s direct id at the host,

    • the rank we used,

    • the proc’s own ProcAgent ref,

    • and any bootstrap/process status the host’s manager could provide.

5. List#

  • Return all the proc names that have been created on this host (the keys from the created map).

6. ShutdownHost#

  • Ack first so the caller can await.

  • Take the host out of self.

  • Call host.terminate_children(...) (process vs local path) with the provided timeout and concurrency.
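"Take the host out of self" maps naturally onto `Option::take`. The following is a toy model of that step only: `ToyHost`, `ToyAgent` and the `acks` counter are hypothetical, the timeout and concurrency parameters are left out, and treating a second shutdown as a no-op is a choice made for the sketch.

```rust
/// Hypothetical host that only knows the names of its children.
pub struct ToyHost {
    pub children: Vec<String>,
}

impl ToyHost {
    /// Consumes the host and returns the names of the children it terminated.
    pub fn terminate_children(self) -> Vec<String> {
        self.children
    }
}

pub struct ToyAgent {
    pub host: Option<ToyHost>,
    pub acks: usize, // how many shutdown requests were acknowledged
}

impl ToyAgent {
    pub fn shutdown_host(&mut self) -> Vec<String> {
        // Acknowledge before doing any teardown work.
        self.acks += 1;
        // `take` leaves None behind, so the agent no longer owns a host afterwards.
        match self.host.take() {
            Some(host) => host.terminate_children(),
            None => Vec::new(),
        }
    }
}
```

Moving the host out by value means nothing else in the agent can reach it once teardown has started.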

Why this exists#

Host is local; HostAgent is the remote handle for it. Bootstrap code just sends CreateOrUpdate/Stop/GetState to the agent; the agent is the one that actually owns the Host and can spawn/stop procs. That’s why every handler except ShutdownHost uses the shared resource messages.

1. ProcSpec (what we tell the host to run)#

In all the examples above we sent resource::CreateOrUpdate<ProcSpec>, and in the code that really is what’s happening — but the current ProcSpec is intentionally very thin.

From hyperactor_mesh/src/resource.rs:

/// Spec for a host mesh agent to use when spawning a new proc.
#[derive(Clone, Debug, Serialize, Deserialize, Named, Default)]
pub(crate) struct ProcSpec {
    /// Config values to set on the spawned proc's global config,
    /// at the `ClientOverride` layer.
    pub(crate) client_config_override: Attrs,
}

So right now:

  • the spec is private (pub(crate)) and has exactly one field: client_config_override: Attrs;

  • the rank is not here — it’s on the outer message:

pub struct CreateOrUpdate<S> {
    pub name: Name,
    #[binding(include)]
    pub rank: Rank,
    pub spec: S,
}

What the HostAgent actually does matches this shape:

  • if the host is process-backed (HostAgentMode::Process(...)), it builds a BootstrapProcConfig from

    • the rank from CreateOrUpdate::<ProcSpec>, and

    • the client_config_override from ProcSpec,

    and passes that config to host.spawn(...);

  • if the host is local (HostAgentMode::Local(...)), it just calls host.spawn(name, ()) and ignores the override.

Here is the bit of real code that does exactly that (abridged to just the decision):

// from hyperactor_mesh/src/v1/host_mesh/mesh_agent.rs (`impl Handler<resource::CreateOrUpdate<ProcSpec>> for HostAgent`)

let created = match host {
    HostAgentMode::Process(host) => {
        host.spawn(
            msg.name.to_string(),
            BootstrapProcConfig {
                create_rank: msg.rank.unwrap(),
                client_config_override: msg.spec.client_config_override.clone(),
            },
        )
        .await
    }
    HostAgentMode::Local(host) => {
        host.spawn(msg.name.to_string(), ()).await
    }
};

That’s why the current ProcSpec can stay small: the outer resource message carries the mesh-y things (name, rank), and the spec only has to carry the “what should this proc’s client config look like” part.
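The split between the outer message and the spec can be modelled without any hyperactor types. In this sketch every `Toy*` name is hypothetical, a `BTreeMap<String, String>` stands in for Attrs, `Option<usize>` stands in for Rank, and the `log_level` key in the usage note is invented. Unlike the abridged real code, which calls `unwrap()` on the rank, the toy returns an error when no rank is bound.

```rust
use std::collections::BTreeMap;

/// Stand-in for ProcSpec: only the client config override.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ToyProcSpec {
    pub client_config_override: BTreeMap<String, String>,
}

/// Stand-in for the outer resource message: name and rank live here, not in the spec.
#[derive(Debug, Clone)]
pub struct ToyCreateOrUpdate<S> {
    pub name: String,
    pub rank: Option<usize>,
    pub spec: S,
}

#[derive(Debug, PartialEq)]
pub struct ToyBootstrapConfig {
    pub create_rank: usize,
    pub client_config_override: BTreeMap<String, String>,
}

/// What the (toy) host would be asked to spawn with.
#[derive(Debug, PartialEq)]
pub enum ToySpawnArgs {
    Process(ToyBootstrapConfig),
    Local, // corresponds to passing `()`
}

pub enum ToyMode {
    Process,
    Local,
}

/// Decide the spawn arguments from the host mode and the incoming message.
pub fn spawn_args(
    mode: &ToyMode,
    msg: &ToyCreateOrUpdate<ToyProcSpec>,
) -> Result<ToySpawnArgs, String> {
    match mode {
        ToyMode::Process => {
            let create_rank = msg
                .rank
                .ok_or_else(|| format!("no rank bound for {}", msg.name))?;
            Ok(ToySpawnArgs::Process(ToyBootstrapConfig {
                create_rank,
                client_config_override: msg.spec.client_config_override.clone(),
            }))
        }
        // Local hosts ignore both the rank and the override.
        ToyMode::Local => Ok(ToySpawnArgs::Local),
    }
}
```

Calling `spawn_args(&ToyMode::Process, &msg)` with a rank of 3 and a `log_level` override yields a config carrying both, while `ToyMode::Local` yields `ToySpawnArgs::Local` for the same message.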

2. How the host actually spawns#

When the agent calls host.spawn(name, …), the host itself is not doing the OS-level work. The host delegates to its configured proc manager:

  • process-backed host → Host<BootstrapProcManager>

  • in-proc/test host → Host<LocalProcManager<...>>

The manager is the thing that can “make a proc real” (fork/spawn, run the bootstrap command, wire the backchannel) and hand the host the proc name so the host can add it to the procs: HashSet<String> table and expose it as ProcId(frontend_addr, name).
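The delegation can be pictured as a host that is generic over a manager trait with an associated config type. This is a sketch under assumptions, not the real trait: `ToyProcManager`, `ToyLocalManager`, `ToyProcessManager` and `ToyHost` are hypothetical, a `usize` rank stands in for BootstrapProcConfig, a `(String, String)` pair stands in for ProcId(frontend_addr, name), and rejecting duplicate names is a choice made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical manager interface: each manager decides what config it needs.
pub trait ToyProcManager {
    type Config;
    fn spawn(&mut self, name: &str, config: Self::Config) -> Result<(), String>;
}

/// In-process manager: nothing to configure.
pub struct ToyLocalManager;

impl ToyProcManager for ToyLocalManager {
    type Config = ();
    fn spawn(&mut self, _name: &str, _config: Self::Config) -> Result<(), String> {
        Ok(())
    }
}

/// Process-backed manager: records what it would have launched.
pub struct ToyProcessManager {
    pub launched: Vec<(String, usize)>,
}

impl ToyProcManager for ToyProcessManager {
    type Config = usize; // stand-in for a bootstrap config; here just the rank
    fn spawn(&mut self, name: &str, config: Self::Config) -> Result<(), String> {
        self.launched.push((name.to_string(), config));
        Ok(())
    }
}

pub struct ToyHost<M> {
    pub frontend_addr: String,
    pub procs: HashSet<String>,
    pub manager: M,
}

impl<M: ToyProcManager> ToyHost<M> {
    pub fn new(frontend_addr: &str, manager: M) -> Self {
        ToyHost { frontend_addr: frontend_addr.to_string(), procs: HashSet::new(), manager }
    }

    /// Delegate to the manager, then record the name and expose (frontend_addr, name).
    pub fn spawn(&mut self, name: &str, config: M::Config) -> Result<(String, String), String> {
        if self.procs.contains(name) {
            return Err(format!("proc {} already exists", name));
        }
        self.manager.spawn(name, config)?;
        self.procs.insert(name.to_string());
        Ok((self.frontend_addr.clone(), name.to_string()))
    }
}
```

The associated type is what lets a local host accept `()` and a process-backed host demand a real config while sharing one `spawn` method.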

We’re not going to unpack the process-backed path here — that lives in “BootstrapProcManager (process-backed hosts)” where we can talk about commands, ready signals, and termination.

v1 bootstrap in one pass#

The reason the HostAgent has those six messages (create, stop, get-state, get-rank-status, list, shutdown) is that the v1 protocol treats “things on a host” as resources. A typical sequence is:

  1. Coordinator → hosts: send CreateOrUpdate<ProcSpec> to every host agent in the mesh (“each of you should have a proc called p0 with this rank/config”).

  2. Coordinator → hosts (later): send GetState<ProcState> (or GetRankStatus) to see which hosts actually brought that proc up and what address/command it got.

  3. Coordinator → hosts (teardown): send ShutdownHost to have each agent tell its host to terminate all children and drop the host.

Because everyone speaks this same resource shape — CreateOrUpdate<T>, GetState<T>, Stop, StopAll/ShutdownHost — the handlers on HostAgent all look the same, and the coordinator can fan the same message out to N hosts.
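The fan-out pattern can be sketched as a coordinator loop over in-memory agents. This is a toy model: there is no messaging or casting here, `ToyAgent`, `create_everywhere` and `collect_status` are hypothetical names, and assigning each host's rank from its index in the slice is an assumption made for the example.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ToyStatus {
    NotExist,
    Running,
}

/// Hypothetical agent that remembers which rank each named proc was created with.
#[derive(Default)]
pub struct ToyAgent {
    created: HashMap<String, usize>,
}

impl ToyAgent {
    pub fn create_or_update(&mut self, name: &str, rank: usize) {
        // Keep the first rank if the name is already known.
        self.created.entry(name.to_string()).or_insert(rank);
    }

    pub fn rank_status(&self, name: &str) -> Option<(usize, ToyStatus)> {
        self.created.get(name).map(|&rank| (rank, ToyStatus::Running))
    }
}

/// Step 1: send the same create request to every agent, one rank per host.
pub fn create_everywhere(agents: &mut [ToyAgent], name: &str) {
    for (rank, agent) in agents.iter_mut().enumerate() {
        agent.create_or_update(name, rank);
    }
}

/// Step 2: ask every agent for its status and merge the answers by rank.
pub fn collect_status(agents: &[ToyAgent], name: &str) -> BTreeMap<usize, ToyStatus> {
    let mut overlay = BTreeMap::new();
    for (index, agent) in agents.iter().enumerate() {
        let (rank, status) = agent
            .rank_status(name)
            .unwrap_or((index, ToyStatus::NotExist));
        overlay.insert(rank, status);
    }
    overlay
}
```

The coordinator code never branches on which kind of host it is talking to, which is the property the shared resource shape buys.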