Host & agents (control plane & mux)#
In the runtime, a host is the thing that owns “all the procs on this machine” and gives them a single front door. The Rust type looks like this:
// from hyperactor/src/host.rs
pub struct Host<M> {
    procs: HashSet<String>,
    frontend_addr: ChannelAddr,
    backend_addr: ChannelAddr,
    router: DialMailboxRouter,
    manager: M,
    service_proc: Proc,
    local_proc: Proc,
    frontend_rx: Option<ChannelRx<MessageEnvelope>>,
}
Visually, you can think of it like this:
                         ┌────────────┐
                   ┌───▶ │ proc *,1   │
                   │   #1└────────────┘
                   │
 ┌──────────┐      │     ┌────────────┐
 │   Host   │◀─────┼───▶ │ proc *,2   │
*└──────────┘#     │   #2└────────────┘
                   │
                   │     ┌────────────┐
                   └───▶ │ proc *,3   │
                       #3└────────────┘
- * is the host’s frontend address (frontend_addr). This is the address other mesh participants know.
- # is the host’s backend address (backend_addr). Procs talk to the host here.
- #1, #2, #3 are the per-proc backend channels the host records in router: DialMailboxRouter.

Each box proc *,N is a proc that is direct-addressed via the host: its id is essentially “proc at * named N”.
What the fields mean#
- frontend_addr: the single, public entry point. Messages from the rest of the mesh arrive here.
- backend_addr: the host-side address that procs dial to reach the host.
- procs: a set of proc names managed by this host.
- router: DialMailboxRouter: the machinery that actually multiplexes/demultiplexes between * and the per-proc channels.
- manager: M: the thing that can create and destroy procs on this host. In the real bootstrapped case this is a BootstrapProcManager; in tests it can be a local manager.
- service_proc: the host’s system proc handle, so the host can participate in the same message world it is hosting.
- local_proc: an additional local proc for in-process operations.
- frontend_rx: the optional channel receiver for the frontend address (consumed when serving starts).
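To make the mux concrete, here is a toy model of the demux step (illustrative only: ToyHost and route are stand-ins, not the real DialMailboxRouter API). A message arrives at the frontend for a named proc and is forwarded to that proc’s backend channel:

// Toy model of the host's mux; all names here are illustrative.
use std::collections::HashMap;

struct ToyHost {
    frontend_addr: String,           // "*": the address the mesh dials
    routes: HashMap<String, String>, // proc name -> backend channel ("#1", "#2", ...)
}

impl ToyHost {
    // Demux: a message arriving at "*" for proc `name` is forwarded
    // to that proc's backend channel, if we know it.
    fn route(&self, name: &str) -> Option<&str> {
        self.routes.get(name).map(String::as_str)
    }
}

fn main() {
    let host = ToyHost {
        frontend_addr: "*".to_string(),
        routes: HashMap::from([("1".to_string(), "#1".to_string())]),
    };
    // A message for proc "*,1" arriving at the frontend goes out on "#1".
    assert_eq!(host.route("1"), Some("#1"));
    let _ = host.frontend_addr; // kept for symmetry with the real struct
}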
Why this matters for bootstrapping#
When, in chapter 4, we say “now ask each host to spawn a proc,” this is the piece that makes it possible. The host already has:
- a public address (*),
- a routing table for its existing procs,
- and a manager capable of creating new ones.
So the host-mesh agent can receive a “create proc” request over the mesh protocol and hand it to the host, and the host will add another box to the diagram above.
Code-level view#
At the control plane we have the mesh-facing actor:
// hyperactor_mesh/src/v1/host_mesh/mesh_agent.rs (abridged)
pub struct HostAgent {
    host: Option<HostAgentMode>,
    created: HashMap<Name, ProcCreationState>,
}
The reason it’s an Option is that the agent can exist before (or after) the host is actually running.
The host field is one of two shapes:
// "How are we running this host?"
pub enum HostAgentMode {
    // Real OS process, uses BootstrapProcManager underneath.
    Process(Host<BootstrapProcManager>),
    // In-process/testing host, uses a local proc manager.
    Local(Host<LocalProcManager<ProcManagerSpawnFn>>),
}
Both variants wrap the same Host<M> struct we drew earlier as the mux; the type parameter M is filled in by the proc manager (BootstrapProcManager for real processes, LocalProcManager for in-process tests).
So the layering from the code’s point of view is:
HostAgent (the actor you message over v1)
→ maybe a HostAgentMode
→ definitely a Host<...> once materialized
→ which, through its manager (e.g. BootstrapProcManager), owns/spawns the procs and does the * / #n routing.
HostAgent message handling#
The agent is exported with exactly these handlers:
#[hyperactor::export(
    handlers = [
        resource::CreateOrUpdate<ProcSpec>,
        resource::Stop,
        resource::GetState<ProcState>,
        resource::GetRankStatus { cast = true },
        resource::List,
        ShutdownHost,
    ]
)]
pub struct HostAgent {
    host: Option<HostAgentMode>,
    created: HashMap<Name, ProcCreationState>,
    local_mesh_agent: OnceCell<anyhow::Result<ActorHandle<ProcAgent>>>,
}
So everything it does is one of those 6 messages.
1. CreateOrUpdate#
- If we already have created[name], do nothing (idempotent).
- Otherwise call host.spawn(name, ...): process-backed hosts get a BootstrapProcConfig, local hosts get ().
- Store { rank, created_result, stopped: false } in created.
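As a minimal sketch of that create path (the ProcCreationState shape below is inferred from the steps above, not the real type):

// Sketch only: a toy stand-in for the agent's create path.
use std::collections::HashMap;

struct ProcCreationState {
    rank: usize,                 // stand-in for the real Rank
    created: Result<(), String>, // stand-in for the spawn result
    stopped: bool,
}

fn create_or_update(
    created: &mut HashMap<String, ProcCreationState>,
    name: String,
    rank: usize,
) {
    // Idempotent: a second CreateOrUpdate for the same name is a no-op.
    if created.contains_key(&name) {
        return;
    }
    // The real code calls host.spawn(name, ...) here and records its result.
    let result: Result<(), String> = Ok(());
    created.insert(name, ProcCreationState { rank, created: result, stopped: false });
}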
2. Stop#
- Look up created[name].
- If it was successfully created, call host.terminate_proc(..., timeout) and mark it stopped = true.
- Reply with a StatusOverlay for that rank (or empty if we never had it).
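A corresponding sketch of the stop path, with the same toy types (the (rank, status) pairs stand in for the real StatusOverlay):

// Sketch only; reuses the toy ProcCreationState shape from above.
use std::collections::HashMap;

struct ProcCreationState { rank: usize, created: Result<(), String>, stopped: bool }

fn stop(created: &mut HashMap<String, ProcCreationState>, name: &str) -> Vec<(usize, String)> {
    let Some(state) = created.get_mut(name) else {
        return Vec::new(); // we never had this proc: empty overlay
    };
    if state.created.is_ok() {
        // The real code calls host.terminate_proc(..., timeout).await here.
        state.stopped = true;
    }
    vec![(state.rank, "stopped".to_string())] // overlay for that one rank
}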
3. GetRankStatus#
- Look up created[name].
- If present, return that rank as Running/Stopped/Failed (depending on what we know).
- Otherwise return NotExist.
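The status decision is then a match on what we recorded; a sketch with the same toy types (the real RankStatus lives in the resource module and may differ):

// Sketch only: toy types, real decision order.
use std::collections::HashMap;

struct ProcCreationState { rank: usize, created: Result<(), String>, stopped: bool }

enum RankStatus { Running, Stopped, Failed, NotExist }

fn rank_status(created: &HashMap<String, ProcCreationState>, name: &str) -> RankStatus {
    match created.get(name) {
        None => RankStatus::NotExist,                        // never created here
        Some(s) if s.created.is_err() => RankStatus::Failed, // spawn failed
        Some(s) if s.stopped => RankStatus::Stopped,         // stopped earlier
        Some(_) => RankStatus::Running,
    }
}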
4. GetState#
Same lookup, but return the richer state:
- the proc’s direct id at the host,
- the rank we used,
- the proc’s own ProcAgent ref,
- and any bootstrap/process status the host’s manager could provide.
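A plausible, purely illustrative shape for that reply (the real ProcState is defined in hyperactor_mesh’s resource module and differs in names and types):

// Illustrative only: field names and types are stand-ins.
struct ToyProcState {
    proc_id: String,                // the proc's direct id at the host
    rank: usize,                    // the rank used at creation
    agent: String,                  // stand-in for the proc's ProcAgent ref
    process_status: Option<String>, // bootstrap/process status, if the manager has it
}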
5. List#
Return all the proc names that have been created on this host (the keys from the created map).
6. ShutdownHost#
- Ack first so the caller can await.
- Take the host out of self.
- Call host.terminate_children(...) (process vs local path) with the provided timeout and concurrency.
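A minimal sketch of that sequence: the “take the host out of self” step is just Option::take. ToyHost and the terminate_children signature here are stand-ins:

// Sketch only: ToyHost and its method signature are made up.
struct ToyHost;

impl ToyHost {
    fn terminate_children(&self, _timeout_secs: u64, _concurrency: usize) {
        // The real code terminates every spawned proc, via the
        // process-backed or local path depending on HostAgentMode.
    }
}

struct ToyAgent { host: Option<ToyHost> }

impl ToyAgent {
    fn shutdown_host(&mut self, timeout_secs: u64, concurrency: usize) {
        // 1. Ack first (omitted here) so the caller can await.
        // 2. Take the host out of self: later messages will see None.
        if let Some(host) = self.host.take() {
            // 3. Terminate all children before dropping the host.
            host.terminate_children(timeout_secs, concurrency);
        }
    }
}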
Why this exists#
Host is local; HostAgent is the remote handle for it. Bootstrap code just sends CreateOrUpdate/Stop/GetState to the agent; the agent is the one that actually owns the Host and can spawn/stop procs. That’s why all handlers use the shared resource messages.
1. ProcSpec (what we tell the host to run)#
In all the examples above we sent resource::CreateOrUpdate<ProcSpec>, and in the code that really is what’s happening — but the current ProcSpec is intentionally very thin.
From hyperactor_mesh/src/resource.rs:
/// Spec for a host mesh agent to use when spawning a new proc.
#[derive(Clone, Debug, Serialize, Deserialize, Named, Default)]
pub(crate) struct ProcSpec {
    /// Config values to set on the spawned proc's global config,
    /// at the `ClientOverride` layer.
    pub(crate) client_config_override: Attrs,
}
So right now:
- the spec is private (pub(crate)) and has exactly one field: client_config_override: Attrs;
- the rank is not here; it’s on the outer message:
pub struct CreateOrUpdate<S> {
    pub name: Name,
    #[binding(include)]
    pub rank: Rank,
    pub spec: S,
}
What the HostAgent actually does matches this shape:
- if the host is process-backed (HostAgentMode::Process(...)), it builds a BootstrapProcConfig using the rank from CreateOrUpdate::<ProcSpec> and the client_config_override from ProcSpec, and passes that to host.spawn(...);
- if the host is local (HostAgentMode::Local(...)), it just calls host.spawn(name, ()) and ignores the override.
Here is the bit of real code that does exactly that (abridged to just the decision):
// from hyperactor_mesh/src/v1/host_mesh/mesh_agent.rs
// (`impl Handler<resource::CreateOrUpdate<ProcSpec>> for HostAgent`)
let created = match host {
    HostAgentMode::Process(host) => {
        host.spawn(
            msg.name.to_string(),
            BootstrapProcConfig {
                create_rank: msg.rank.unwrap(),
                client_config_override: msg.spec.client_config_override.clone(),
            },
        )
        .await
    }
    HostAgentMode::Local(host) => {
        host.spawn(msg.name.to_string(), ()).await
    }
};
That’s why the current ProcSpec can stay small: the outer resource message carries the mesh-y things (name, rank), and the spec only has to carry the “what should this proc’s client config look like” part.
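To make the split concrete, here is a hypothetical helper that assembles the outer message. It mirrors the structs shown above; how Name and Rank are actually minted comes from the mesh APIs, not this sketch:

// Hypothetical helper, not mesh API: it only shows how the outer
// message and the thin spec fit together.
fn make_create(name: Name, rank: Rank) -> resource::CreateOrUpdate<ProcSpec> {
    resource::CreateOrUpdate {
        name, // mesh-y things live on the outer message
        rank,
        spec: ProcSpec {
            // empty override: the proc keeps its default client config
            client_config_override: Attrs::default(),
        },
    }
}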
2. How the host actually spawns#
When the agent calls host.spawn(name, …), the host itself is not doing the OS-level work. The host delegates to its configured proc manager:
- process-backed host → Host<BootstrapProcManager>
- in-proc/test host → Host<LocalProcManager<...>>
The manager is the thing that can “make a proc real” (fork/spawn, run the bootstrap command, wire the backchannel). It hands the proc’s name back to the host, which records it in the procs: HashSet<String> table and exposes it as ProcId(frontend_addr, name).
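This is also why Host<M> is generic: each manager brings its own config type, which is exactly why the process-backed spawn takes a BootstrapProcConfig while the local one takes (). A plausible shape for that contract (illustrative; the real trait differs in names, async-ness, and error types):

// Illustrative contract only, not the real hyperactor trait.
trait ToyProcManager {
    type Config; // BootstrapProcConfig for processes, () for local

    // Make the proc real (fork/spawn or in-process) and report back
    // so the host can record the name and expose ProcId(frontend, name).
    fn spawn(&mut self, name: String, config: Self::Config) -> Result<(), String>;
}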
We’re not going to unpack the process-backed path here — that lives in “BootstrapProcManager (process-backed hosts)” where we can talk about commands, ready signals, and termination.
v1 bootstrap in one pass#
The reason the HostAgent has those six messages (create, stop, get-state, get-rank-status, list, shutdown) is that the v1 protocol treats “things on a host” as resources. A typical sequence is:
1. Coordinator → hosts: send CreateOrUpdate<ProcSpec> to every host agent in the mesh (“each of you should have a proc called p0 with this rank/config”).
2. Coordinator → hosts (later): send GetState<ProcState> (or GetRankStatus) to see which hosts actually brought that proc up and what address/command it got.
3. Coordinator → hosts (teardown): send ShutdownHost to have each agent tell its host to terminate all children and drop the host.
Because everyone speaks this same resource shape — CreateOrUpdate<T>, GetState<T>, Stop, StopAll/ShutdownHost — the handlers on HostAgent all look the same, and the coordinator can fan the same message out to N hosts.
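As a toy sketch of that fan-out (the message shapes and the send function are stand-ins for the real typed actor messaging):

// Sketch only: toy message shapes, toy transport.
enum Msg {
    CreateOrUpdate { name: String, rank: usize },
    GetRankStatus { name: String },
    ShutdownHost,
}

// One pass of the v1 bootstrap against n_hosts host agents.
fn coordinate(n_hosts: usize, send: impl Fn(usize, Msg)) {
    // 1. create: every host should run a proc named "p0"
    for h in 0..n_hosts {
        send(h, Msg::CreateOrUpdate { name: "p0".into(), rank: h });
    }
    // 2. observe: which hosts actually brought it up, and with what status?
    for h in 0..n_hosts {
        send(h, Msg::GetRankStatus { name: "p0".into() });
    }
    // 3. teardown: every agent terminates its children and drops its host
    for h in 0..n_hosts {
        send(h, Msg::ShutdownHost);
    }
}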