§5 Bootstrapping from Python#

So far we described the Rust side: there is a host, the host has a HostAgent, and we send CreateOrUpdate<ProcSpec> etc. That’s the control plane.

Most users won’t do that by hand — they’ll write Python like this:

import asyncio

from monarch._src.actor.host_mesh import this_host
from monarch._src.actor.proc_mesh import ProcMesh  # Optional, for typing
from monarch._src.actor.actor import Actor
from monarch._src.actor.endpoint import endpoint

class Counter(Actor):
    ...

def train_with_mesh():
    mesh = this_host().spawn_procs(per_host={"gpus": 2})
    counter = mesh.spawn("counter", Counter, 1)

    ...

Getting a host in Python (this_host() → this_proc() → context())#

When you write code like:

from monarch._src.actor.host_mesh import this_host

host = this_host()

a bootstrap sequence runs underneath it. Here’s what actually happens.

1. this_host() reads the host mesh off the current proc#

From monarch/_src/actor/host_mesh.py:

def this_host() -> "HostMesh":
    """
    The current machine.

    This is just shorthand for looking it up via the context
    """
    return this_proc().host_mesh

So: this_host() doesn’t build a host. That means we have to look at this_proc().

2. this_proc() pulls the proc mesh off the current context#

From the same file:

def this_proc() -> "ProcMesh":
    """
    The current singleton process that this specific actor is
    running on
    """
    return context().actor_instance.proc

So now we’re down to the real root: context(). Everything hangs off of that.

3. context() — create (once) or return (later) the runtime context#

From monarch/_src/actor/actor_mesh.py:

_context: contextvars.ContextVar[Context] = contextvars.ContextVar(
    "monarch.actor_mesh._context"
)

and:

def context() -> Context:
    c = _context.get(None)
    if c is None:
        c = Context._root_client_context() # (1) ask Rust for a bare context
        _context.set(c)

        from monarch._src.actor.host_mesh import create_local_host_mesh
        from monarch._src.actor.proc_mesh import _get_controller_controller

        c.actor_instance.proc_mesh = _root_proc_mesh.get() # (2) give it a proc mesh
        _this_host_for_fake_in_process_host.get() # (3) make sure a host exists
        c.actor_instance._controller_controller = _get_controller_controller()[1]  # (4) wire control plane
    return c

So the logic is:

  1. First call: no context yet → build one.

  2. Later calls: return the same one from the ContextVar.

The interesting part is step (1) above — Context._root_client_context() — because that’s where Python hands off to Rust.
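The create-once, return-later behaviour can be tried in isolation with nothing but the standard library. The sketch below is a minimal illustration, not Monarch code: FakeContext is a hypothetical stand-in for the Rust-backed Context, and the ContextVar name is made up.

```python
import contextvars


class FakeContext:
    """Hypothetical stand-in for the Rust-backed Context (not Monarch's class)."""

    created = 0  # how many root contexts have been built so far

    def __init__(self) -> None:
        FakeContext.created += 1


_context: contextvars.ContextVar[FakeContext] = contextvars.ContextVar(
    "sketch._context"  # assumed name, for illustration only
)


def context() -> FakeContext:
    c = _context.get(None)
    if c is None:
        # First call in this execution context: build and remember it.
        c = FakeContext()
        _context.set(c)
    return c


first = context()
second = context()  # same object, served from the ContextVar

# A ContextVar is scoped to an execution context: a brand-new, empty
# contextvars.Context has no value yet, so the sketch builds another one there.
isolated = contextvars.Context().run(context)
```

In this sketch, first and second are the same object, while isolated is a separate one. That is a property of ContextVar itself; how Monarch behaves across execution contexts is not covered here.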

4. What Context._root_client_context() does (Rust side)#

The Rust in context.rs:

#[staticmethod]
fn _root_client_context(py: Python<'_>) -> PyResult<PyContext> {
    let _guard = runtime::get_tokio_runtime().enter();
    let instance: PyInstance = global_root_client().into();
    Ok(PyContext {
        instance: instance.into_pyobject(py)?.into(),
        rank: Extent::unity().point_of_rank(0).unwrap(),
    })
}

What matters is the call to global_root_client(). That function, on the Rust side, basically does this:

pub fn global_root_client() -> &'static Instance<()> {
    static GLOBAL_INSTANCE: OnceLock<(Instance<()>, ActorHandle<()>)> = OnceLock::new();
    &GLOBAL_INSTANCE.get_or_init(|| {
        // 1. Make a direct proc for the client to live in.
        let client_proc = Proc::direct_with_default(
            ChannelAddr::any(default_transport()),
            "mesh_root_client_proc".into(),
            router::global().clone().boxed(),
        ).unwrap();

        // 2. Register that proc in the *global* router so messages can reach it.
        router::global().bind(
            client_proc.proc_id().clone().into(),
            client_proc.clone(),
        );

        // 3. Start an actual actor instance in that proc, called "client".
        let (client, handle) = client_proc.instance("client").expect("root instance create");

        (client, handle)
    }).0
}

So when _root_client_context() runs, it is really:

  1. Ensuring there is a single, global, direct-addressed proc called “mesh_root_client_proc”.

  2. Putting that proc in the global router.

  3. Spawning a “client” actor in it.

  4. Wrapping that actor as a Python PyContext and giving it rank 0.

Notice what it doesn’t do: it does not attach a proc mesh or a host mesh. Those Python-only fields are still None at this point.

5. Python fills in the missing pieces#

That’s why, back in Python, right after calling the Rust function, we do three extra things:

c.actor_instance.proc_mesh = _root_proc_mesh.get()
_this_host_for_fake_in_process_host.get()
c.actor_instance._controller_controller = _get_controller_controller()[1]

Here’s what each does:

  1. _root_proc_mesh: _Lazy["ProcMesh"] = _Lazy(_init_root_proc_mesh), where _init_root_proc_mesh is defined as:

def _init_root_proc_mesh() -> "ProcMesh":
    from monarch._src.actor.host_mesh import fake_in_process_host

    return fake_in_process_host()._spawn_nonblocking(
        name="root_client_proc_mesh",
        per_host=Extent([], []),
        setup=None,
        _attach_controller_controller=False,
    )

So this:

  • makes a fake in-process host,

  • spawns one proc on it,

  • returns the resulting proc mesh, which context() then stores as c.actor_instance.proc_mesh.

Later, when you call this_proc() (which reads context().actor_instance.proc), you’re really just getting a slice of that stored proc_mesh.

  2. _this_host_for_fake_in_process_host: _Lazy["HostMesh"] = _Lazy(_init_this_host_for_fake_in_process_host), where the initializer is defined as:

def _init_this_host_for_fake_in_process_host() -> "HostMesh":
    from monarch._src.actor.host_mesh import create_local_host_mesh
    return create_local_host_mesh()

This is the lazy “make me a host mesh” step. It just calls create_local_host_mesh(...) from the v1 Python bindings.

We get into what that does in detail in “Python create_local_host_mesh and Rust bootstrap” below, so here we just say: this line is what actually spins up the local v1 host mesh using the same Rust path as the canonical bootstrap.

  3. _get_controller_controller()[1]: we stash the control-plane actor into c.actor_instance._controller_controller so later spawns have somewhere to go. We aren’t going to unpack that here.
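The lazy values in steps 1 and 2 follow a familiar pattern: wrap an initializer, run it on the first get(), and cache the result. The internals of Monarch's _Lazy are not shown in this section, so the Lazy class below is a hypothetical minimal version, and the two initializers are toy stand-ins. The assumption that building the proc mesh pulls in the host is made purely for illustration.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
_UNSET = object()  # sentinel, so that None could be a valid cached value


class Lazy(Generic[T]):
    """Hypothetical lazy wrapper: runs init() on first get(), then caches.

    Not thread-safe; a real implementation may need locking.
    """

    def __init__(self, init: Callable[[], T]) -> None:
        self._init = init
        self._value: object = _UNSET

    def get(self) -> T:
        if self._value is _UNSET:
            self._value = self._init()
        return self._value  # type: ignore[return-value]


events: List[str] = []  # records the order in which initializers run


def _init_host() -> str:
    events.append("host")
    return "fake-host-mesh"


def _init_root_proc() -> str:
    events.append("proc")
    # Assumption for illustration: the proc mesh is built on the lazy host.
    return f"proc-mesh-on-{host.get()}"


host = Lazy(_init_host)
root_proc = Lazy(_init_root_proc)

events_before = list(events)  # nothing has run yet: definition is free
value = root_proc.get()       # runs _init_root_proc, which triggers _init_host
host.get()                    # already cached, initializer does not run again
root_proc.get()               # likewise cached
```

Defining a lazy value costs nothing; the work happens on the first get(), which in the real code is the first context() call.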

6. Now this_proc() / this_host() work#

After that first context() run:

  • context().actor_instance.proc is set, so this_proc() returns a real ProcMesh.

  • That proc mesh was created from a host mesh, so it already carries a host_mesh reference; that’s why this_host() can just do this_proc().host_mesh.

So the original Python snippet:

mesh = this_host().spawn_procs(per_host={"gpus": 2})
counter = mesh.spawn("counter", Counter, 1)

works because:

  1. this_host() → got a HostMesh that Python created during context() bootstrap

  2. spawn_procs(...) → asks that host mesh (which is powered by the Rust v1 host mesh) to create procs

  3. mesh.spawn(...) → now that you have a ProcMesh, you can put actors on it

Python create_local_host_mesh and Rust bootstrap#

This note shows that calling create_local_host_mesh(...) in Python ends up driving the same Rust v1 host/agent/bootstrap path we described for the canonical Rust example.

1. Python entry point#

def create_local_host_mesh(
    extent: Optional[Extent] = None, env: Optional[Dict[str, str]] = None
) -> "HostMesh":
    cmd, args, bootstrap_env = _get_bootstrap_args()
    if env is not None:
        bootstrap_env.update(env)

    return HostMesh.allocate_nonblocking(
        "local_host",
        extent if extent is not None else Extent([], []),
        ProcessAllocator(cmd, args, bootstrap_env),
        bootstrap_cmd=_bootstrap_cmd(),
    )

  • _get_bootstrap_args() answers “what command, arguments, and environment do we use to start a hyperactor proc?”

  • we wrap that in a ProcessAllocator(...)

  • we tell the Rust side to allocate_nonblocking(...) a v1 HostMesh using that allocator.
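The env handling in that function is easy to check on its own: caller-supplied entries are merged over the defaults with dict.update, so they win on conflict. The sketch below imitates the command/args/env triple with a child Python process. toy_bootstrap_args, launch, and the TOY_* variables are hypothetical, and inheriting the parent environment is an assumption of the sketch, not a statement about ProcessAllocator.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional, Tuple


def toy_bootstrap_args() -> Tuple[str, List[str], Dict[str, str]]:
    """Hypothetical stand-in for _get_bootstrap_args(): (cmd, args, env)."""
    code = "import os; print(os.environ['TOY_ROLE'], os.environ['TOY_LOG'])"
    return sys.executable, ["-c", code], {"TOY_ROLE": "proc", "TOY_LOG": "info"}


def launch(env: Optional[Dict[str, str]] = None) -> str:
    cmd, args, bootstrap_env = toy_bootstrap_args()
    if env is not None:
        bootstrap_env.update(env)  # caller entries override the defaults
    # Assumption of this sketch: the child also inherits the parent's env.
    child_env = {**os.environ, **bootstrap_env}
    done = subprocess.run(
        [cmd, *args], env=child_env, capture_output=True, text=True, check=True
    )
    return done.stdout.strip()


default_output = launch()                        # defaults only
override_output = launch({"TOY_LOG": "debug"})   # one default overridden
```

Here the child process prints the two variables it was started with, so the second launch reflects the overridden TOY_LOG while TOY_ROLE keeps its default.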

2. Hand-off to Rust#

The Python classmethod does:

await HyHostMesh.allocate_nonblocking(
    context().actor_instance._as_rust(),
    await alloc._hy_alloc,
    name,
    bootstrap_cmd,
)

It passes the allocation and (optionally) the bootstrap command straight to the Rust v1 HostMesh::allocate(...), via the PyHostMesh::allocate_nonblocking(...) binding. That’s the same Rust entry point the canonical bootstrap uses — just exposed to Python.

#[pymethods]
impl PyHostMesh {
    #[classmethod]
    fn allocate_nonblocking(
        _cls: &Bound<'_, PyType>,
        instance: &PyInstance,
        alloc: &mut PyAlloc,
        name: String,
        bootstrap_params: Option<PyBootstrapCommand>,
    ) -> PyResult<PyPythonTask> {
        let bootstrap_params =
            bootstrap_params.map_or_else(|| alloc.bootstrap_command.clone(), |b| Some(b.to_rust()));
        let alloc = match alloc.take() {
            Some(alloc) => alloc,
            None => {
                return Err(PyException::new_err(
                    "Alloc object already used".to_string(),
                ));
            }
        };
        let instance = instance.clone();
        PyPythonTask::new(async move {
            let mesh = instance_dispatch!(instance, async move |cx_instance| {
                HostMesh::allocate(cx_instance, alloc, &name, bootstrap_params).await
            })
            .map_err(|err| PyException::new_err(err.to_string()))?;
            Ok(Self::new_owned(mesh))
        })
    }
}

(This returns a Python task because all v1 Python bindings wrap Rust async in a small bridge. See Appendix: Python async bridge (pytokio).)
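Two details of the binding are worth seeing in miniature: the alloc is consumed by take(), so a second use fails with "Alloc object already used", and the ownership check happens synchronously before any async work is scheduled. The sketch below mirrors that shape in pure Python with asyncio. OneShot and toy_allocate_nonblocking are hypothetical names, and the sleep stands in for the real allocation.

```python
import asyncio
from typing import Awaitable, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class OneShot(Generic[T]):
    """Hypothetical holder whose value can be taken exactly once."""

    def __init__(self, value: T) -> None:
        self._value: Optional[T] = value

    def take(self) -> Optional[T]:
        # Hand the value out and leave None behind.
        value, self._value = self._value, None
        return value


def toy_allocate_nonblocking(alloc: "OneShot[str]", name: str) -> Awaitable[str]:
    # Synchronous part: claim ownership before creating the awaitable.
    inner = alloc.take()
    if inner is None:
        raise RuntimeError("Alloc object already used")

    async def run() -> str:
        await asyncio.sleep(0)  # placeholder for the real async allocation
        return f"{name} on {inner}"

    return run()


async def main() -> Tuple[str, Optional[str]]:
    alloc = OneShot("process-allocator")
    first = await toy_allocate_nonblocking(alloc, "local_host")
    try:
        await toy_allocate_nonblocking(alloc, "local_host")
        second_error = None
    except RuntimeError as err:
        second_error = str(err)
    return first, second_error


first, second_error = asyncio.run(main())
```

The first call succeeds and the second raises immediately, before anything is awaited, which matches the order of operations in the Rust method above.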

HostMesh::allocate(...) is the entry point that stands up the host, creates its system proc, spawns the HostAgent, and makes it reachable — it’s the same path we used in the Rust canonical example.