Mesh-level admin surface for topology introspection and reference walking.
This module defines MeshAdminAgent, an actor that exposes a
uniform, reference-based HTTP API over an entire host mesh. Every
addressable entity in the mesh is represented as a NodePayload
and resolved via typed NodeRef references (parsed from HTTP
path strings at the request boundary).
Incoming HTTP requests are bridged into the actor message loop
using ResolveReferenceMessage, ensuring that all topology
resolution and data collection happens through actor messaging.
The agent fans out to HostAgent instances to fetch host,
proc, and actor details, then normalizes them into a single
tree-shaped model (NodeProperties + children references)
suitable for topology-agnostic clients such as the admin TUI.
§Schema strategy
The external API contract is schema-first: the JSON Schema
(Draft 2020-12) served at GET /v1/schema is the
authoritative definition of the response shape. The error
envelope schema is at GET /v1/schema/error.
Schema and OpenAPI are derived from the HTTP boundary DTO types
in crate::introspect::dto (NodePayloadDto,
NodePropertiesDto, FailureInfoDto) via
schemars::JsonSchema. The domain types (NodePayload,
NodeProperties, FailureInfo) do not carry JsonSchema —
they own the typed internal model; the DTOs own the wire
contract.
This follows the “Admin Gateway Pattern” RFC (doc): schema is the product; transports and tooling are projections.
§Schema generation pipeline
1. #[derive(JsonSchema)] on NodePayloadDto, NodePropertiesDto, FailureInfoDto, ApiError, ApiErrorEnvelope.
2. schemars::schema_for!(T) produces a Schema value at runtime (Draft 2020-12).
3. The serve_schema / serve_error_schema handlers inject a $id field (SC-4) and serve the result as JSON.
4. Snapshot tests in introspect::tests compare the raw schemars output (without $id) against checked-in golden files to detect drift (SC-2).
5. Validation tests construct domain payloads, convert to DTOs, and confirm the serialized DTOs pass schema validation (SC-3).
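The `$id` injection and snapshot comparison steps above can be sketched without any JSON library. This is a minimal illustration, not the crate's implementation: `schema_id` and `check_snapshot` are hypothetical helper names, and the schema is treated as opaque text rather than a schemars `Schema` value.

```rust
/// Hypothetical helper: build the `$id` for a served schema (SC-4).
/// The URL shape follows the example given for SC-4 in this document.
fn schema_id(api_version: &str, schema_name: &str) -> String {
    format!("https://monarch.meta.com/schemas/{api_version}/{schema_name}")
}

/// Hypothetical helper: compare freshly generated schema text against the
/// contents of a checked-in golden file (SC-2). Trailing whitespace is
/// ignored so a final newline in the golden file does not count as drift.
fn check_snapshot(generated: &str, golden: &str) -> Result<(), String> {
    if generated.trim_end() == golden.trim_end() {
        Ok(())
    } else {
        Err("schema drift detected; regenerate the golden files if intentional".to_string())
    }
}
```

Keeping `$id` out of the snapshot comparison, as the pipeline describes, means the golden files track only what schemars derives from the DTO types.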
§Regenerating snapshots
After intentional changes to the DTO types
(NodePayloadDto, NodePropertiesDto, FailureInfoDto),
ApiError, or ApiErrorEnvelope, regenerate the golden
files:
buck run fbcode//monarch/hyperactor_mesh:generate_api_artifacts \
@fbcode//mode/dev-nosan -- \
fbcode/monarch/hyperactor_mesh/src/testdata
Or via cargo:
cargo run -p hyperactor_mesh --bin generate_api_artifacts -- \
hyperactor_mesh/src/testdata
Then re-run tests to confirm the new snapshot passes.
§Schema invariants (SC-*)
- SC-1 (schema-derived): Schema is derived from the DTO types via schemars::JsonSchema, not hand-written.
- SC-2 (schema-snapshot-stability): Schema changes must be explicit; a snapshot test catches unintentional drift.
- SC-3 (schema-payload-conformance): Domain payloads converted to DTOs validate against the generated schema.
- SC-4 (schema-version-identity): Served schemas carry a $id tied to the API version (e.g. https://monarch.meta.com/schemas/v1/node_payload).
- SC-5 (route-precedence): Literal schema routes are matched by specificity before the {*reference} wildcard (axum 0.8 specificity-based routing).
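The effect of SC-5 can be illustrated with a small std-only matcher. This is a sketch of the precedence rule only, not of axum's router: the `Route` enum and `route` function are hypothetical, and mounting the wildcard under `/v1/` is an assumption made for the example.

```rust
/// Which handler a path resolves to in this sketch.
#[derive(Debug, PartialEq)]
enum Route {
    Schema,
    ErrorSchema,
    Reference(String),
}

/// Hypothetical matcher: literal routes are checked before the wildcard,
/// so `/v1/schema` is never captured as a reference (SC-5).
fn route(path: &str) -> Option<Route> {
    match path {
        "/v1/schema" => Some(Route::Schema),
        "/v1/schema/error" => Some(Route::ErrorSchema),
        // Assumed wildcard mount point: everything else under `/v1/`.
        _ => path
            .strip_prefix("/v1/")
            .filter(|rest| !rest.is_empty())
            .map(|rest| Route::Reference(rest.to_string())),
    }
}
```

With a specificity-based router the same outcome holds regardless of registration order, which is what the invariant relies on.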
Note on ApiError.details: the derived schema is maximally
permissive for details (any valid JSON). This is intentional
for v1 — details is a domain-specific escape hatch.
Consumers must not assume a fixed shape.
§Introspection visibility policy
Admin tooling only displays introspectable nodes: entities
that are reachable via actor messaging and respond to
IntrospectMessage. Infrastructure procs that are
non-routable are intentionally opaque to introspection and
are omitted from the navigation graph.
§Definitions
Routable — an entity is routable if the system can address it
via the routing layer and successfully deliver a message to it
using a Reference / ActorId (i.e., there exists a live mailbox
sender reachable through normal routing). Practical test: “can I
send IntrospectMessage::Query to it and get a reply?”
Non-routable — an entity is non-routable if it has no
externally reachable mailbox sender in the routing layer, so
message delivery is impossible by construction (even if you know
its name). Examples: hyperactor_runtime[0], mailbox_server[N],
local[N] — these use PanickingMailboxSender and are never
bound to the router.
Introspectable — tooling can obtain a NodePayload for this
node by sending IntrospectMessage to a routable actor.
Opaque — the node exists but is not introspectable via messaging; tooling cannot observe it through the introspection protocol.
§Proc visibility
A proc is not directly introspected; actors are. Tooling
synthesizes proc-level nodes by grouping introspectable actors by
ProcId.
A proc is visible iff there exists at least one actor on that proc
whose ActorId is deliverable via the routing layer (i.e., the
actor has a bound mailbox sender reachable through normal routing)
and responds to IntrospectMessage.
The rule is: if an entity is routable via the mesh routing layer
(i.e., tooling can deliver IntrospectMessage::Query to one of its
actors), then it is introspectable and appears in the admin graph.
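The proc visibility rule can be sketched as a grouping step. The `ActorInfo` record and `visible_procs` function below are hypothetical simplifications (proc and actor identifiers are plain strings, and introspectability is a precomputed flag); they only illustrate that a proc node exists when at least one of its actors is introspectable.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified actor record for illustration.
struct ActorInfo {
    proc_id: String,
    name: String,
    /// True if the actor is routable and answers introspection messages.
    introspectable: bool,
}

/// Group introspectable actors by proc. A proc appears in the result only
/// if at least one of its actors is introspectable.
fn visible_procs(actors: &[ActorInfo]) -> BTreeMap<String, Vec<String>> {
    let mut procs: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for actor in actors.iter().filter(|a| a.introspectable) {
        procs
            .entry(actor.proc_id.clone())
            .or_default()
            .push(actor.name.clone());
    }
    procs
}
```

A proc whose actors are all non-routable never gets an entry, which matches the "opaque" behavior described for infrastructure procs.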
§Navigation identity invariants (NI-*)
Every NodePayload in the topology tree satisfies:
- NI-1 (identity = reference): A node’s identity: NodeRef must correspond to the reference used to resolve it. The display form of identity round-trips through NodeRef::from_str.
- NI-2 (parent = containment parent): A node’s parent: Option<NodeRef> records its canonical containment parent, not the inverse of every navigation edge. Specifically: root → None, host → Root, proc → Host(…), actor → Proc(…). An actor’s parent is always its owning proc, even when the actor also appears as a child of another actor via supervision.
- NI-3 (children = navigation graph): A node’s children is the admin navigation graph. Actor-to-actor supervision links coexist with proc→actor membership links without changing parent. The same actor may therefore appear in children of both its proc and its supervising actor.
Together these ensure that the TUI can correlate responses to tree nodes, and that upward/downward navigation is consistent.
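The round-trip property in NI-1 can be sketched with a simplified reference type. The enum below is a hypothetical stand-in for `NodeRef`: the real variants carry typed ids, and the `kind:id` string syntax is an assumption made for illustration, not the crate's actual format.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical, simplified stand-in for `NodeRef` (ids are plain strings).
#[derive(Debug, Clone, PartialEq)]
enum NodeRef {
    Root,
    Host(String),
    Proc(String),
    Actor(String),
}

impl fmt::Display for NodeRef {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            NodeRef::Root => write!(f, "root"),
            NodeRef::Host(id) => write!(f, "host:{id}"),
            NodeRef::Proc(id) => write!(f, "proc:{id}"),
            NodeRef::Actor(id) => write!(f, "actor:{id}"),
        }
    }
}

impl FromStr for NodeRef {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s == "root" {
            return Ok(NodeRef::Root);
        }
        // Split at the first ':' so ids may themselves contain colons.
        match s.split_once(':') {
            Some(("host", id)) if !id.is_empty() => Ok(NodeRef::Host(id.to_string())),
            Some(("proc", id)) if !id.is_empty() => Ok(NodeRef::Proc(id.to_string())),
            Some(("actor", id)) if !id.is_empty() => Ok(NodeRef::Actor(id.to_string())),
            _ => Err(format!("unrecognized reference: {s}")),
        }
    }
}

/// NI-1: the display form of an identity parses back to the same reference.
fn round_trips(identity: &NodeRef) -> bool {
    identity.to_string().parse::<NodeRef>().ok().as_ref() == Some(identity)
}
```

A property of this shape is cheap to assert in tests for every payload returned during a tree walk.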
§Link-classification invariants (LC-*)
These describe which nodes emit system_children and
stopped_children classification sets, and what those sets
contain.
- LC-1 (root system_children empty): Root payloads always emit system_children: vec![]. Root children are host nodes, which are not classified as system.
- LC-2 (host system_children empty): Host payloads always emit system_children: vec![]. Host children are procs, which are not classified as system; only actors carry the system classification.
- LC-3 (proc system_children subset): Proc payloads emit system_children ⊆ children, containing only NodeRef::Actor refs where cell.is_system() is true.
- LC-4 (proc stopped_children subset): Proc payloads emit stopped_children ⊆ children, containing only NodeRef::Actor refs for terminated actors retained for post-mortem inspection.
- LC-5 (actor/error no classification sets): Actor and Error payloads do not carry system_children or stopped_children.
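The LC rules lend themselves to a single checker. The sketch below uses a hypothetical `Payload` struct with string references and models "does not carry the field" as `None`; it checks the subset and emptiness conditions only, not that the classified refs are actor refs.

```rust
use std::collections::HashSet;

/// Node kinds that matter for link classification in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Root,
    Host,
    Proc,
    Actor,
    Error,
}

/// Hypothetical, simplified payload: references are plain strings.
struct Payload {
    kind: Kind,
    children: Vec<String>,
    /// `None` models "field not carried" (LC-5).
    system_children: Option<Vec<String>>,
    stopped_children: Option<Vec<String>>,
}

/// True if every element of `set` also occurs in `of`.
fn is_subset(set: &[String], of: &[String]) -> bool {
    let all: HashSet<&String> = of.iter().collect();
    set.iter().all(|r| all.contains(r))
}

fn satisfies_lc(p: &Payload) -> bool {
    match p.kind {
        // LC-1, LC-2: system_children is present and empty.
        Kind::Root | Kind::Host => matches!(&p.system_children, Some(s) if s.is_empty()),
        // LC-3, LC-4: both sets are present and are subsets of children.
        Kind::Proc => {
            p.system_children
                .as_deref()
                .map_or(false, |s| is_subset(s, &p.children))
                && p.stopped_children
                    .as_deref()
                    .map_or(false, |s| is_subset(s, &p.children))
        }
        // LC-5: neither set is carried.
        Kind::Actor | Kind::Error => {
            p.system_children.is_none() && p.stopped_children.is_none()
        }
    }
}
```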
§Proc-resolution invariants (SP-*)
When a proc reference is resolved, the returned NodePayload
satisfies:
- SP-1 (identity): The identity matches the ProcId reference from the parent’s children list.
- SP-2 (properties): The properties are NodeProperties::Proc.
- SP-3 (parent): The parent is NodeRef::Host(actor_id).
- SP-4 (as_of): The as_of field is present and valid (internally SystemTime; serialized as ISO 8601 string over the HTTP JSON API per HB-1).
Enforced by test_system_proc_identity.
§Proc-agent invariants (PA-*)
- PA-1 (live children): Proc-node children used by admin/TUI must be derived from live proc state at query time. No additional publish event is required for a newly spawned actor to appear.
Enforced by test_proc_children_reflect_directly_spawned_actors.
§Robustness invariant (MA-R1)
- MA-R1 (no-crash): MeshAdminAgent must never crash the OS process it resides in. Every handler catches errors and converts them into structured error payloads (ResolveReferenceResponse(Err(..)), NodeProperties::Error, etc.) rather than propagating panics or unwinding. Failed reply sends (the caller went away) are silently swallowed.
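The handler shape that MA-R1 describes can be sketched with a standard-library channel standing in for the reply port. Everything here is illustrative: `ResolveReply`, `resolve`, and `handle_resolve` are hypothetical names, and a real handler would reply through the actor framework's port types.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for the resolve reply: `Ok` carries a payload
/// summary, `Err` carries a structured error message.
type ResolveReply = Result<String, String>;

/// Fallible resolution step (illustrative only).
fn resolve(reference: &str) -> Result<String, String> {
    if reference.is_empty() {
        Err("empty reference".to_string())
    } else {
        Ok(format!("payload for {reference}"))
    }
}

/// Handler in the MA-R1 style: errors become reply values, and a failed
/// send (receiver dropped) is ignored rather than unwrapped.
fn handle_resolve(reference: &str, reply: &Sender<ResolveReply>) {
    let result = resolve(reference);
    // The caller may have gone away; that must not take the agent down.
    let _ = reply.send(result);
}
```

The key point is that neither the resolution error nor the send error reaches an `unwrap` or `expect`.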
§TLS transport invariants (MA-T*)
- MA-T1 (tls): At Meta (fbcode_build), the admin HTTP server requires mutual TLS. At startup it probes for certificates via try_tls_acceptor with client cert enforcement enabled. If no usable certificate bundle is found, init() returns an error; there is no plain HTTP fallback. In OSS, TLS is best-effort with plain HTTP fallback.
- MA-T2 (scheme-in-url): The URL returned by GetAdminAddr is always https://host:port or http://host:port, never a bare host:port. All callers receive and use this full URL directly.
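The two TLS rules reduce to a small decision table plus a formatting step. The sketch below is illustrative only: `select_tls` and `admin_url` are hypothetical helpers, and certificate probing is reduced to a boolean.

```rust
/// MA-T1 policy sketch: returns Ok(true) to serve TLS, Ok(false) to serve
/// plain HTTP, or an error when TLS is required but no certificates exist.
/// `require_tls` models the fbcode_build case; OSS passes `false`.
fn select_tls(require_tls: bool, certs_found: bool) -> Result<bool, String> {
    match (certs_found, require_tls) {
        (true, _) => Ok(true),
        (false, true) => Err("no usable certificate bundle found".to_string()),
        (false, false) => Ok(false),
    }
}

/// MA-T2 sketch: the admin address always carries a scheme.
fn admin_url(host: &str, port: u16, tls: bool) -> String {
    let scheme = if tls { "https" } else { "http" };
    format!("{scheme}://{host}:{port}")
}
```

Returning the full URL from one place means callers never have to guess the scheme from context.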
§Client host invariants (CH-*)
Let A denote the aggregated host set (the union of hosts
from all meshes passed to [host_mesh::spawn_admin],
deduplicated by HostAgent ActorId — see SA-3), and let
C denote the process-global singleton client host mesh in
the caller process (whose local proc hosts the root client
actor).
- CH-1 (deduplication): When C ∈ A, the client host appears exactly once in the admin host list (deduplicated by HostAgent ActorId identity). When C ∉ A, spawn_admin includes C alongside A’s hosts so the admin introspects C as a normal host subtree, not as a standalone proc.
- CH-2 (reachability): In both cases, the root client actor is reachable through the standard host → proc → actor walk.
- CH-3 (ordering): C must be initialized before spawn_admin executes. In Rust, calling context() / this_host() / this_proc() triggers GLOBAL_CONTEXT bootstrap, which initializes C. In Python, bootstrap_host() calls register_client_host() before any actor code runs. Either path ensures C is available by the time spawn_admin reads it via try_this_host(). Any refactor must preserve this ordering.
- CH-4 (runtime-agnostic client-host discovery): spawn_admin discovers C via try_this_host(), which checks two sources in order: the Rust GLOBAL_CONTEXT (initialized via context() / this_host() / this_proc()) and the externally registered client host (set by register_client_host() from Python’s bootstrap_host()). Aggregation logic must not branch on which source provided C.
Mechanism: [host_mesh::spawn_admin] aggregates hosts from
all input meshes (SA-3), reads C from the caller process (via
try_this_host()), merges it with the aggregated set (SA-6),
deduplicates by HostAgent ActorId, and spawns the
MeshAdminAgent on the caller’s local proc via
cx.instance().proc().spawn(...). Placement now follows the
caller context rather than mesh topology.
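The two-source discovery in CH-4 amounts to an ordered fallback. The sketch below is a simplified model, not the real `try_this_host()`: `ClientHost` and `discover_client_host` are hypothetical, and both sources are passed in as plain `Option` values instead of being read from process-global state.

```rust
/// Hypothetical stand-in for a client host handle.
#[derive(Debug, Clone, PartialEq)]
struct ClientHost(String);

/// Sketch of two-source discovery: the first source wins when both are set,
/// and the return type gives the caller no way to tell which source
/// supplied the value, so later logic cannot branch on it.
fn discover_client_host(
    global_context: Option<ClientHost>,
    registered: Option<ClientHost>,
) -> Option<ClientHost> {
    global_context.or(registered)
}
```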
§Spawn/aggregation invariants (SA-*)
[host_mesh::spawn_admin] aggregates hosts from one or more
meshes into a single admin host set.
- SA-1 (non-empty mesh set): The input must yield at least one mesh.
- SA-2 (non-empty hosts): Every input mesh must contain at least one host.
- SA-3 (host-agent identity dedup): The admin host set is the ordered union of host agents from all input meshes, deduplicated by HostAgent ActorId in first-seen order.
- SA-4 (single-mesh degeneracy): spawn_admin([mesh], ...) is behaviorally equivalent to the former mesh.spawn_admin(...). Established by existing single-mesh integration tests (e.g. dining_philosophers); no dedicated unit test.
- SA-5 (caller-local placement): The admin is spawned on the caller’s local proc, that is, the Proc of the actor context passed to spawn_admin(). In common remote launch flows, the caller is typically the root client/control process.
- SA-6 (client-host merge after aggregation): Client-host inclusion/dedup (CH-1) operates on the already-aggregated host set, not per-mesh independently.
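The aggregation rules (SA-1 through SA-3) and the client-host merge (SA-6, CH-1) can be sketched together. This is a simplified model under stated assumptions: `HostAgentId` is a hypothetical string alias for the `HostAgent` `ActorId`, `aggregate_hosts` is not the crate's function, and appending the client host at the end of the list is an assumption about ordering that the invariants do not specify.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a `HostAgent` `ActorId`.
type HostAgentId = String;

/// Aggregate host agents from several meshes, then merge the client host.
fn aggregate_hosts(
    meshes: &[Vec<HostAgentId>],
    client_host: Option<HostAgentId>,
) -> Result<Vec<HostAgentId>, String> {
    // SA-1: at least one mesh.
    if meshes.is_empty() {
        return Err("at least one mesh is required".to_string());
    }
    let mut seen = HashSet::new();
    let mut hosts = Vec::new();
    for (i, mesh) in meshes.iter().enumerate() {
        // SA-2: every mesh has at least one host.
        if mesh.is_empty() {
            return Err(format!("mesh {i} has no hosts"));
        }
        // SA-3: ordered union, deduplicated in first-seen order.
        for id in mesh {
            if seen.insert(id.clone()) {
                hosts.push(id.clone());
            }
        }
    }
    // SA-6 / CH-1: merge the client host into the aggregated set,
    // adding it only if it is not already present.
    if let Some(c) = client_host {
        if seen.insert(c.clone()) {
            hosts.push(c);
        }
    }
    Ok(hosts)
}
```

Running the client-host merge after the loop, rather than inside it, is what keeps the dedup independent of how hosts are split across meshes.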
§MAST resolution (disabled)
mast_conda:/// resolution is disabled. The old topology-based
resolution assumed the admin lived on the first mesh head host,
which is no longer true after SA-5 changed to caller-local
placement. All resolution paths now return explicit errors.
A publication-based discovery mechanism will replace this in a
future change. Until then, discover the admin URL from
startup output or another launch-time publication.
§Admin self-identification invariants (AI-*)
- AI-1 (live identity): GET /v1/admin returns the live admin actor identity as AdminInfo.
- AI-2 (reported proc): proc_id reports the hosting proc. Placement equality (SA-5) is proved by unit tests; integration tests validate that proc_id is populated and well-formed.
- AI-3 (url consistency): url matches GetAdminAddr.
The relationship between host and url (formerly AI-4) is
now a constructor guarantee of AdminInfo::new rather than a
live invariant. It is not in this registry.
Structs§
- AdminInfo: Self-identification payload returned by GET /v1/admin.
- ApiError: Structured error response following the gateway RFC envelope pattern.
- ApiErrorEnvelope: Wrapper for the structured error envelope.
- MeshAdminAddrResponse: Response payload for MeshAdminMessage::GetAdminAddr.
- MeshAdminAgent: Actor that serves a mesh-level admin HTTP endpoint.
- PyspyDumpAndStoreResponse: Response body from POST /v1/pyspy_dump/{*proc_reference}.
- QueryRequest: Request body for POST /v1/query.
- QueryResponse: Response body from POST /v1/query.
- ResolveReferenceResponse: Newtype wrapper around Result<NodePayload, String> for the resolve reply port (OncePortRef requires Named).
Enums§
- AdminHandle: A handle for locating the mesh admin server.
- MeshAdminMessage: Messages handled by the MeshAdminAgent.
- PublishedHandle: A handle scheme that requires a publication-based lookup to resolve to a concrete admin URL.
- ResolveReferenceMessage: Message for resolving a reference (string from HTTP path) into a NodePayload.
Constants§
- MESH_ADMIN_ACTOR_NAME: Actor name used when spawning the mesh admin agent.
- MESH_ADMIN_BRIDGE_NAME: Actor name for the HTTP bridge client mailbox on the service proc.
Traits§
- MeshAdminMessageClient: The custom client trait for this message type.
- MeshAdminMessageHandler: The custom handler trait for this message type.
- ResolveReferenceMessageClient: The custom client trait for this message type.
- ResolveReferenceMessageHandler: The custom handler trait for this message type.
Functions§
- build_openapi_spec: Build the OpenAPI 3.1 spec, embedding schemars-derived JSON Schemas into components/schemas.
- resolve_mast_handle: Resolve a mast_conda:///<job-name> handle into an admin base URL.