Module mesh_admin

Mesh-level admin surface for topology introspection and reference walking.

This module defines MeshAdminAgent, an actor that exposes a uniform, reference-based HTTP API over an entire host mesh. Every addressable entity in the mesh is represented as a NodePayload and resolved via typed NodeRef references (parsed from HTTP path strings at the request boundary).

Incoming HTTP requests are bridged into the actor message loop using ResolveReferenceMessage, ensuring that all topology resolution and data collection happen through actor messaging. The agent fans out to HostAgent instances to fetch host, proc, and actor details, then normalizes them into a single tree-shaped model (NodeProperties + children references) suitable for topology-agnostic clients such as the admin TUI.

§Schema strategy

The external API contract is schema-first: the JSON Schema (Draft 2020-12) served at GET /v1/schema is the authoritative definition of the response shape. The error envelope schema is at GET /v1/schema/error.

Schema and OpenAPI are derived from the HTTP boundary DTO types in crate::introspect::dto (NodePayloadDto, NodePropertiesDto, FailureInfoDto) via schemars::JsonSchema. The domain types (NodePayload, NodeProperties, FailureInfo) do not carry JsonSchema — they own the typed internal model; the DTOs own the wire contract.

This follows the “Admin Gateway Pattern” RFC: schema is the product; transports and tooling are projections.

§Schema generation pipeline

  1. #[derive(JsonSchema)] on NodePayloadDto, NodePropertiesDto, FailureInfoDto, ApiError, ApiErrorEnvelope.
  2. schemars::schema_for!(T) produces a Schema value at runtime (Draft 2020-12).
  3. The serve_schema / serve_error_schema handlers inject a $id field (SC-4) and serve the result as JSON.
  4. Snapshot tests in introspect::tests compare the raw schemars output (without $id) against checked-in golden files to detect drift (SC-2).
  5. Validation tests construct domain payloads, convert to DTOs, and confirm the serialized DTOs pass schema validation (SC-3).
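Steps 3 and 4 of the pipeline can be sketched with the standard library alone. This is a simplified model, not the module's implementation: the schema is represented as a flat map of top-level keys to raw JSON text (the real handlers work on a schemars Schema value), and the names SchemaDoc, with_schema_id, and snapshot_drift are hypothetical.

```rust
use std::collections::BTreeMap;

/// Stand-in for a JSON Schema document: top-level keys mapped to raw JSON text.
/// Assumption: the real handlers operate on a schemars `Schema` value.
type SchemaDoc = BTreeMap<String, String>;

/// Step 3: return a copy of the raw schema with a `$id` field injected for serving.
fn with_schema_id(raw: &SchemaDoc, id: &str) -> SchemaDoc {
    let mut served = raw.clone();
    served.insert("$id".to_string(), format!("\"{id}\""));
    served
}

/// Step 4: compare the raw output (no `$id`) against the golden snapshot and
/// list the keys that differ, so that any drift is explicit.
fn snapshot_drift(raw: &SchemaDoc, golden: &SchemaDoc) -> Vec<String> {
    let mut changed: Vec<String> = raw
        .keys()
        .chain(golden.keys())
        .filter(|k| raw.get(*k) != golden.get(*k))
        .cloned()
        .collect();
    changed.sort();
    changed.dedup();
    changed
}
```

Because the snapshot comparison runs on the raw output, injecting $id at serve time never shows up as drift against the golden files.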

§Regenerating snapshots

After intentional changes to the DTO types (NodePayloadDto, NodePropertiesDto, FailureInfoDto), ApiError, or ApiErrorEnvelope, regenerate the golden files:

buck run fbcode//monarch/hyperactor_mesh:generate_api_artifacts \
  @fbcode//mode/dev-nosan -- \
  fbcode/monarch/hyperactor_mesh/src/testdata

Or via cargo:

cargo run -p hyperactor_mesh --bin generate_api_artifacts -- \
  hyperactor_mesh/src/testdata

Then re-run tests to confirm the new snapshot passes.

§Schema invariants (SC-*)

  • SC-1 (schema-derived): Schema is derived from the DTO types via schemars::JsonSchema, not hand-written.
  • SC-2 (schema-snapshot-stability): Schema changes must be explicit — a snapshot test catches unintentional drift.
  • SC-3 (schema-payload-conformance): Domain payloads converted to DTOs validate against the generated schema.
  • SC-4 (schema-version-identity): Served schemas carry a $id tied to the API version (e.g. https://monarch.meta.com/schemas/v1/node_payload).
  • SC-5 (route-precedence): Literal schema routes are matched by specificity before the {*reference} wildcard (axum 0.8 specificity-based routing).

Note on ApiError.details: the derived schema is maximally permissive for details (any valid JSON). This is intentional for v1 — details is a domain-specific escape hatch. Consumers must not assume a fixed shape.
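The route precedence described in SC-5 can be illustrated with a hand-written matcher. In the real server axum performs this specificity-based matching itself; the Route enum and route function below are hypothetical and only model the outcome: literal schema paths win, and everything else under /v1/ falls through to the wildcard.

```rust
/// Which handler a request path resolves to in this sketch.
#[derive(Debug, PartialEq)]
enum Route {
    Schema,
    ErrorSchema,
    /// Wildcard match carrying the captured reference string.
    Reference(String),
    NotFound,
}

/// Match the literal schema routes first, then fall back to the
/// `/v1/{*reference}` wildcard.
fn route(path: &str) -> Route {
    match path {
        "/v1/schema" => Route::Schema,
        "/v1/schema/error" => Route::ErrorSchema,
        _ => match path.strip_prefix("/v1/") {
            Some(rest) if !rest.is_empty() => Route::Reference(rest.to_string()),
            _ => Route::NotFound,
        },
    }
}
```

Without this precedence, a request for /v1/schema would be captured by the wildcard and treated as a reference named "schema".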

§Introspection visibility policy

Admin tooling only displays introspectable nodes: entities that are reachable via actor messaging and respond to IntrospectMessage. Infrastructure procs that are non-routable are intentionally opaque to introspection and are omitted from the navigation graph.

§Definitions

Routable — an entity is routable if the system can address it via the routing layer and successfully deliver a message to it using a Reference / ActorId (i.e., there exists a live mailbox sender reachable through normal routing). Practical test: “can I send IntrospectMessage::Query to it and get a reply?”

Non-routable — an entity is non-routable if it has no externally reachable mailbox sender in the routing layer, so message delivery is impossible by construction (even if you know its name). Examples: hyperactor_runtime[0], mailbox_server[N], local[N] — these use PanickingMailboxSender and are never bound to the router.

Introspectable — tooling can obtain a NodePayload for this node by sending IntrospectMessage to a routable actor.

Opaque — the node exists but is not introspectable via messaging; tooling cannot observe it through the introspection protocol.

§Proc visibility

A proc is not directly introspected; actors are. Tooling synthesizes proc-level nodes by grouping introspectable actors by ProcId.

A proc is visible iff there exists at least one actor on that proc whose ActorId is deliverable via the routing layer (i.e., the actor has a bound mailbox sender reachable through normal routing) and responds to IntrospectMessage.

The rule is: if an entity is routable via the mesh routing layer (i.e., tooling can deliver IntrospectMessage::Query to one of its actors), then it is introspectable and appears in the admin graph.
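The proc visibility rule amounts to a group-by over introspectable actors. The sketch below assumes a flattened ActorView record with string ids; both the record and the visible_procs function are hypothetical and stand in for the typed ProcId / ActorId model.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened view of an actor for this sketch.
struct ActorView {
    proc_id: String,
    name: String,
    /// True if the actor has a bound mailbox sender reachable through normal
    /// routing and responds to `IntrospectMessage`.
    introspectable: bool,
}

/// Group introspectable actors by proc. A proc appears in the result only if
/// at least one of its actors is introspectable; all other procs stay opaque.
fn visible_procs(actors: &[ActorView]) -> BTreeMap<String, Vec<String>> {
    let mut procs: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for a in actors.iter().filter(|a| a.introspectable) {
        procs.entry(a.proc_id.clone()).or_default().push(a.name.clone());
    }
    procs
}
```

A proc whose only actors are non-routable never gets a map entry, which matches the rule that such procs are omitted from the navigation graph.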

Every NodePayload in the topology tree satisfies the following navigation invariants (NI-*):

  • NI-1 (identity = reference): A node’s identity: NodeRef must correspond to the reference used to resolve it. The display form of identity round-trips through NodeRef::from_str.

  • NI-2 (parent = containment parent): A node’s parent: Option<NodeRef> records its canonical containment parent, not the inverse of every navigation edge. Specifically: root → None, host → Root, proc → Host(…), actor → Proc(…). An actor’s parent is always its owning proc, even when the actor also appears as a child of another actor via supervision.

  • NI-3 (children = navigation graph): A node’s children is the admin navigation graph. Actor-to-actor supervision links coexist with proc→actor membership links without changing parent. The same actor may therefore appear in children of both its proc and its supervising actor.

Together these ensure that the TUI can correlate responses to tree nodes, and that upward/downward navigation is consistent.
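NI-1 and NI-2 can be illustrated with a simplified reference type. The enum below is a stand-in for NodeRef: the real type wraps typed ids, and the kind:id string form is an assumption made for this sketch, not the module's actual path syntax. The parent_kind helper is likewise hypothetical.

```rust
use std::fmt;
use std::str::FromStr;

/// Simplified stand-in for `NodeRef` (assumed `kind:id` string form).
#[derive(Debug, Clone, PartialEq)]
enum NodeRef {
    Root,
    Host(String),
    Proc(String),
    Actor(String),
}

impl fmt::Display for NodeRef {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            NodeRef::Root => write!(f, "root"),
            NodeRef::Host(id) => write!(f, "host:{id}"),
            NodeRef::Proc(id) => write!(f, "proc:{id}"),
            NodeRef::Actor(id) => write!(f, "actor:{id}"),
        }
    }
}

impl FromStr for NodeRef {
    type Err = String;

    /// Parse the display form back into a reference (NI-1 round-trip).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s == "root" {
            return Ok(NodeRef::Root);
        }
        match s.split_once(':') {
            Some(("host", id)) if !id.is_empty() => Ok(NodeRef::Host(id.to_string())),
            Some(("proc", id)) if !id.is_empty() => Ok(NodeRef::Proc(id.to_string())),
            Some(("actor", id)) if !id.is_empty() => Ok(NodeRef::Actor(id.to_string())),
            _ => Err(format!("unrecognized reference: {s}")),
        }
    }
}

/// NI-2: the kind of the containment parent is fixed by the node's own kind,
/// regardless of any supervision edges that also point at the node.
fn parent_kind(node: &NodeRef) -> Option<&'static str> {
    match node {
        NodeRef::Root => None,
        NodeRef::Host(_) => Some("root"),
        NodeRef::Proc(_) => Some("host"),
        NodeRef::Actor(_) => Some("proc"),
    }
}
```

A client such as the TUI can rely on the round-trip to match a response back to the tree node it requested, since the identity in the payload parses to the same reference that was sent.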

The classification invariants (LC-*) below describe which nodes emit the system_children and stopped_children classification sets, and what those sets contain.

  • LC-1 (root system_children empty): Root payloads always emit system_children: vec![]. Root children are host nodes, which are not classified as system.

  • LC-2 (host system_children empty): Host payloads always emit system_children: vec![]. Host children are procs, which are not classified as system — only actors carry the system classification.

  • LC-3 (proc system_children subset): Proc payloads emit system_children ⊆ children, containing only NodeRef::Actor refs where cell.is_system() is true.

  • LC-4 (proc stopped_children subset): Proc payloads emit stopped_children ⊆ children, containing only NodeRef::Actor refs for terminated actors retained for post-mortem inspection.

  • LC-5 (actor/error no classification sets): Actor and Error payloads do not carry system_children or stopped_children.
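The subset relationships in LC-3 and LC-4 can be sketched as follows. The ActorEntry and ProcChildren types and the classify function are hypothetical; references are plain strings here rather than NodeRef::Actor values, and the is_system flag stands in for cell.is_system().

```rust
/// Hypothetical per-actor facts a proc knows at query time.
struct ActorEntry {
    reference: String,
    is_system: bool,
    is_stopped: bool,
}

/// Children and classification sets of a proc payload in this sketch.
struct ProcChildren {
    children: Vec<String>,
    system_children: Vec<String>,
    stopped_children: Vec<String>,
}

/// Collect the references of all actors that satisfy `keep`.
fn refs_where(actors: &[ActorEntry], keep: impl Fn(&ActorEntry) -> bool) -> Vec<String> {
    actors
        .iter()
        .filter(|a| keep(*a))
        .map(|a| a.reference.clone())
        .collect()
}

/// Derive all three lists from the same actor list, so both classification
/// sets are subsets of `children` by construction.
fn classify(actors: &[ActorEntry]) -> ProcChildren {
    ProcChildren {
        children: refs_where(actors, |_| true),
        system_children: refs_where(actors, |a| a.is_system),
        stopped_children: refs_where(actors, |a| a.is_stopped),
    }
}

/// Check that every element of `subset` also occurs in `of`.
fn is_subset(subset: &[String], of: &[String]) -> bool {
    subset.iter().all(|r| of.contains(r))
}
```

Deriving the sets by filtering the children list is one way to make LC-3 and LC-4 hold without a separate consistency check.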

§Proc-resolution invariants (SP-*)

When a proc reference is resolved, the returned NodePayload satisfies:

  • SP-1 (identity): The identity matches the ProcId reference from the parent’s children list.
  • SP-2 (properties): The properties are NodeProperties::Proc.
  • SP-3 (parent): The parent is NodeRef::Host(actor_id).
  • SP-4 (as_of): The as_of field is present and valid (internally SystemTime; serialized as ISO 8601 string over the HTTP JSON API per HB-1).

Enforced by test_system_proc_identity.
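For SP-4, the conversion from SystemTime to an ISO 8601 string can be done with the standard library alone. The sketch below is an illustration under assumptions: it emits whole seconds in UTC with a trailing Z, whereas the exact format the HTTP layer produces (for example, fractional seconds) is not stated here. The function names are hypothetical.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Convert days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar, using 400-year eras that start on March 1.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format a timestamp as `YYYY-MM-DDTHH:MM:SSZ`. Returns `None` for times
/// before the Unix epoch.
fn to_iso8601(t: SystemTime) -> Option<String> {
    let secs = t.duration_since(UNIX_EPOCH).ok()?.as_secs() as i64;
    let (y, mo, d) = civil_from_days(secs.div_euclid(86_400));
    let rem = secs.rem_euclid(86_400);
    let (h, mi, s) = (rem / 3_600, rem % 3_600 / 60, rem % 60);
    Some(format!("{y:04}-{mo:02}-{d:02}T{h:02}:{mi:02}:{s:02}Z"))
}
```

Keeping SystemTime internally and converting only at the HTTP boundary means the domain model never has to parse its own timestamps.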

§Proc-agent invariants (PA-*)

  • PA-1 (live children): Proc-node children used by admin/TUI must be derived from live proc state at query time. No additional publish event is required for a newly spawned actor to appear.

Enforced by test_proc_children_reflect_directly_spawned_actors.

§Robustness invariant (MA-R1)

  • MA-R1 (no-crash): MeshAdminAgent must never crash the OS process it resides in. Every handler catches errors and converts them into structured error payloads (ResolveReferenceResponse(Err(..)), NodeProperties::Error, etc.) rather than propagating panics or unwinding. Failed reply sends (the caller went away) are silently swallowed.
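The handler shape implied by MA-R1 can be sketched with a standard channel in place of the actor reply port. This is a minimal model, not the agent's code: Reply stands in for ResolveReferenceResponse, and resolve and handle are hypothetical functions.

```rust
use std::sync::mpsc;

/// Simplified reply type: either a payload or a structured error string.
type Reply = Result<String, String>;

/// Hypothetical fallible lookup standing in for reference resolution.
fn resolve(reference: &str) -> Result<String, String> {
    if reference.is_empty() {
        Err("empty reference".to_string())
    } else {
        Ok(format!("payload for {reference}"))
    }
}

/// Handle one request without panicking: a resolution error becomes an `Err`
/// reply, and a failed send (the caller dropped its receiver) is ignored.
fn handle(reference: &str, reply_to: mpsc::Sender<Reply>) {
    let reply = resolve(reference).map_err(|e| format!("resolve failed: {e}"));
    // The caller may have gone away; that must not take the process down.
    let _ = reply_to.send(reply);
}
```

The key point is that the handler has no unwrap or expect on any path that depends on external state, so neither a bad reference nor a vanished caller can crash the hosting process.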

§TLS transport invariants (MA-T*)

  • MA-T1 (tls): At Meta (fbcode_build), the admin HTTP server requires mutual TLS. At startup it probes for certificates via try_tls_acceptor with client cert enforcement enabled. If no usable certificate bundle is found, init() returns an error — no plain HTTP fallback. In OSS, TLS is best-effort with plain HTTP fallback.

  • MA-T2 (scheme-in-url): The URL returned by GetAdminAddr is always https://host:port or http://host:port, never a bare host:port. All callers receive and use this full URL directly.
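MA-T2 is easy to state in code. The helpers below are hypothetical and only illustrate the invariant: the URL is always built with an explicit scheme chosen by whether TLS is active, and callers can check for a scheme before using an address.

```rust
/// Build the admin URL with an explicit scheme, never a bare `host:port`.
fn admin_url(tls: bool, host: &str, port: u16) -> String {
    let scheme = if tls { "https" } else { "http" };
    format!("{scheme}://{host}:{port}")
}

/// Check that a URL carries one of the two schemes allowed by MA-T2.
fn has_scheme(url: &str) -> bool {
    url.starts_with("https://") || url.starts_with("http://")
}
```

Since callers use the returned URL directly, they never have to guess whether the server fell back to plain HTTP in an OSS build.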

§Client host invariants (CH-*)

Let A denote the aggregated host set (the union of hosts from all meshes passed to [host_mesh::spawn_admin], deduplicated by HostAgent ActorId — see SA-3), and let C denote the process-global singleton client host mesh in the caller process (whose local proc hosts the root client actor).

  • CH-1 (deduplication): When C ∈ A, the client host appears exactly once in the admin host list (deduplicated by HostAgent ActorId identity). When C ∉ A, spawn_admin includes C alongside A’s hosts so the admin introspects C as a normal host subtree, not as a standalone proc.

  • CH-2 (reachability): In both cases, the root client actor is reachable through the standard host → proc → actor walk.

  • CH-3 (ordering): C must be initialized before spawn_admin executes. In Rust, calling context() / this_host() / this_proc() triggers GLOBAL_CONTEXT bootstrap, which initializes C. In Python, bootstrap_host() calls register_client_host() before any actor code runs. Either path ensures C is available by the time spawn_admin reads it via try_this_host(). Any refactor must preserve this ordering.

  • CH-4 (runtime-agnostic client-host discovery): spawn_admin discovers C via try_this_host(), which checks two sources in order: the Rust GLOBAL_CONTEXT (initialized via context() / this_host() / this_proc()) and the externally registered client host (set by register_client_host() from Python’s bootstrap_host()). Aggregation logic must not branch on which source provided C.

Mechanism: [host_mesh::spawn_admin] aggregates hosts from all input meshes (SA-3), reads C from the caller process (via try_this_host()), merges it with the aggregated set (SA-6), deduplicates by HostAgent ActorId, and spawns the MeshAdminAgent on the caller’s local proc via cx.instance().proc().spawn(...). Placement now follows the caller context rather than mesh topology.
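The discovery and merge steps of this mechanism can be sketched as two small functions. Host identities are plain strings here rather than HostAgent ActorId values, the two discovery sources are passed in as arguments instead of being read from process-global state, and appending C at the end of the list is an assumption of this sketch. Both function signatures are hypothetical.

```rust
/// CH-4: check the Rust global context first, then the externally registered
/// client host. The result does not record which source supplied it.
fn try_this_host(global_context: Option<&str>, registered: Option<&str>) -> Option<String> {
    global_context.or(registered).map(str::to_string)
}

/// CH-1: merge the client host into the already aggregated host set. If it is
/// present it stays exactly once; otherwise it is added as a normal host.
fn merge_client_host(mut hosts: Vec<String>, client: Option<String>) -> Vec<String> {
    if let Some(c) = client {
        if !hosts.contains(&c) {
            hosts.push(c);
        }
    }
    hosts
}
```

Because merge_client_host only sees the resolved host, the aggregation logic cannot branch on whether C came from the Rust or the Python bootstrap path.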

§Spawn/aggregation invariants (SA-*)

[host_mesh::spawn_admin] aggregates hosts from one or more meshes into a single admin host set.

  • SA-1 (non-empty mesh set): The input must yield at least one mesh.
  • SA-2 (non-empty hosts): Every input mesh must contain at least one host.
  • SA-3 (host-agent identity dedup): The admin host set is the ordered union of host agents from all input meshes, deduplicated by HostAgent ActorId in first-seen order.
  • SA-4 (single-mesh degeneracy): spawn_admin([mesh], ...) is behaviorally equivalent to the former mesh.spawn_admin(...). Established by existing single-mesh integration tests (e.g. dining_philosophers); no dedicated unit test.
  • SA-5 (caller-local placement): The admin is spawned on the caller’s local proc — the Proc of the actor context passed to spawn_admin(). In common remote launch flows, the caller is typically the root client/control process.
  • SA-6 (client-host merge after aggregation): Client-host inclusion/dedup (CH-1) operates on the already-aggregated host set, not per-mesh independently.
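SA-1 through SA-3 together describe a validated, order-preserving union. A minimal sketch, assuming each mesh is a list of host-agent ids given as strings (the real code deduplicates by HostAgent ActorId); the aggregate_hosts function and its error strings are hypothetical:

```rust
use std::collections::HashSet;

/// Aggregate host ids from all meshes into one list, deduplicated in
/// first-seen order. Rejects an empty mesh set (SA-1) and any empty mesh (SA-2).
fn aggregate_hosts(meshes: &[Vec<String>]) -> Result<Vec<String>, String> {
    if meshes.is_empty() {
        return Err("SA-1: at least one mesh is required".to_string());
    }
    let mut seen: HashSet<&str> = HashSet::new();
    let mut hosts = Vec::new();
    for (i, mesh) in meshes.iter().enumerate() {
        if mesh.is_empty() {
            return Err(format!("SA-2: mesh {i} has no hosts"));
        }
        for id in mesh {
            // `insert` returns false for ids already seen (SA-3).
            if seen.insert(id.as_str()) {
                hosts.push(id.clone());
            }
        }
    }
    Ok(hosts)
}
```

With a single input mesh that has no duplicate hosts, the result is that mesh's host list unchanged, which is consistent with the single-mesh degeneracy in SA-4.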

§MAST resolution (disabled)

mast_conda:/// resolution is disabled. The old topology-based resolution assumed the admin lived on the first mesh head host, which is no longer true after SA-5 changed to caller-local placement. All resolution paths now return explicit errors. A publication-based discovery mechanism will replace this in a future change. Until then, discover the admin URL from startup output or another launch-time publication.

§Admin self-identification invariants (AI-*)

  • AI-1 (live identity): GET /v1/admin returns the live admin actor identity as AdminInfo.
  • AI-2 (reported proc): proc_id reports the hosting proc. Placement equality (SA-5) is proved by unit tests; integration tests validate that proc_id is populated and well-formed.
  • AI-3 (url consistency): url matches GetAdminAddr.

The relationship between host and url (formerly AI-4) is now a constructor guarantee of AdminInfo::new rather than a live invariant. It is not in this registry.

Structs§

AdminInfo
Self-identification payload returned by GET /v1/admin.
ApiError
Structured error response following the gateway RFC envelope pattern.
ApiErrorEnvelope
Wrapper for the structured error envelope.
MeshAdminAddrResponse
Response payload for MeshAdminMessage::GetAdminAddr.
MeshAdminAgent
Actor that serves a mesh-level admin HTTP endpoint.
PyspyDumpAndStoreResponse
Response body from POST /v1/pyspy_dump/{*proc_reference}.
QueryRequest
Request body for POST /v1/query.
QueryResponse
Response body from POST /v1/query.
ResolveReferenceResponse
Newtype wrapper around Result<NodePayload, String> for the resolve reply port (OncePortRef requires Named).

Enums§

AdminHandle
A handle for locating the mesh admin server.
MeshAdminMessage
Messages handled by the MeshAdminAgent.
PublishedHandle
A handle scheme that requires a publication-based lookup to resolve to a concrete admin URL.
ResolveReferenceMessage
Message for resolving a reference (string from HTTP path) into a NodePayload.

Constants§

MESH_ADMIN_ACTOR_NAME
Actor name used when spawning the mesh admin agent.
MESH_ADMIN_BRIDGE_NAME
Actor name for the HTTP bridge client mailbox on the service proc.

Traits§

MeshAdminMessageClient
The custom client trait for this message type.
MeshAdminMessageHandler
The custom handler trait for this message type.
ResolveReferenceMessageClient
The custom client trait for this message type.
ResolveReferenceMessageHandler
The custom handler trait for this message type.

Functions§

build_openapi_spec
Build the OpenAPI 3.1 spec, embedding schemars-derived JSON Schemas into components/schemas.
resolve_mast_handle
Resolve a mast_conda:///<job-name> handle into an admin base URL.