Module mesh_admin

Mesh-level admin surface for topology introspection and reference walking.

This module defines MeshAdminAgent, an actor that exposes a uniform, reference-based HTTP API over an entire host mesh. Every addressable entity in the mesh is represented as a NodePayload and resolved via an opaque reference string.

Incoming HTTP requests are bridged into the actor message loop using ResolveReferenceMessage, ensuring that all topology resolution and data collection happens through actor messaging. The agent fans out to HostAgent instances to fetch host, proc, and actor details, then normalizes them into a single tree-shaped model (NodeProperties + children references) suitable for topology-agnostic clients such as the admin TUI.
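The bridging described above can be sketched with plain standard-library channels. This is only an illustration of the request/reply shape, not the actual hyperactor API: `Resolve`, `spawn_agent`, `resolve_via_agent`, and the two-field `NodePayload` are hypothetical stand-ins, and a thread stands in for the actor message loop.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical, heavily simplified stand-in for the real `NodePayload`.
#[derive(Debug, Clone, PartialEq)]
pub struct NodePayload {
    pub identity: String,
    pub children: Vec<String>,
}

/// Hypothetical stand-in for `ResolveReferenceMessage`: an opaque reference
/// plus a reply channel used exactly once.
pub struct Resolve {
    pub reference: String,
    pub reply: mpsc::Sender<Result<NodePayload, String>>,
}

/// Spawn a toy "agent" whose message loop owns all resolution logic.
pub fn spawn_agent() -> mpsc::Sender<Resolve> {
    let (tx, rx) = mpsc::channel::<Resolve>();
    thread::spawn(move || {
        for msg in rx {
            let result = if msg.reference == "root" {
                Ok(NodePayload {
                    identity: "root".to_string(),
                    children: vec!["host:a".to_string()],
                })
            } else {
                Err(format!("unknown reference: {}", msg.reference))
            };
            // The caller may have gone away; a failed send is ignored.
            let _ = msg.reply.send(result);
        }
    });
    tx
}

/// What an HTTP handler would do in this sketch: forward the reference into
/// the agent's loop and wait for the reply instead of resolving it inline.
pub fn resolve_via_agent(
    agent: &mpsc::Sender<Resolve>,
    reference: &str,
) -> Result<NodePayload, String> {
    let (reply, reply_rx) = mpsc::channel();
    agent
        .send(Resolve { reference: reference.to_string(), reply })
        .map_err(|_| "agent stopped".to_string())?;
    reply_rx
        .recv()
        .map_err(|_| "agent dropped the reply".to_string())?
}
```

Keeping resolution inside the loop means the HTTP layer never touches topology state directly; it only sends a message and awaits a result.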

§Schema strategy

The external API contract is schema-first: the JSON Schema (Draft 2020-12) served at GET /v1/schema is the authoritative definition of the response shape, derived directly from the Rust types (NodePayload, NodeProperties, FailureInfo) via schemars::JsonSchema. The error envelope schema is at GET /v1/schema/error.

This follows the “Admin Gateway Pattern” RFC (doc): schema is the product; transports and tooling are projections.

§Schema generation pipeline

  1. #[derive(JsonSchema)] on NodePayload, NodeProperties, FailureInfo, ApiError, ApiErrorEnvelope.
  2. schemars::schema_for!(T) produces a Schema value at runtime (Draft 2020-12).
  3. The serve_schema / serve_error_schema handlers inject a $id field (SC-4) and serve the result as JSON.
  4. Snapshot tests in introspect::tests compare the raw schemars output (without $id) against checked-in golden files to detect drift (SC-2).
  5. Validation tests confirm that real NodePayload samples pass schema validation (SC-3).

§Regenerating snapshots

After intentional type changes to NodePayload, NodeProperties, FailureInfo, ApiError, or ApiErrorEnvelope, regenerate the golden files:

buck run fbcode//monarch/hyperactor_mesh:generate_api_artifacts \
  @fbcode//mode/dev-nosan -- \
  fbcode/monarch/hyperactor_mesh/src/testdata

Or via cargo:

cargo run -p hyperactor_mesh --bin generate_api_artifacts -- \
  hyperactor_mesh/src/testdata

Then re-run tests to confirm the new snapshot passes.

§Schema invariants (SC-*)

  • SC-1 (schema-derived): Schema is derived from Rust types via schemars::JsonSchema, not hand-written.
  • SC-2 (schema-snapshot-stability): Schema changes must be explicit — a snapshot test catches unintentional drift.
  • SC-3 (schema-payload-conformance): Real NodePayload instances validate against the generated schema.
  • SC-4 (schema-version-identity): Served schemas carry a $id tied to the API version (e.g. https://monarch.meta.com/schemas/v1/node_payload).
  • SC-5 (route-precedence): Literal schema routes are matched by specificity before the {*reference} wildcard (axum 0.8 specificity-based routing).

Note on ApiError.details: the derived schema is maximally permissive for details (any valid JSON). This is intentional for v1 — details is a domain-specific escape hatch. Consumers must not assume a fixed shape.
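SC-4 and SC-5 can be illustrated with a small standard-library sketch. It does not use axum; it only mimics the precedence rule. The `Route` enum, both function names, and the assumption that the wildcard is mounted as `/v1/{*reference}` are hypothetical; the `$id` base URL is taken from the SC-4 example.

```rust
/// Which handler a path is dispatched to in this toy router.
#[derive(Debug, PartialEq)]
pub enum Route {
    Schema,
    ErrorSchema,
    Reference(String),
}

/// Build the `$id` for a served schema (SC-4), following the example URL
/// `https://monarch.meta.com/schemas/v1/node_payload`.
pub fn schema_id(version: &str, name: &str) -> String {
    format!("https://monarch.meta.com/schemas/{}/{}", version, name)
}

/// Literal schema routes win over the wildcard (SC-5). The `/v1/` prefix for
/// the wildcard is an assumption made for this sketch.
pub fn route(path: &str) -> Option<Route> {
    match path {
        "/v1/schema" => Some(Route::Schema),
        "/v1/schema/error" => Some(Route::ErrorSchema),
        _ => path
            .strip_prefix("/v1/")
            .filter(|rest| !rest.is_empty())
            .map(|rest| Route::Reference(rest.to_string())),
    }
}
```

A path such as `/v1/schema/other` still falls through to the wildcard, so only the exact literal routes are reserved.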

§Introspection visibility policy

Admin tooling only displays introspectable nodes: entities that are reachable via actor messaging and respond to IntrospectMessage. Infrastructure procs that are non-routable are intentionally opaque to introspection and are omitted from the navigation graph.

§Definitions

Routable — an entity is routable if the system can address it via the routing layer and successfully deliver a message to it using a Reference / ActorId (i.e., there exists a live mailbox sender reachable through normal routing). Practical test: “can I send IntrospectMessage::Query to it and get a reply?”

Non-routable — an entity is non-routable if it has no externally reachable mailbox sender in the routing layer, so message delivery is impossible by construction (even if you know its name). Examples: hyperactor_runtime[0], mailbox_server[N], local[N] — these use PanickingMailboxSender and are never bound to the router.

Introspectable — tooling can obtain a NodePayload for this node by sending IntrospectMessage to a routable actor.

Opaque — the node exists but is not introspectable via messaging; tooling cannot observe it through the introspection protocol.

§Proc visibility

A proc is not directly introspected; actors are. Tooling synthesizes proc-level nodes by grouping introspectable actors by ProcId.

A proc is visible iff there exists at least one actor on that proc whose ActorId is deliverable via the routing layer (i.e., the actor has a bound mailbox sender reachable through normal routing) and responds to IntrospectMessage.

The rule is: if an entity is routable via the mesh routing layer (i.e., tooling can deliver IntrospectMessage::Query to one of its actors), then it is introspectable and appears in the admin graph.
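The proc-visibility rule amounts to a filter followed by a group-by. The sketch below assumes a hypothetical `ActorInfo` summary with two boolean flags; in the real system those facts come from attempting delivery of `IntrospectMessage`, not from stored fields.

```rust
use std::collections::BTreeMap;

/// Hypothetical summary of one actor as seen by tooling.
pub struct ActorInfo {
    pub proc_id: String,
    pub actor_id: String,
    /// True if the actor has a bound mailbox sender reachable via routing.
    pub routable: bool,
    /// True if the actor answers `IntrospectMessage`.
    pub responds_to_introspect: bool,
}

/// Synthesize proc-level nodes by grouping introspectable actors by proc.
/// A proc with no introspectable actor never gets an entry, so it stays
/// opaque to the admin graph.
pub fn visible_procs(actors: &[ActorInfo]) -> BTreeMap<String, Vec<String>> {
    let mut procs: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for actor in actors
        .iter()
        .filter(|a| a.routable && a.responds_to_introspect)
    {
        procs
            .entry(actor.proc_id.clone())
            .or_default()
            .push(actor.actor_id.clone());
    }
    procs
}
```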

Every NodePayload in the topology tree satisfies:

  • NI-1 (identity = reference): A node’s identity field must equal the reference string used to resolve it. If the TUI asks for reference R, payload.identity == R.

  • NI-2 (parent coherence): A node’s parent field must equal the identity of the node it appears under. If node P lists R in its children, then R.parent == Some(P.identity).

Together these ensure that the TUI can correlate responses to tree nodes, and that upward/downward navigation is consistent.
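A client can verify NI-1 and NI-2 mechanically. The following is a minimal sketch over a hypothetical three-field `NodePayload`; the real type carries more data, and the function names are illustrative only.

```rust
/// Hypothetical, simplified stand-in for the real `NodePayload`.
#[derive(Debug, Clone)]
pub struct NodePayload {
    pub identity: String,
    pub parent: Option<String>,
    pub children: Vec<String>,
}

/// NI-1: the payload resolved for reference `R` must report identity `R`.
pub fn check_ni1(reference: &str, payload: &NodePayload) -> Result<(), String> {
    if payload.identity == reference {
        Ok(())
    } else {
        Err(format!(
            "NI-1 violated: asked for {}, got {}",
            reference, payload.identity
        ))
    }
}

/// NI-2: if `parent` lists `child` among its children, then the child's
/// `parent` field must equal the parent's identity.
pub fn check_ni2(parent: &NodePayload, child: &NodePayload) -> Result<(), String> {
    if !parent.children.contains(&child.identity) {
        return Err(format!(
            "{} is not listed under {}",
            child.identity, parent.identity
        ));
    }
    if child.parent.as_deref() != Some(parent.identity.as_str()) {
        return Err(format!(
            "NI-2 violated: {} reports parent {:?}, expected {}",
            child.identity, child.parent, parent.identity
        ));
    }
    Ok(())
}
```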

§Proc-resolution invariants (SP-*)

When a proc reference is resolved, the returned NodePayload satisfies:

  • SP-1 (identity): The identity matches the ProcId reference from the parent’s children list.
  • SP-2 (properties): The properties are NodeProperties::Proc.
  • SP-3 (parent): The parent is set to the HostId format ("host:<actor_id>").
  • SP-4 (as_of): The as_of field is present and non-empty.

Enforced by test_system_proc_identity.
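The four SP-* checks can be written as one small function that reports which invariants a payload violates. This is a sketch with hypothetical, simplified types; it is not the body of `test_system_proc_identity`, and only the variant names `Proc` and `Error` are taken from the documentation.

```rust
/// Simplified stand-in for the real `NodeProperties` enum.
#[derive(Debug, PartialEq)]
pub enum NodeProperties {
    Proc,
    Error,
}

/// Hypothetical, simplified stand-in for the real `NodePayload`.
pub struct NodePayload {
    pub identity: String,
    pub properties: NodeProperties,
    pub parent: Option<String>,
    pub as_of: String,
}

/// Return the labels of the SP-* invariants that `payload` violates when it
/// was resolved from `proc_ref`. An empty result means all four hold.
pub fn violated_sp_invariants(proc_ref: &str, payload: &NodePayload) -> Vec<&'static str> {
    let mut violated = Vec::new();
    if payload.identity != proc_ref {
        violated.push("SP-1");
    }
    if payload.properties != NodeProperties::Proc {
        violated.push("SP-2");
    }
    // SP-3: the parent must use the "host:<actor_id>" format.
    let parent_is_host = payload
        .parent
        .as_deref()
        .map_or(false, |p| p.starts_with("host:"));
    if !parent_is_host {
        violated.push("SP-3");
    }
    if payload.as_of.is_empty() {
        violated.push("SP-4");
    }
    violated
}
```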

§Proc-agent invariants (PA-*)

  • PA-1 (live children): Proc-node children used by admin/TUI must be derived from live proc state at query time. No additional publish event is required for a newly spawned actor to appear.

Enforced by test_proc_children_reflect_directly_spawned_actors.

§Robustness invariant (MA-R1)

  • MA-R1 (no-crash): MeshAdminAgent must never crash the OS process it resides in. Every handler catches errors and converts them into structured error payloads (ResolveReferenceResponse(Err(..)), NodeProperties::Error, etc.) rather than propagating panics or unwinding. Failed reply sends (the caller went away) are silently swallowed.
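The error-handling half of MA-R1 can be sketched as follows. The `Reply` enum and `reply_with` are hypothetical; the real handlers reply with types such as `ResolveReferenceResponse`, and this sketch uses a standard-library channel in place of a reply port.

```rust
use std::sync::mpsc;

/// Hypothetical structured reply: either a value or an error payload.
#[derive(Debug, PartialEq)]
pub enum Reply {
    Ok(String),
    Error(String),
}

/// Run `work`, convert any error into a structured payload, and send the
/// outcome. If the caller has gone away the send fails; that failure is
/// deliberately ignored rather than propagated.
pub fn reply_with<F>(work: F, reply: &mpsc::Sender<Reply>)
where
    F: FnOnce() -> Result<String, String>,
{
    let outcome = match work() {
        Ok(value) => Reply::Ok(value),
        Err(message) => Reply::Error(message),
    };
    let _ = reply.send(outcome);
}
```

Because neither branch uses `unwrap` or `?`, no failure in `work` or in the reply path can escape the handler in this sketch.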

§TLS transport invariants (MA-T*)

  • MA-T1 (tls): At Meta (fbcode_build), the admin HTTP server requires mutual TLS. At startup it probes for certificates via try_tls_acceptor with client cert enforcement enabled. If no usable certificate bundle is found, init() returns an error — no plain HTTP fallback. In OSS, TLS is best-effort with plain HTTP fallback.

  • MA-T2 (scheme-in-url): The URL returned by GetAdminAddr is always https://host:port or http://host:port, never a bare host:port. All callers receive and use this full URL directly.

§Client host invariants (CH-*)

Let A denote the observed host mesh (the host mesh for which this MeshAdminAgent was spawned), and let C denote the process-global singleton client host mesh in the caller process (whose local proc hosts the root client actor).

  • CH-1 (deduplication): When C ∈ A, the client host appears exactly once in the admin host list (deduplicated by HostAgent ActorId identity). When C ∉ A, spawn_admin includes C alongside A’s hosts so the admin introspects C as a normal host subtree, not as a standalone proc.

  • CH-2 (reachability): In both cases, the root client actor is reachable through the standard host → proc → actor walk.

  • CH-3 (ordering): spawn_admin requires cx: &impl context::Actor (the caller’s root client instance). Constructing that instance initializes C. Therefore C is available when spawn_admin executes. Any refactor must preserve this ordering.

Mechanism: HostMeshRef::spawn_admin reads C from the caller process (via try_this_host()), merges it with A’s host list, deduplicates by HostAgent ActorId, and sends the merged list in SpawnMeshAdmin. This works for same-process and cross-process setups because the merge and deduplication happen in the caller process before the spawn request is sent.
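The merge step behind CH-1 can be sketched as an order-preserving deduplication keyed on the agent's actor id. `HostEntry` and `merge_hosts` are hypothetical names for illustration; the real code works with host mesh references rather than plain strings.

```rust
use std::collections::HashSet;

/// Hypothetical host record keyed by its HostAgent actor id.
#[derive(Debug, Clone, PartialEq)]
pub struct HostEntry {
    pub agent_actor_id: String,
    pub addr: String,
}

/// Merge the observed mesh's hosts (A) with the optional client host (C),
/// keeping the first occurrence of each agent actor id.
pub fn merge_hosts(observed: &[HostEntry], client: Option<&HostEntry>) -> Vec<HostEntry> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for host in observed.iter().chain(client) {
        // `insert` returns false when the id was already present.
        if seen.insert(host.agent_actor_id.clone()) {
            merged.push(host.clone());
        }
    }
    merged
}
```

When C is already in A the list length is unchanged; when it is not, C is appended as one more host.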

§MAST resolution invariants (MC-*)

CLI-based mast_conda:/// resolution (OSS-compatible fallback):

  • MC-1 (cli-contract): mast get-status --json <job> must exit 0 and produce valid JSON. Missing binary → distinct error. Non-zero exit → includes exit code and stderr. Malformed JSON → parse error.
  • MC-2 (head-hostname): head_hostname extracts the first hostname by ascending task index from the last attempt of each task group.
  • MC-3 (fqdn-idempotent): qualify_fqdn passes through hostnames containing a dot. Short hostnames are qualified via getaddrinfo(AI_CANONNAME). Failure falls back to the raw hostname.
  • MC-4 (fqdn-nonblocking): qualify_fqdn runs the blocking getaddrinfo call via spawn_blocking.
  • MC-5 (admin-port): resolve_admin_port uses the explicit override when provided, otherwise reads the port from MESH_ADMIN_ADDR config.

Enforced by test_head_hostname_*, test_qualify_fqdn_*, test_resolve_mast_*, test_resolve_admin_port_*.
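MC-3 and MC-5 describe simple fallback chains, sketched below. The signatures are hypothetical: the real `qualify_fqdn` calls `getaddrinfo` (here replaced by an injected `resolve` closure so the example stays standard-library only), and the assumption that the `MESH_ADMIN_ADDR` value ends in `:<port>` is made for illustration.

```rust
/// MC-3 sketch: hostnames containing a dot pass through unchanged; short
/// names go through `resolve`; a failed lookup falls back to the raw name.
pub fn qualify_fqdn<F>(hostname: &str, resolve: F) -> String
where
    F: FnOnce(&str) -> Option<String>,
{
    if hostname.contains('.') {
        return hostname.to_string();
    }
    resolve(hostname).unwrap_or_else(|| hostname.to_string())
}

/// MC-5 sketch: an explicit override wins; otherwise the port is parsed from
/// the configured address, assumed here to have the form `<host>:<port>`.
pub fn resolve_admin_port(override_port: Option<u16>, configured_addr: &str) -> Option<u16> {
    override_port.or_else(|| {
        configured_addr
            .rsplit_once(':')
            .and_then(|(_, port)| port.parse().ok())
    })
}
```

The dot check makes `qualify_fqdn` idempotent: applying it to its own output returns the same string without another lookup.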

Structs§

ApiError
Structured error response following the gateway RFC envelope pattern.
ApiErrorEnvelope
Wrapper for the structured error envelope.
MeshAdminAddrResponse
Response payload for MeshAdminMessage::GetAdminAddr.
MeshAdminAgent
Actor that serves a mesh-level admin HTTP endpoint.
ResolveReferenceResponse
Newtype wrapper around Result<NodePayload, String> for the resolve reply port (OncePortRef requires Named).

Enums§

MeshAdminMessage
Messages handled by the MeshAdminAgent.
ResolveReferenceMessage
Message for resolving an opaque reference string into a NodePayload.

Constants§

MESH_ADMIN_ACTOR_NAME
Actor name used when spawning the mesh admin agent.
MESH_ADMIN_BRIDGE_NAME
Actor name for the HTTP bridge client mailbox on the service proc.

Traits§

MeshAdminMessageClient
The custom client trait for this message type.
MeshAdminMessageHandler
The custom handler trait for this message type.
ResolveReferenceMessageClient
The custom client trait for this message type.
ResolveReferenceMessageHandler
The custom handler trait for this message type.

Functions§

build_openapi_spec
Build the OpenAPI 3.1 spec, embedding schemars-derived JSON Schemas into components/schemas.
resolve_mast_handle
Resolve a mast_conda:///<job-name> handle into an https://<fqdn>:<port> base URL using the mast CLI.