Module introspect

Mesh-topology introspection types and attrs.

This module owns the typed internal model used by mesh-admin and the TUI: mesh-topology attr keys, typed attrs views, NodeRef, and the domain NodePayload / NodeProperties / FailureInfo values derived from hyperactor::introspect::IntrospectResult.

These keys are published by HostMeshAgent, ProcAgent, and MeshAdminAgent to describe mesh topology (hosts, procs, root). Actor-runtime keys (status, actor_type, messages_processed, etc.) are declared in hyperactor::introspect.

The HTTP wire representations live in dto. That submodule owns the curl-friendly JSON contract, schema/OpenAPI generation, and boundary invariants for string-encoded references and timestamps. This module keeps the internal typed invariants.

See hyperactor::introspect for the naming convention, invariant labels, and the IntrospectAttr meta-attribute pattern.

§Mesh key invariants (MK-*)

  • MK-1 (metadata completeness): Every mesh-topology introspection key must carry @meta(INTROSPECT = ...) with non-empty name and desc.
  • MK-2 (short-name uniqueness): Covered by test_introspect_short_names_are_globally_unique in hyperactor::introspect (cross-crate).

§HTTP boundary invariants (HB-*)

These govern the HTTP DTO layer in dto.

  • HB-1 (typed-internal, string-external): NodeRef, ActorId, ProcId, and SystemTime are typed Rust values internally. At the HTTP JSON boundary, dto::NodePayloadDto, dto::NodePropertiesDto, and dto::FailureInfoDto encode them as canonical strings.
  • HB-2 (round-trip): The HTTP string forms round-trip through the internal typed parsers (NodeRef::from_str, ActorId::from_str, humantime::parse_rfc3339). Timestamps are formatted at millisecond precision; sub-millisecond values are truncated at the boundary.
  • HB-3 (schema-honesty): Schema/OpenAPI are generated from the DTO types, so the published schema reflects the actual wire format rather than the internal domain representation.
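
The millisecond truncation in HB-2 implies that an exact round trip holds only for timestamps that already sit on a millisecond boundary. A minimal sketch of that boundary behaviour, using only the standard library: truncate_to_millis and survives_boundary are hypothetical helpers written for illustration, not functions from this module or from humantime.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Truncate a timestamp to whole milliseconds, mimicking the precision
/// that HB-2 describes for the HTTP boundary.
/// Hypothetical helper; not part of the documented API.
fn truncate_to_millis(t: SystemTime) -> SystemTime {
    // Pre-epoch timestamps are out of scope for this sketch and clamp to the epoch.
    let since_epoch = t.duration_since(UNIX_EPOCH).unwrap_or(Duration::ZERO);
    let millis = since_epoch.as_millis() as u64;
    UNIX_EPOCH + Duration::from_millis(millis)
}

/// True when a value would cross the boundary unchanged.
fn survives_boundary(t: SystemTime) -> bool {
    truncate_to_millis(t) == t
}
```

Under these assumptions, a value such as 1.234567890s after the epoch comes back as 1.234s, so callers comparing timestamps across the boundary should compare at millisecond precision.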

§Attrs invariants (IA-*)

These govern how IntrospectResult.attrs is built in hyperactor::introspect and how properties is derived via derive_properties.

  • IA-1 (attrs-json): IntrospectResult.attrs is always a valid JSON object string.
  • IA-2 (runtime-precedence): Runtime-owned introspection keys override any same-named keys in published attrs.
  • IA-3 (status-shape): status_reason is present in attrs iff the status string carries a reason.
  • IA-4 (failure-shape): failure_* attrs are present iff effective status is failed.
  • IA-5 (payload-totality): Every IntrospectResult sets attrs – never omitted, never null.
  • IA-6 (open-row-forward-compat): View decoders ignore unknown attrs keys; only required known keys and local invariants affect decoding outcome. Concretized by AV-3.
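
The runtime-precedence rule (IA-2) can be pictured as a map merge in which runtime-owned entries are written last. The sketch below is illustrative only: it models attrs as a string-to-string map rather than the JSON object the module actually builds, and merge_attrs is a hypothetical name.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for attrs; the real attrs are a JSON object string.
type Attrs = BTreeMap<String, String>;

/// Merge published attrs with runtime-owned attrs so that runtime keys win
/// on collision (the IA-2 rule). Hypothetical helper for illustration.
fn merge_attrs(published: &Attrs, runtime: &Attrs) -> Attrs {
    let mut merged = published.clone();
    // Inserting runtime entries last overwrites same-named published keys.
    for (key, value) in runtime {
        merged.insert(key.clone(), value.clone());
    }
    merged
}
```

Keys that only the publisher supplies pass through untouched; only same-named keys are replaced.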

§Attrs view invariants (AV-*)

These govern the typed view layer (*AttrsView structs).

  • AV-1 (view-roundtrip): For each view V, V::from_attrs(&v.to_attrs()) == Ok(v) (modulo documented normalization/defaulting).
  • AV-2 (required-key-strictness): from_attrs fails iff required keys for that view are missing.
  • AV-3 (unknown-key-tolerance): Unknown attrs keys must not affect successful decode outcome. Concretization of IA-6.
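
The three view invariants can be exercised on a toy view. The sketch below assumes a simplified attrs map with integer values and a hypothetical DemoProcView; the key strings num_actors and failed_actor_count are assumed short names, and none of this reproduces the real ProcAttrsView.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for attrs: key to integer value.
type Attrs = BTreeMap<String, u64>;

/// Hypothetical typed view, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct DemoProcView {
    num_actors: u64,
    failed_actor_count: u64,
}

/// Decode error naming the required key that was absent.
#[derive(Debug, PartialEq)]
struct MissingKey(&'static str);

impl DemoProcView {
    /// Fails only when a required key is missing (AV-2);
    /// keys it does not know about are never inspected (AV-3).
    fn from_attrs(attrs: &Attrs) -> Result<Self, MissingKey> {
        let get = |key: &'static str| attrs.get(key).copied().ok_or(MissingKey(key));
        Ok(Self {
            num_actors: get("num_actors")?,
            failed_actor_count: get("failed_actor_count")?,
        })
    }

    /// Inverse of from_attrs, so the round trip in AV-1 holds.
    fn to_attrs(&self) -> Attrs {
        let mut attrs = Attrs::new();
        attrs.insert("num_actors".to_string(), self.num_actors);
        attrs.insert("failed_actor_count".to_string(), self.failed_actor_count);
        attrs
    }
}
```

Because the decoder reads only the keys it names, adding a future key to the attrs map cannot change the decode result, which is the forward-compatibility property IA-6 asks for.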

§Derive invariants (DP-*)

  • DP-1 (derive-precedence): derive_properties dispatches on node_type first, then falls back to error_code, then status, then unknown. This order is the canonical detection chain.
  • DP-2 (derive-totality-on-parse-failure): derive_properties is total; malformed or incoherent attrs never panic and map to NodeProperties::Error with detail.
  • DP-3 (derive-precedence-stability): derive_properties detection order is stable and explicit: node_type > error_code > status > unknown.
  • DP-4 (error-on-decode-failure): Any view decode or invariant failure maps to a deterministic NodeProperties::Error with a malformed_* code family, without panic.
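
The detection chain in DP-1 and DP-3 amounts to an ordered series of key lookups with a final fallback. A minimal sketch of that ordering, assuming a string-valued attrs map: the Detected enum and detect function are hypothetical and far simpler than derive_properties, which also decodes the matching view and maps failures to NodeProperties::Error.

```rust
use std::collections::BTreeMap;

/// Which key decided the classification. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum Detected {
    ByNodeType(String),
    ByErrorCode(String),
    ByStatus(String),
    Unknown,
}

/// Apply the documented order: node_type > error_code > status > unknown.
/// Every input maps to some variant, so the function is total and never panics.
fn detect(attrs: &BTreeMap<String, String>) -> Detected {
    if let Some(node_type) = attrs.get("node_type") {
        Detected::ByNodeType(node_type.clone())
    } else if let Some(code) = attrs.get("error_code") {
        Detected::ByErrorCode(code.clone())
    } else if let Some(status) = attrs.get("status") {
        Detected::ByStatus(status.clone())
    } else {
        Detected::Unknown
    }
}
```

When several of these keys are present at once, the earliest one in the chain decides, which is what makes the order part of the contract rather than an implementation detail.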

§py-spy integration (PS-*)

  • PS-1 (target locality): PySpyDump always targets std::process::id() of the handling ProcAgent process. No caller-supplied PID exists in the API.
  • PS-2 (deterministic failure shape): Execution failures are classified into BinaryNotFound { searched } vs Failed { pid, binary, exit_code, stderr }, never collapsed.
  • PS-3 (binary resolution order): Resolution order is exactly: PYSPY_BIN config attr (if non-empty) then "py-spy" on PATH. The attr is read via hyperactor_config::global::get_cloned; env var PYSPY_BIN feeds in through the config layer. If the first attempt is not found, the fallback attempt is required.
  • PS-4 (structured JSON output): py-spy runs with --json; output is parsed into Vec<PySpyStackTrace>. Parse failure maps to PySpyResult::Failed.
  • PS-5 (subprocess timeout): try_exec bounds the py-spy subprocess inside the worker to MESH_ADMIN_PYSPY_TIMEOUT (default 10s). The budget is sized for --native --native-all, which unwinds native stacks via libunwind and is significantly slower than Python-only capture on loaded hosts. On expiry the child is killed and reaped, and the worker returns Failed { stderr: "…timed out…" }.
  • PS-6 (bridge timeout): The HTTP bridge uses a separate MESH_ADMIN_PYSPY_BRIDGE_TIMEOUT (default 13s), which must exceed MESH_ADMIN_PYSPY_TIMEOUT so the subprocess kill/reap and reply can arrive before the bridge declares gateway_timeout. Independent of MESH_ADMIN_SINGLE_HOST_TIMEOUT.
  • PS-7 (non-blocking delegation): ProcAgent never awaits py-spy execution inline. On PySpyDump it spawns a child PySpyWorker, forwards the request, and returns immediately.
  • PS-8 (worker lifecycle): Each PySpyWorker handles exactly one forwarded RunPySpyDump, replies directly to the forwarded OncePortRef, then self-terminates via cx.stop(). Clean exit, no supervision event.
  • PS-9 (concurrent dumps): py-spy is spawn-per-request, so overlapping dumps on the same proc are allowed. Each worker runs independently.
  • PS-10 (nonblocking retry): In nonblocking mode, try_exec retries up to 3 times with 100ms backoff on failure, because py-spy can segfault reading mutating process memory. All attempts share a single deadline bounded by MESH_ADMIN_PYSPY_TIMEOUT (PS-5).
  • PS-11a (native-all-immediate-downgrade): If py-spy rejects --native-all with the recognized unsupported-flag signature (exit code 2, stderr mentions --native-all), try_exec retries immediately with native_all = false in the same outer attempt.
  • PS-11b (native-all-no-retry-consumption): That downgrade retry does not consume an outer nonblocking retry slot (PS-10) and does not incur the 100ms inter-attempt backoff.
  • PS-11c (native-all-downgrade-warning): A successful downgraded result includes the warning "--native-all unsupported by this py-spy; fell back to --native".
  • PS-11d (native-all-failure-passthrough): If the downgraded retry also fails, the failure flows through the normal nonblocking retry logic (PS-10) unchanged.
  • PS-11e (native-all-sticky-downgrade): Once the unsupported-flag signature is detected, effective_opts.native_all remains false for all subsequent outer retries. The flag is not re-tested on later attempts.
  • PS-12 (universal py-spy): Worker procs and the service proc can handle PySpyDump. Worker procs handle it via ProcAgent; the service proc handles it via HostAgent (same spawn-worker pattern). pyspy_bridge routes by proc name: if proc_id.base_name() == SERVICE_PROC_NAME, the target is host_agent; otherwise proc_agent[0]. Procs lacking either agent (e.g. mesh-admin) fast-fail via PS-13.
  • PS-13 (defensive probe): Before sending PySpyDump, pyspy_bridge probes the selected actor with an introspect query bounded by MESH_ADMIN_QUERY_CHILD_TIMEOUT (default 100ms). Three outcomes: (a) probe reply arrives — proceed with PySpyDump; (b) probe times out or recv closes — return not_found (actor absent/unreachable); (c) probe send itself fails — return internal_error (bridge-side infrastructure failure). Cases (b) and (c) fast-fail instead of waiting the full 13s MESH_ADMIN_PYSPY_BRIDGE_TIMEOUT.
  • PS-14 (reachability-based capability): A proc supports py-spy iff its stable handler actor is reachable: the service proc requires a reachable host_agent; non-service procs require a reachable proc_agent[0]. PySpyWorker is transient per-request machinery (spawned on PySpyDump, stopped after replying) and is not part of the reachability contract.
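
The interaction between the retry budget (PS-10), the shared deadline (PS-5), and the --native-all downgrade (PS-11a, PS-11b, PS-11e) can be sketched as a single loop. This is a synchronous simplification under stated assumptions, not the real try_exec: the Attempt and Outcome types, the run_with_retries name, and the reading of "up to 3 times" as three outer attempts in total are all assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Result of one simulated py-spy invocation. Hypothetical type.
#[derive(Debug, PartialEq)]
enum Attempt {
    Ok,
    UnsupportedNativeAll,
    Failed,
}

/// Summary of the whole retry loop. Hypothetical type.
#[derive(Debug, PartialEq)]
struct Outcome {
    ok: bool,
    outer_attempts: u32,
    downgraded: bool,
}

// Assumption: three outer attempts in total.
const MAX_OUTER_ATTEMPTS: u32 = 3;

fn run_with_retries<F>(
    mut attempt: F,
    mut native_all: bool,
    deadline: Instant,
    backoff: Duration,
) -> Outcome
where
    F: FnMut(bool) -> Attempt,
{
    let mut downgraded = false;
    let mut outer_attempts = 0;
    // All attempts share one deadline (PS-10).
    while outer_attempts < MAX_OUTER_ATTEMPTS && Instant::now() < deadline {
        outer_attempts += 1;
        let mut result = attempt(native_all);
        if native_all && result == Attempt::UnsupportedNativeAll {
            // Immediate downgrade inside the same outer attempt: no retry
            // slot consumed, no backoff (PS-11a, PS-11b). The flag stays
            // off for every later attempt (PS-11e).
            native_all = false;
            downgraded = true;
            result = attempt(native_all);
        }
        if result == Attempt::Ok {
            return Outcome { ok: true, outer_attempts, downgraded };
        }
        if outer_attempts < MAX_OUTER_ATTEMPTS {
            std::thread::sleep(backoff);
        }
    }
    Outcome { ok: false, outer_attempts, downgraded }
}
```

Passing the invocation as a closure keeps the sketch testable without a py-spy binary; a caller would pass a 100ms backoff and a deadline derived from MESH_ADMIN_PYSPY_TIMEOUT to approximate the documented behaviour.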

v1 contract notes:

  • The current py-spy bridge expects a ProcId-form reference and rejects other forms as bad_request. This may be broadened in future versions.
  • If worker.send() fails after the reply port has moved into RunPySpyDump, the caller receives no explicit PySpyResult::Failed — they observe a timeout. MailboxSenderError does not carry the unsent message, so the port is irrecoverable on this path.
  • Contract change (D96756537 follow-up): PySpyResult::Ok replaced stack: String (raw py-spy text) with stack_traces: Vec<PySpyStackTrace> (structured JSON) and added warnings: Vec<String>. Clients reading the old stack field will see it absent; they must migrate to stack_traces.

§Mesh-admin config (MA-*)

  • MA-C1 (timeout config centralization): Mesh-admin timeout budgets are read from config attrs at call-time, with defaults in config.rs. No hardcoded timeout constants in mesh_admin.rs.
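
Reading budgets at call-time (MA-C1) also makes the ordering constraint from PS-6 checkable against whatever values are currently configured. The sketch below uses a hypothetical in-memory Config with string keys; the real module reads typed config attrs, and only the key names and the 10s and 13s defaults are taken from the text above.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Stand-in for a config store. Hypothetical type for illustration.
struct Config {
    overrides: HashMap<&'static str, Duration>,
}

impl Config {
    /// Look the value up on every call, falling back to the default.
    fn get(&self, key: &'static str, default: Duration) -> Duration {
        self.overrides.get(key).copied().unwrap_or(default)
    }

    fn pyspy_timeout(&self) -> Duration {
        self.get("MESH_ADMIN_PYSPY_TIMEOUT", Duration::from_secs(10))
    }

    fn pyspy_bridge_timeout(&self) -> Duration {
        self.get("MESH_ADMIN_PYSPY_BRIDGE_TIMEOUT", Duration::from_secs(13))
    }

    /// PS-6: the bridge budget must exceed the subprocess budget.
    fn budgets_are_consistent(&self) -> bool {
        self.pyspy_bridge_timeout() > self.pyspy_timeout()
    }
}
```

With the defaults the check passes; raising only the subprocess timeout past 13s would make the bridge report gateway_timeout before the worker can reply, which is the situation PS-6 is written to rule out.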

Modules§

dto
HTTP boundary DTO types for mesh-admin introspection.

Structs§

ErrorAttrsView
Typed view over attrs for an error node.
FailureInfo
Structured failure information for failed actors.
HostAttrsView
Typed view over attrs for a host node.
NodePayload
Uniform response for any node in the mesh topology.
ProcAttrsView
Typed view over attrs for a proc node.
RootAttrsView
Typed view over attrs for a root node.

Enums§

NodeProperties
Node-specific metadata. Externally-tagged enum — the variant name is the discriminator (Root, Host, Proc, Actor, Error).
NodeRef
Typed reference to a node in the mesh-admin navigation tree.
NodeRefParseError
Error parsing a NodeRef from a string.

Statics§

ADDR
Host network address (e.g. “10.0.0.1:8080”).
FAILED_ACTOR_COUNT
Count of failed actors in a proc.
IS_POISONED
Whether this proc is refusing new spawns due to actor failures.
NODE_TYPE
Topology role of this node: “root”, “host”, “proc”, “error”.
NUM_ACTORS
Number of actors in a proc.
NUM_HOSTS
Number of hosts in the mesh (root only).
NUM_PROCS
Number of procs on a host.
PROC_NAME
Human-readable proc name.
STARTED_AT
Timestamp when the mesh was started.
STARTED_BY
Username who started the mesh.
STOPPED_CHILDREN
References of stopped children (proc only).
STOPPED_RETENTION_CAP
Cap on stopped children retention.
SYSTEM_CHILDREN
References of system/infrastructure children.

Functions§

derive_properties
Derive NodeProperties from a JSON-serialized attrs string.
to_node_payload
Convert an IntrospectResult to a presentation NodePayload. Lifts IntrospectRef to NodeRef and passes through typed timestamps.
to_node_payload_with
Convert an IntrospectResult to a NodePayload, overriding identity and parent for correct tree navigation.