Mesh-topology introspection types and attrs.
This module owns the typed internal model used by mesh-admin and the
TUI: mesh-topology attr keys, typed attrs views, NodeRef, and the
domain NodePayload / NodeProperties / FailureInfo values derived
from hyperactor::introspect::IntrospectResult.
These keys are published by HostMeshAgent, ProcAgent, and
MeshAdminAgent to describe mesh topology (hosts, procs, root).
Actor-runtime keys (status, actor_type, messages_processed, etc.) are
declared in hyperactor::introspect.
The HTTP wire representations live in dto. That submodule owns the
curl-friendly JSON contract, schema/OpenAPI generation, and boundary
invariants for string-encoded references and timestamps. This module
keeps the internal typed invariants.
See hyperactor::introspect for naming convention, invariant
labels, and the IntrospectAttr meta-attribute pattern.
§Mesh key invariants (MK-*)
- MK-1 (metadata completeness): Every mesh-topology introspection key must carry @meta(INTROSPECT = ...) with non-empty name and desc.
- MK-2 (short-name uniqueness): Covered by test_introspect_short_names_are_globally_unique in hyperactor::introspect (cross-crate).
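The two checks can be pictured with a small sketch. KeyMeta and both helper functions are hypothetical stand-ins for illustration only; the real metadata is attached through @meta(INTROSPECT = ...), and the real MK-2 check is the test named above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the metadata carried by an introspection key.
struct KeyMeta {
    name: &'static str,
    desc: &'static str,
}

/// MK-1: every key carries a non-empty `name` and `desc`.
fn metadata_complete(keys: &[KeyMeta]) -> bool {
    keys.iter().all(|k| !k.name.is_empty() && !k.desc.is_empty())
}

/// MK-2: no two keys share the same short name.
fn short_names_unique(keys: &[KeyMeta]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false when the name was already present.
    keys.iter().all(|k| seen.insert(k.name))
}
```

Running both helpers over every declared key in a unit test would flag an empty description or a duplicated short name at build time rather than at query time.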
§HTTP boundary invariants (HB-*)
These govern the HTTP DTO layer in dto.
- HB-1 (typed-internal, string-external): NodeRef, ActorId, ProcId, and SystemTime are typed Rust values internally. At the HTTP JSON boundary, dto::NodePayloadDto, dto::NodePropertiesDto, and dto::FailureInfoDto encode them as canonical strings.
- HB-2 (round-trip): The HTTP string forms round-trip through the internal typed parsers (NodeRef::from_str, ActorId::from_str, humantime::parse_rfc3339). Timestamps are formatted at millisecond precision; sub-millisecond values are truncated at the boundary.
- HB-3 (schema-honesty): Schema/OpenAPI are generated from the DTO types, so the published schema reflects the actual wire format rather than the internal domain representation.
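One consequence of HB-2 is that a timestamp only round-trips exactly once it has been truncated to milliseconds. The sketch below illustrates that truncation with the standard library alone; truncate_to_millis is a hypothetical helper, not part of dto, and the real boundary formats and parses through humantime.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Truncate a timestamp to whole milliseconds since the Unix epoch.
/// Returns `None` for pre-epoch timestamps (a simplification of this sketch).
fn truncate_to_millis(t: SystemTime) -> Option<SystemTime> {
    let d = t.duration_since(UNIX_EPOCH).ok()?;
    // Keep whole seconds plus whole milliseconds; drop the remaining digits.
    let truncated = Duration::new(d.as_secs(), d.subsec_millis() * 1_000_000);
    Some(UNIX_EPOCH + truncated)
}
```

Truncation is idempotent, so a value that has crossed the boundary once is stable on every later crossing.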
§Attrs invariants (IA-*)
These govern how IntrospectResult.attrs is built in
hyperactor::introspect and how properties is derived via
derive_properties.
- IA-1 (attrs-json): IntrospectResult.attrs is always a valid JSON object string.
- IA-2 (runtime-precedence): Runtime-owned introspection keys override any same-named keys in published attrs.
- IA-3 (status-shape): status_reason is present in attrs iff the status string carries a reason.
- IA-4 (failure-shape): failure_* attrs are present iff effective status is failed.
- IA-5 (payload-totality): Every IntrospectResult sets attrs: never omitted, never null.
- IA-6 (open-row-forward-compat): View decoders ignore unknown attrs keys; only required known keys and local invariants affect decoding outcome. Concretized by AV-3.
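IA-2 and IA-4 can be illustrated with a simplified model in which attrs is a flat string map rather than a JSON object string. Both functions below are hypothetical, and the exact-match test on "failed" is an assumption of this sketch; the real effective-status derivation lives in hyperactor::introspect.

```rust
use std::collections::BTreeMap;

/// Merge published attrs with runtime-owned attrs. On a key collision the
/// runtime value wins (IA-2).
fn merge_attrs(
    published: &BTreeMap<String, String>,
    runtime: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut merged = published.clone();
    for (k, v) in runtime {
        // Inserting after the clone overwrites any published value.
        merged.insert(k.clone(), v.clone());
    }
    merged
}

/// IA-4: `failure_*` keys are present iff the status is `failed`.
fn failure_shape_ok(attrs: &BTreeMap<String, String>) -> bool {
    let failed = attrs.get("status").map(String::as_str) == Some("failed");
    let has_failure_keys = attrs.keys().any(|k| k.starts_with("failure_"));
    failed == has_failure_keys
}
```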
§Attrs view invariants (AV-*)
These govern the typed view layer (*AttrsView structs).
- AV-1 (view-roundtrip): For each view V, V::from_attrs(&v.to_attrs()) == Ok(v) (modulo documented normalization/defaulting).
- AV-2 (required-key-strictness): from_attrs fails iff required keys for that view are missing.
- AV-3 (unknown-key-tolerance): Unknown attrs keys must not affect successful decode outcome. Concretization of IA-6.
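The three view invariants fit in a short sketch. DemoView, ViewError, and the flat string-map Attrs alias are hypothetical and much simpler than the real *AttrsView structs; the key names are borrowed from the statics in this module purely for flavor.

```rust
use std::collections::BTreeMap;

type Attrs = BTreeMap<String, String>;

/// Hypothetical view with two required keys.
#[derive(Debug, Clone, PartialEq)]
struct DemoView {
    addr: String,
    started_by: String,
}

#[derive(Debug, PartialEq)]
enum ViewError {
    MissingKey(&'static str),
}

impl DemoView {
    /// Fails only when a required key is missing (AV-2).
    fn from_attrs(attrs: &Attrs) -> Result<Self, ViewError> {
        let addr = attrs.get("addr").ok_or(ViewError::MissingKey("addr"))?;
        let started_by = attrs
            .get("started_by")
            .ok_or(ViewError::MissingKey("started_by"))?;
        // Any other keys in `attrs` are deliberately ignored (AV-3).
        Ok(DemoView {
            addr: addr.clone(),
            started_by: started_by.clone(),
        })
    }

    /// Inverse of `from_attrs`, so the round trip in AV-1 holds.
    fn to_attrs(&self) -> Attrs {
        let mut attrs = Attrs::new();
        attrs.insert("addr".to_string(), self.addr.clone());
        attrs.insert("started_by".to_string(), self.started_by.clone());
        attrs
    }
}
```

A property test per view that asserts from_attrs(&v.to_attrs()) == Ok(v), with and without extra keys added, covers AV-1 and AV-3 together.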
§Derive invariants (DP-*)
- DP-1 (derive-precedence): derive_properties dispatches on node_type first, then falls back to error_code, then status, then unknown. This order is the canonical detection chain.
- DP-2 (derive-totality-on-parse-failure): derive_properties is total; malformed or incoherent attrs never panic and map to NodeProperties::Error with detail.
- DP-3 (derive-precedence-stability): derive_properties detection order is stable and explicit: node_type > error_code > status > unknown.
- DP-4 (error-on-decode-failure): Any view decode or invariant failure maps to a deterministic NodeProperties::Error with a malformed_* code family, without panic.
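The detection chain in DP-1 and DP-3 amounts to a first-match lookup. The sketch below models only the ordering; Detected and detect are hypothetical names, attrs is simplified to a flat string map, and the real derive_properties goes on to decode a typed view for the selected branch.

```rust
use std::collections::BTreeMap;

/// Which branch of the detection chain was selected.
#[derive(Debug, PartialEq)]
enum Detected<'a> {
    ByNodeType(&'a str),
    ByErrorCode(&'a str),
    ByStatus(&'a str),
    Unknown,
}

/// First match wins: node_type > error_code > status > unknown.
/// The function is total: every input maps to some variant, never a panic.
fn detect<'a>(attrs: &'a BTreeMap<String, String>) -> Detected<'a> {
    if let Some(t) = attrs.get("node_type") {
        Detected::ByNodeType(t.as_str())
    } else if let Some(c) = attrs.get("error_code") {
        Detected::ByErrorCode(c.as_str())
    } else if let Some(s) = attrs.get("status") {
        Detected::ByStatus(s.as_str())
    } else {
        Detected::Unknown
    }
}
```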
§py-spy integration (PS-*)
- PS-1 (target locality): PySpyDump always targets std::process::id() of the handling ProcAgent process. No caller-supplied PID exists in the API.
- PS-2 (deterministic failure shape): Execution failures are classified into BinaryNotFound { searched } vs Failed { pid, binary, exit_code, stderr }, never collapsed.
- PS-3 (binary resolution order): Resolution order is exactly: PYSPY_BIN config attr (if non-empty) then "py-spy" on PATH. The attr is read via hyperactor_config::global::get_cloned; env var PYSPY_BIN feeds in through the config layer. If the first attempt is not found, the fallback attempt is required.
- PS-4 (structured JSON output): py-spy runs with --json; output is parsed into Vec<PySpyStackTrace>. Parse failure maps to PySpyResult::Failed.
- PS-5 (subprocess timeout): try_exec bounds the py-spy subprocess inside the worker to MESH_ADMIN_PYSPY_TIMEOUT (default 10s). The budget is sized for --native --native-all, which unwinds native stacks via libunwind and is significantly slower than Python-only capture on loaded hosts. On expiry the child is killed and reaped, and the worker returns Failed { stderr: "…timed out…" }.
- PS-6 (bridge timeout): The HTTP bridge uses a separate MESH_ADMIN_PYSPY_BRIDGE_TIMEOUT (default 13s), which must exceed MESH_ADMIN_PYSPY_TIMEOUT so the subprocess kill/reap and reply can arrive before the bridge declares gateway_timeout. Independent of MESH_ADMIN_SINGLE_HOST_TIMEOUT.
- PS-7 (non-blocking delegation): ProcAgent never awaits py-spy execution inline. On PySpyDump it spawns a child PySpyWorker, forwards the request, and returns immediately.
- PS-8 (worker lifecycle): Each PySpyWorker handles exactly one forwarded RunPySpyDump, replies directly to the forwarded OncePortRef, then self-terminates via cx.stop(). Clean exit, no supervision event.
- PS-9 (concurrent dumps): py-spy is spawn-per-request, so overlapping dumps on the same proc are allowed. Each worker runs independently.
- PS-10 (nonblocking retry): In nonblocking mode, try_exec retries up to 3 times with 100ms backoff on failure, because py-spy can segfault reading mutating process memory. All attempts share a single deadline bounded by MESH_ADMIN_PYSPY_TIMEOUT (PS-5).
- PS-11a (native-all-immediate-downgrade): If py-spy rejects --native-all with the recognized unsupported-flag signature (exit code 2, stderr mentions --native-all), try_exec retries immediately with native_all = false in the same outer attempt.
- PS-11b (native-all-no-retry-consumption): That downgrade retry does not consume an outer nonblocking retry slot (PS-10) and does not incur the 100ms inter-attempt backoff.
- PS-11c (native-all-downgrade-warning): A successful downgraded result includes the warning "--native-all unsupported by this py-spy; fell back to --native".
- PS-11d (native-all-failure-passthrough): If the downgraded retry also fails, the failure flows through the normal nonblocking retry logic (PS-10) unchanged.
- PS-11e (native-all-sticky-downgrade): Once the unsupported-flag signature is detected, effective_opts.native_all remains false for all subsequent outer retries. The flag is not re-tested on later attempts.
- PS-12 (universal py-spy): Worker procs and the service proc can handle PySpyDump. Worker procs handle it via ProcAgent; the service proc handles it via HostAgent (same spawn-worker pattern). pyspy_bridge routes by proc name: if proc_id.base_name() == SERVICE_PROC_NAME, the target is host_agent; otherwise proc_agent[0]. Procs lacking either agent (e.g. mesh-admin) fast-fail via PS-13.
- PS-13 (defensive probe): Before sending PySpyDump, pyspy_bridge probes the selected actor with an introspect query bounded by MESH_ADMIN_QUERY_CHILD_TIMEOUT (default 100ms). Three outcomes: (a) probe reply arrives: proceed with PySpyDump; (b) probe times out or recv closes: return not_found (actor absent/unreachable); (c) probe send itself fails: return internal_error (bridge-side infrastructure failure). Cases (b) and (c) fast-fail instead of waiting the full 13s MESH_ADMIN_PYSPY_BRIDGE_TIMEOUT.
- PS-14 (reachability-based capability): A proc supports py-spy iff its stable handler actor is reachable: the service proc requires a reachable host_agent; non-service procs require a reachable proc_agent[0]. PySpyWorker is transient per-request machinery (spawned on PySpyDump, stopped after replying) and is not part of the reachability contract.
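The interaction between the outer retry loop (PS-10) and the --native-all downgrade (PS-11a through PS-11e) is easier to follow as code. The sketch below is not try_exec: Attempt, Outcome, and run_with_retries are hypothetical, the py-spy invocation is replaced by a caller-supplied closure, and "retries up to 3 times" is read here as three outer attempts in total, which is an assumption.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Result of one simulated py-spy invocation.
#[derive(Debug, Clone, PartialEq)]
enum Attempt {
    Ok,
    /// Stands in for exit code 2 with stderr naming `--native-all`.
    NativeAllUnsupported,
    Failed,
}

#[derive(Debug, PartialEq)]
struct Outcome {
    ok: bool,
    warnings: Vec<String>,
    outer_attempts: u32,
}

const MAX_ATTEMPTS: u32 = 3; // assumed reading of PS-10

fn run_with_retries<F>(
    mut run: F,
    mut native_all: bool,
    timeout: Duration,
    backoff: Duration,
) -> Outcome
where
    F: FnMut(bool) -> Attempt,
{
    // One deadline shared by every attempt (PS-10).
    let deadline = Instant::now() + timeout;
    let mut warnings = Vec::new();
    let mut outer = 0;
    while outer < MAX_ATTEMPTS && Instant::now() < deadline {
        outer += 1;
        let mut result = run(native_all);
        if native_all && result == Attempt::NativeAllUnsupported {
            // PS-11a/b/e: downgrade at once, keep the downgrade for later
            // attempts, and spend neither a retry slot nor a backoff on it.
            native_all = false;
            warnings.push(
                "--native-all unsupported by this py-spy; fell back to --native".to_string(),
            );
            result = run(native_all);
        }
        if result == Attempt::Ok {
            return Outcome { ok: true, warnings, outer_attempts: outer };
        }
        // PS-11d: a failed downgraded run is treated like any other failure.
        if outer < MAX_ATTEMPTS {
            sleep(backoff);
        }
    }
    Outcome { ok: false, warnings, outer_attempts: outer }
}
```

Passing the backoff as a parameter keeps the sketch fast to exercise in a test; the source fixes it at 100ms.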
v1 contract notes:
- The current py-spy bridge expects a ProcId-form reference and rejects other forms as bad_request. This may be broadened in future versions.
- If worker.send() fails after the reply port has moved into RunPySpyDump, the caller receives no explicit PySpyResult::Failed; they observe a timeout. MailboxSenderError does not carry the unsent message, so the port is irrecoverable on this path.
- Contract change (D96756537 follow-up): PySpyResult::Ok replaced stack: String (raw py-spy text) with stack_traces: Vec<PySpyStackTrace> (structured JSON) and added warnings: Vec<String>. Clients reading the old stack field will see it absent; they must migrate to stack_traces.
§Mesh-admin config (MA-*)
- MA-C1 (timeout config centralization): Mesh-admin timeout budgets are read from config attrs at call-time, with defaults in config.rs. No hardcoded timeout constants in mesh_admin.rs.
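A call-time lookup with defaults, combined with the PS-6 ordering requirement, might look like the sketch below. Config, its overrides map, and pyspy_budgets are hypothetical; the real values come from the hyperactor config layer, and nothing in the source says the ordering is validated at runtime. Only the attr names and the 10s and 13s defaults are taken from the text above.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Defaults quoted in PS-5 and PS-6.
const DEFAULT_PYSPY_TIMEOUT: Duration = Duration::from_secs(10);
const DEFAULT_PYSPY_BRIDGE_TIMEOUT: Duration = Duration::from_secs(13);

/// Hypothetical config store: key -> override in milliseconds.
struct Config {
    overrides: HashMap<&'static str, u64>,
}

impl Config {
    /// Look up a timeout at call time, falling back to `default`.
    fn timeout(&self, key: &str, default: Duration) -> Duration {
        self.overrides
            .get(key)
            .map(|ms| Duration::from_millis(*ms))
            .unwrap_or(default)
    }
}

/// Read both py-spy budgets and check that the bridge outlasts the worker.
fn pyspy_budgets(cfg: &Config) -> Result<(Duration, Duration), String> {
    let worker = cfg.timeout("MESH_ADMIN_PYSPY_TIMEOUT", DEFAULT_PYSPY_TIMEOUT);
    let bridge = cfg.timeout(
        "MESH_ADMIN_PYSPY_BRIDGE_TIMEOUT",
        DEFAULT_PYSPY_BRIDGE_TIMEOUT,
    );
    if bridge > worker {
        Ok((worker, bridge))
    } else {
        Err(format!(
            "bridge timeout {:?} must exceed worker timeout {:?}",
            bridge, worker
        ))
    }
}
```

Reading the values on each call rather than caching them means a changed override takes effect on the next request.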
Modules§
- dto
- HTTP boundary DTO types for mesh-admin introspection.
Structs§
- ErrorAttrsView
- Typed view over attrs for an error node.
- FailureInfo
- Structured failure information for failed actors.
- HostAttrsView
- Typed view over attrs for a host node.
- NodePayload
- Uniform response for any node in the mesh topology.
- ProcAttrsView
- Typed view over attrs for a proc node.
- RootAttrsView
- Typed view over attrs for a root node.
Enums§
- NodeProperties
- Node-specific metadata. Externally-tagged enum: the variant name is the discriminator (Root, Host, Proc, Actor, Error).
- NodeRef
- Typed reference to a node in the mesh-admin navigation tree.
- NodeRefParseError
- Error parsing a NodeRef from a string.
Statics§
- ADDR
- Host network address (e.g. “10.0.0.1:8080”).
- FAILED_ACTOR_COUNT
- Count of failed actors in a proc.
- IS_POISONED
- Whether this proc is refusing new spawns due to actor failures.
- NODE_TYPE
- Topology role of this node: “root”, “host”, “proc”, “error”.
- NUM_ACTORS
- Number of actors in a proc.
- NUM_HOSTS
- Number of hosts in the mesh (root only).
- NUM_PROCS
- Number of procs on a host.
- PROC_NAME
- Human-readable proc name.
- STARTED_AT
- Timestamp when the mesh was started.
- STARTED_BY
- Username who started the mesh.
- STOPPED_CHILDREN
- References of stopped children (proc only).
- STOPPED_RETENTION_CAP
- Cap on stopped children retention.
- SYSTEM_CHILDREN
- References of system/infrastructure children.
Functions§
- derive_properties
- Derive NodeProperties from a JSON-serialized attrs string.
- to_node_payload
- Convert an IntrospectResult to a presentation NodePayload. Lifts IntrospectRef → NodeRef and passes through typed timestamps.
- to_node_payload_with
- Convert an IntrospectResult to a NodePayload, overriding identity and parent for correct tree navigation.