/*
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 * All rights reserved.
 *
 * This source code is licensed under the BSD-style license found in the
 * LICENSE file in the root directory of this source tree.
 */

//! Introspection protocol for hyperactor actors.
//!
//! Every actor has a dedicated introspect task that handles
//! [`IntrospectMessage`] by reading [`InstanceCell`] state directly,
//! without going through the actor's message loop. This means:
//!
//! - Stuck actors can be introspected (the task runs independently).
//! - Introspection does not perturb observed state (no Heisenberg).
//! - Live status is reported accurately.
//!
//! Infrastructure actors publish domain-specific metadata via
//! `publish_attrs()`, which the introspect task reads for Entity-view
//! queries. Non-addressable children (e.g., system procs) are
//! resolved via a callback registered on [`InstanceCell`].
//!
//! Callers navigate topology by fetching an [`IntrospectResult`] and
//! following its `children` references.
//!
//! # Design Invariants
//!
//! The introspection subsystem maintains twelve invariants (S1--S12).
//! Each is documented at the code site that enforces it.
//!
//! - **S1.** Introspection must not depend on actor responsiveness --
//!   a wedged actor can still be introspected (runtime task, not
//!   actor loop).
//! - **S2.** Introspection must not perturb observed state -- reading
//!   `InstanceCell` never sets `last_message_handler` to
//!   `IntrospectMessage`.
//! - **S3.** Sender routing is unchanged -- senders target the same
//!   `PortId` (`IntrospectMessage::port()`) across processes.
//! - **S4.** `IntrospectMessage` never produces a `WorkCell` --
//!   pre-registration via `bind_handler_port` gives the introspect
//!   port its own channel, independent of the actor's work queue.
//! - **S5.** Replies never use `PanickingMailboxSender` -- the
//!   introspect task replies via `Mailbox::serialize_and_send_once`.
//! - **S6.** View semantics are stable -- Actor view uses live
//!   structural state + supervision children; Entity view uses
//!   published properties + domain children.
//! - **S7.** `QueryChild` must work without actor handlers -- system
//!   procs are resolved via a per-actor callback on `InstanceCell`.
//! - **S8.** Published properties are constrained -- actors cannot
//!   publish `Root` or `Error` payloads (only `Host` and `Proc`
//!   variants).
//! - **S9.** Port binding is single source of truth -- the introspect
//!   port is bound exactly once via `bind_handler_port()` in
//!   `Instance::new()`.
//! - **S10.** Introspect receiver lifecycle -- created in
//!   `Instance::new()`, spawned in `start()`, dropped in
//!   `child_instance()`.
//! - **S11.** Terminated snapshots do not keep actors resolvable --
//!   `store_terminated_snapshot` writes to the proc's snapshot map,
//!   not the instances map. `resolve_actor_ref` checks terminal
//!   status independently and is unaffected by snapshot storage.
//! - **S12.** Introspection must not impair actor liveness --
//!   introspection queries (including DashMap reads for actor
//!   enumeration) must not cause convoy starvation or scheduling
//!   delays that stall concurrent actor spawn/stop operations.
//!
//! ## Introspection key invariants (IK-*)
//!
//! - **IK-1 (metadata completeness):** Every actor-runtime
//!   introspection key must carry `@meta(INTROSPECT = ...)` with
//!   non-empty `name` and `desc`.
//! - **IK-2 (short-name uniqueness):** No two introspection keys may
//!   share the same `IntrospectAttr.name`. Duplicates would break the
//!   FQ-to-short HTTP remap and schema output.
//!
//! ## Failure introspection invariants (FI-*)
//!
//! The FailureInfo presentation type lives in
//! `hyperactor_mesh::introspect`; these invariants are documented
//! here because the enforcement sites are in hyperactor (`proc.rs`
//! `serve()`, `live_actor_payload`).
//!
//! - **FI-1 (event-before-status):** All `InstanceCell` state that
//!   `live_actor_payload` reads must be written BEFORE
//!   `change_status()` transitions to terminal.
//! - **FI-2 (write-once):** `InstanceCellState::supervision_event` is
//!   written at most once per actor lifetime.
//! - **FI-3 (failure attrs <-> status):** Failure attrs are present
//!   iff status is `"failed"`.
//! - **FI-4 (is_propagated <-> root_cause_actor):**
//!   `failure_is_propagated == true` iff `failure_root_cause_actor !=
//!   this_actor_id`.
//! - **FI-5 (is_poisoned <-> failed_actor_count):** `is_poisoned ==
//!   true` iff `failed_actor_count > 0`.
//! - **FI-6 (clean stop = no artifacts):** When an actor stops
//!   cleanly, `supervision_event` is `None`, failure attrs are
//!   absent, and the actor does not contribute to
//!   `failed_actor_count`.
//! - **FI-7 (propagated-stopped-root-cause):** When a failed actor's
//!   supervision chain bottoms out in a `Stopped` child event,
//!   structured failure metadata must still name the stopped child as
//!   `failure_root_cause_actor`.
//! - **FI-8 (propagation-classification):** `failure_is_propagated`
//!   is derived from root-cause actor identity; a parent that failed
//!   due to a child's event must report `failure_is_propagated ==
//!   true`.
//!
//! ## Attrs view invariants (AV-*)
//!
//! These govern the typed view layer (`ActorAttrsView`). The full
//! AV-* / DP-* family is documented in `hyperactor_mesh::introspect`;
//! the subset relevant to this crate:
//!
//! - **AV-1 (view-roundtrip):** For each view V,
//!   `V::from_attrs(&v.to_attrs()) == Ok(v)`.
//! - **AV-2 (required-key-strictness):** `from_attrs` fails iff
//!   required keys for that view are missing.
//! - **AV-3 (unknown-key-tolerance):** Unknown attrs keys must not
//!   affect successful decode outcome.

use std::fmt;
use std::str::FromStr;
use std::time::SystemTime;

use hyperactor_config::Attrs;
use hyperactor_config::INTROSPECT;
use hyperactor_config::IntrospectAttr;
use hyperactor_config::declare_attrs;
use serde::Deserialize;
use serde::Serialize;
use typeuri::Named;

use crate::ActorAddr;
use crate::Addr;
use crate::AddrParseError;
use crate::InstanceCell;
use crate::OncePortRef;
use crate::ProcAddr;

/// Typed reference to an introspectable entity.
///
/// This is the generic hyperactor layer -- it knows about procs and
/// actors, not mesh-specific concepts like root or host.
///
/// Port references are intentionally excluded -- introspection
/// does not address individual ports.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize, Named)]
pub enum IntrospectRef {
    /// A proc reference.
    Proc(ProcAddr),
    /// An actor reference.
    Actor(ActorAddr),
}
hyperactor_config::impl_attrvalue!(IntrospectRef);

/// Error returned when parsing an [`IntrospectRef`].
#[derive(Debug, thiserror::Error)]
pub enum IntrospectRefParseError {
    /// The address text could not be parsed.
    #[error(transparent)]
    Addr(#[from] AddrParseError),
    /// Port references are not introspectable.
    #[error("port references are not valid introspection references")]
    PortNotAllowed,
}

impl fmt::Display for IntrospectRef {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Proc(id) => fmt::Display::fmt(id, f),
            Self::Actor(id) => fmt::Display::fmt(id, f),
        }
    }
}

impl FromStr for IntrospectRef {
    type Err = IntrospectRefParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let r: Addr = s.parse()?;
        match r {
            Addr::Proc(id) => Ok(Self::Proc(id)),
            Addr::Actor(id) => Ok(Self::Actor(id)),
            Addr::Port(_) => Err(IntrospectRefParseError::PortNotAllowed),
        }
    }
}
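
// Illustration only, not part of this crate: a minimal, self-contained
// sketch of the parse-then-narrow pattern that `FromStr for
// IntrospectRef` follows. `ToyAddr`, `ToyRef`, `ToyParseError`, and the
// `kind:name` text syntax are hypothetical stand-ins; the real `Addr`
// grammar is not shown in this file and may look quite different.

```rust
use std::str::FromStr;

/// Hypothetical broad address type (stand-in for `Addr`).
#[derive(Debug, PartialEq)]
pub enum ToyAddr {
    Proc(String),
    Actor(String),
    Port(String),
}

/// Hypothetical narrowed reference (stand-in for `IntrospectRef`).
#[derive(Debug, PartialEq)]
pub enum ToyRef {
    Proc(String),
    Actor(String),
}

#[derive(Debug, PartialEq)]
pub enum ToyParseError {
    /// The text did not match the assumed `kind:name` syntax.
    Malformed,
    /// The text parsed, but it names a port.
    PortNotAllowed,
}

impl FromStr for ToyAddr {
    type Err = ToyParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumed syntax: "<kind>:<name>" with a non-empty name.
        let (kind, name) = s.split_once(':').ok_or(ToyParseError::Malformed)?;
        if name.is_empty() {
            return Err(ToyParseError::Malformed);
        }
        match kind {
            "proc" => Ok(Self::Proc(name.to_string())),
            "actor" => Ok(Self::Actor(name.to_string())),
            "port" => Ok(Self::Port(name.to_string())),
            _ => Err(ToyParseError::Malformed),
        }
    }
}

impl FromStr for ToyRef {
    type Err = ToyParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Parse with the broad grammar first, then reject the variant
        // that is not valid in the narrower type.
        match s.parse::<ToyAddr>()? {
            ToyAddr::Proc(name) => Ok(Self::Proc(name)),
            ToyAddr::Actor(name) => Ok(Self::Actor(name)),
            ToyAddr::Port(_) => Err(ToyParseError::PortNotAllowed),
        }
    }
}
```

// Reusing the broad parser keeps a single grammar; the narrow type only
// adds the rejection step, so a well-formed port address yields a
// distinct error rather than a generic parse failure.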

impl From<ProcAddr> for IntrospectRef {
    fn from(id: ProcAddr) -> Self {
        Self::Proc(id)
    }
}

impl From<ActorAddr> for IntrospectRef {
    fn from(id: ActorAddr) -> Self {
        Self::Actor(id)
    }
}

// Introspection attr keys -- actor-runtime concepts.
//
// These keys are populated by the introspect handler from
// InstanceCell data. Mesh-topology keys (node_type, addr, num_procs,
// etc.) are declared in hyperactor_mesh::introspect.
//
// Naming convention:
//
// - Attr names are node-type-agnostic. The `node_type` attr (from the
//   mesh layer) identifies what kind of node it is; individual attr
//   names don't repeat that. So `status`, not `actor_status`.
// - Related attrs share a prefix to form a group. The `failure_*`
//   keys decompose failure info into flat attrs -- the `failure_`
//   prefix groups them semantically.
// - `actor_type` is an exception: the `actor_` prefix disambiguates
//   it from `node_type` (mesh-layer concept). `actor_type` is the
//   Rust actor type name; `node_type` is the topology role.
// - Use real types where possible (e.g. SystemTime for timestamps),
//   not String. Serialization format is a presentation concern.
// - Internal key names are fully-qualified by `declare_attrs!`
//   (module_path + attr constant), e.g.
//   `hyperactor::introspect::status`.
// - HTTP/schema public key names come from `@meta(INTROSPECT =
//   IntrospectAttr { name, desc })`. Keep `name` explicit so API
//   stability is decoupled from internal refactors.
//
// See IK-1 (metadata completeness) and IK-2 (short-name uniqueness)
// in module doc.
declare_attrs! {
    /// Actor lifecycle status: "running", "stopped", "failed".
    ///
    /// Together with `STATUS_REASON`, these two attrs replace the
    /// former `actor_status` prefix protocol (`"stopped:reason"`,
    /// `"failed:reason"`) with structured fields, eliminating string
    /// prefix parsing in consumers.
    @meta(INTROSPECT = IntrospectAttr {
        name: "status".into(),
        desc: "Actor lifecycle status: running, stopped, failed".into(),
    })
    pub attr STATUS: String;

    /// Reason for stop/failure (absent when running).
    @meta(INTROSPECT = IntrospectAttr {
        name: "status_reason".into(),
        desc: "Reason for stop/failure (absent when running)".into(),
    })
    pub attr STATUS_REASON: String;

    /// Fully-qualified actor type name.
    @meta(INTROSPECT = IntrospectAttr {
        name: "actor_type".into(),
        desc: "Fully-qualified actor type name".into(),
    })
    pub attr ACTOR_TYPE: String;

    /// Number of messages processed by this actor.
    @meta(INTROSPECT = IntrospectAttr {
        name: "messages_processed".into(),
        desc: "Number of messages processed by this actor".into(),
    })
    pub attr MESSAGES_PROCESSED: u64 = 0;

    /// Timestamp when this actor was created.
    @meta(INTROSPECT = IntrospectAttr {
        name: "created_at".into(),
        desc: "Timestamp when this actor was created".into(),
    })
    pub attr CREATED_AT: SystemTime;

    /// Name of the last message handler invoked.
    @meta(INTROSPECT = IntrospectAttr {
        name: "last_handler".into(),
        desc: "Name of the last message handler invoked".into(),
    })
    pub attr LAST_HANDLER: String;

    /// Total CPU time in message handlers (microseconds).
    @meta(INTROSPECT = IntrospectAttr {
        name: "total_processing_time_us".into(),
        desc: "Total CPU time in message handlers (microseconds)".into(),
    })
    pub attr TOTAL_PROCESSING_TIME_US: u64 = 0;

    /// Flight recorder JSON (recent trace events).
    @meta(INTROSPECT = IntrospectAttr {
        name: "flight_recorder".into(),
        desc: "Flight recorder JSON (recent trace events)".into(),
    })
    pub attr FLIGHT_RECORDER: String;

    /// Whether this actor is infrastructure/system.
    @meta(INTROSPECT = IntrospectAttr {
        name: "is_system".into(),
        desc: "Whether this actor is infrastructure/system".into(),
    })
    pub attr IS_SYSTEM: bool = false;

    /// Child references for tree navigation. Published by
    /// infrastructure actors (HostMeshAgent, ProcAgent) so the
    /// Entity view can return children without parsing mesh-layer keys.
    @meta(INTROSPECT = IntrospectAttr {
        name: "children".into(),
        desc: "Child references for tree navigation".into(),
    })
    pub attr CHILDREN: Vec<IntrospectRef>;

    /// Machine-readable error code for error nodes.
    @meta(INTROSPECT = IntrospectAttr {
        name: "error_code".into(),
        desc: "Machine-readable error code (e.g. not_found)".into(),
    })
    pub attr ERROR_CODE: String;

    /// Human-readable error message for error nodes.
    @meta(INTROSPECT = IntrospectAttr {
        name: "error_message".into(),
        desc: "Human-readable error message".into(),
    })
    pub attr ERROR_MESSAGE: String;

    // Failure attrs -- decomposition of FailureInfo into flat attrs.
    //
    // - **FI-A1 (presence):** failure_* attrs are present iff
    //   status == "failed"; absent otherwise. (Attr-level restatement
    //   of FI-3.)
    // - **FI-A2 (propagation):** failure_is_propagated == true iff
    //   failure_root_cause_actor != this actor's id. (Attr-level
    //   restatement of FI-4.)
    // FI-1, FI-2 (write ordering) are enforced in proc.rs serve()
    // and are unaffected by the representation change.
    // FI-5, FI-6 are proc/mesh-level and unaffected.

    /// Failure error message.
    @meta(INTROSPECT = IntrospectAttr {
        name: "failure_error_message".into(),
        desc: "Failure error message".into(),
    })
    pub attr FAILURE_ERROR_MESSAGE: String;

    /// Actor that caused the failure (root cause).
    @meta(INTROSPECT = IntrospectAttr {
        name: "failure_root_cause_actor".into(),
        desc: "Actor that caused the failure (root cause)".into(),
    })
    pub attr FAILURE_ROOT_CAUSE_ACTOR: ActorAddr;

    /// Name of root cause actor.
    @meta(INTROSPECT = IntrospectAttr {
        name: "failure_root_cause_name".into(),
        desc: "Name of root cause actor".into(),
    })
    pub attr FAILURE_ROOT_CAUSE_NAME: String;

    /// Timestamp when failure occurred.
    @meta(INTROSPECT = IntrospectAttr {
        name: "failure_occurred_at".into(),
        desc: "Timestamp when failure occurred".into(),
    })
    pub attr FAILURE_OCCURRED_AT: SystemTime;

    /// Whether the failure was propagated from a child.
    @meta(INTROSPECT = IntrospectAttr {
        name: "failure_is_propagated".into(),
        desc: "Whether the failure was propagated from a child".into(),
    })
    pub attr FAILURE_IS_PROPAGATED: bool = false;
}

// See FI-1 through FI-8 in module doc.

/// Error from decoding an `Attrs` bag into a typed view.
#[derive(Debug, Clone, PartialEq)]
pub enum AttrsViewError {
    /// A required key was absent (and has no default).
    MissingKey {
        /// The attr key that was absent.
        key: &'static str,
    },
    /// A cross-field coherence check failed.
    InvariantViolation {
        /// Invariant label (e.g. "IA-4").
        label: &'static str,
        /// Human-readable description of the violation.
        detail: String,
    },
}

impl fmt::Display for AttrsViewError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::MissingKey { key } => write!(f, "missing required key: {key}"),
            Self::InvariantViolation { label, detail } => {
                write!(f, "invariant {label} violated: {detail}")
            }
        }
    }
}

impl std::error::Error for AttrsViewError {}

impl AttrsViewError {
    /// Convenience constructor for a missing required key.
    pub fn missing(key: &'static str) -> Self {
        Self::MissingKey { key }
    }

    /// Convenience constructor for an invariant violation.
    pub fn invariant(label: &'static str, detail: String) -> Self {
        Self::InvariantViolation { label, detail }
    }
}

/// Structured failure fields decoded from `FAILURE_*` attrs.
#[derive(Debug, Clone, PartialEq)]
pub struct FailureAttrs {
    /// Error message describing the failure.
    pub error_message: String,
    /// Actor that caused the failure (root cause).
    pub root_cause_actor: ActorAddr,
    /// Display name of the root-cause actor, if available.
    pub root_cause_name: Option<String>,
    /// When the failure occurred.
    pub occurred_at: SystemTime,
    /// Whether this failure was propagated from a child.
    pub is_propagated: bool,
}

/// Typed view over attrs for an actor node.
#[derive(Debug, Clone, PartialEq)]
pub struct ActorAttrsView {
    /// Lifecycle status: "running", "stopped", "failed".
    pub status: String,
    /// Reason for stop/failure, if any.
    pub status_reason: Option<String>,
    /// Fully-qualified actor type name.
    pub actor_type: String,
    /// Number of messages processed.
    pub messages_processed: u64,
    /// When this actor was created.
    pub created_at: Option<SystemTime>,
    /// Name of the last message handler invoked.
    pub last_handler: Option<String>,
    /// Total CPU time in message handlers (microseconds).
    pub total_processing_time_us: u64,
    /// Flight recorder JSON, if available.
    pub flight_recorder: Option<String>,
    /// Whether this is a system/infrastructure actor.
    pub is_system: bool,
    /// Failure details, present iff status == "failed".
    pub failure: Option<FailureAttrs>,
}

impl ActorAttrsView {
    /// Decode from an `Attrs` bag (AV-2, AV-3). Requires `STATUS`
    /// and `ACTOR_TYPE`. Enforces IA-3 (status_reason must not be
    /// present for non-terminal status), IA-4 (failure attrs iff
    /// failed), and failure completeness (if any required failure
    /// key is present, all three required keys must be).
    pub fn from_attrs(attrs: &Attrs) -> Result<Self, AttrsViewError> {
        let status = attrs
            .get(STATUS)
            .ok_or_else(|| AttrsViewError::missing("status"))?
            .clone();
        let status_reason = attrs.get(STATUS_REASON).cloned();
        let actor_type = attrs
            .get(ACTOR_TYPE)
            .ok_or_else(|| AttrsViewError::missing("actor_type"))?
            .clone();
        let messages_processed = *attrs.get(MESSAGES_PROCESSED).unwrap_or(&0);
        let created_at = attrs.get(CREATED_AT).copied();
        let last_handler = attrs.get(LAST_HANDLER).cloned();
        let total_processing_time_us = *attrs.get(TOTAL_PROCESSING_TIME_US).unwrap_or(&0);
        let flight_recorder = attrs.get(FLIGHT_RECORDER).cloned();
        let is_system = *attrs.get(IS_SYSTEM).unwrap_or(&false);

        // IA-3 (one-sided): status_reason must not be present for
        // non-terminal status. The converse is not enforced --
        // terminal status without a reason is valid (clean shutdown).
        let is_terminal = status == "stopped" || status == "failed";
        if status_reason.is_some() && !is_terminal {
            return Err(AttrsViewError::invariant(
                "IA-3",
                format!(
                    "status_reason present but status is '{status}' (expected stopped or failed)"
                ),
            ));
        }

        // Decode failure attrs. If any of the three required
        // failure keys is present, require all three.
        // FAILURE_IS_PROPAGATED has a declare_attrs! default of
        // false, so it always resolves via attrs.get() and needs
        // no explicit presence check. FAILURE_ROOT_CAUSE_NAME is
        // genuinely optional.
        let has_any_failure = attrs.get(FAILURE_ERROR_MESSAGE).is_some()
            || attrs.get(FAILURE_ROOT_CAUSE_ACTOR).is_some()
            || attrs.get(FAILURE_OCCURRED_AT).is_some();

        let failure = if has_any_failure {
            let error_message = attrs
                .get(FAILURE_ERROR_MESSAGE)
                .ok_or_else(|| AttrsViewError::missing("failure_error_message"))?
                .clone();
            let root_cause_actor = attrs
                .get(FAILURE_ROOT_CAUSE_ACTOR)
                .ok_or_else(|| AttrsViewError::missing("failure_root_cause_actor"))?
                .clone();
            let root_cause_name = attrs.get(FAILURE_ROOT_CAUSE_NAME).cloned();
            let occurred_at = *attrs
                .get(FAILURE_OCCURRED_AT)
                .ok_or_else(|| AttrsViewError::missing("failure_occurred_at"))?;
            // Default false: failure originated at this actor.
            let is_propagated = *attrs.get(FAILURE_IS_PROPAGATED).unwrap_or(&false);
            Some(FailureAttrs {
                error_message,
                root_cause_actor,
                root_cause_name,
                occurred_at,
                is_propagated,
            })
        } else {
            None
        };

        // IA-4: failure attrs present iff status == "failed".
        if status == "failed" && failure.is_none() {
            return Err(AttrsViewError::invariant(
                "IA-4",
                "status is 'failed' but no failure_* attrs present".to_string(),
            ));
        }
        if status != "failed" && failure.is_some() {
            return Err(AttrsViewError::invariant(
                "IA-4",
                format!("status is '{status}' but failure_* attrs are present"),
            ));
        }

        Ok(Self {
            status,
            status_reason,
            actor_type,
            messages_processed,
            created_at,
            last_handler,
            total_processing_time_us,
            flight_recorder,
            is_system,
            failure,
        })
    }

    /// Encode into an `Attrs` bag (AV-1 round-trip producer).
    pub fn to_attrs(&self) -> Attrs {
        let mut attrs = Attrs::new();
        attrs.set(STATUS, self.status.clone());
        if let Some(reason) = &self.status_reason {
            attrs.set(STATUS_REASON, reason.clone());
        }
        attrs.set(ACTOR_TYPE, self.actor_type.clone());
        attrs.set(MESSAGES_PROCESSED, self.messages_processed);
        if let Some(t) = self.created_at {
            attrs.set(CREATED_AT, t);
        }
        if let Some(handler) = &self.last_handler {
            attrs.set(LAST_HANDLER, handler.clone());
        }
        attrs.set(TOTAL_PROCESSING_TIME_US, self.total_processing_time_us);
        if let Some(fr) = &self.flight_recorder {
            attrs.set(FLIGHT_RECORDER, fr.clone());
        }
        attrs.set(IS_SYSTEM, self.is_system);
        if let Some(fi) = &self.failure {
            attrs.set(FAILURE_ERROR_MESSAGE, fi.error_message.clone());
            attrs.set(FAILURE_ROOT_CAUSE_ACTOR, fi.root_cause_actor.clone());
            if let Some(name) = &fi.root_cause_name {
                attrs.set(FAILURE_ROOT_CAUSE_NAME, name.clone());
            }
            attrs.set(FAILURE_OCCURRED_AT, fi.occurred_at);
            attrs.set(FAILURE_IS_PROPAGATED, fi.is_propagated);
        }
        attrs
    }
}
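
// Illustration only, not part of this crate: a reduced, self-contained
// sketch of the decode/encode contract that `ActorAttrsView` follows
// (required keys, IA-3, IA-4, and the AV-1 round trip). `ToyAttrs`,
// `ToyActorView`, `ToyViewError`, `decode`, and `encode` are
// hypothetical stand-ins; a plain string map replaces the typed `Attrs`
// bag, and a single failure field stands in for the `FAILURE_*` group.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an attrs bag: short key name -> value.
pub type ToyAttrs = HashMap<String, String>;

#[derive(Debug, PartialEq)]
pub enum ToyViewError {
    MissingKey(&'static str),
    Invariant(&'static str),
}

#[derive(Debug, PartialEq)]
pub struct ToyActorView {
    pub status: String,
    pub status_reason: Option<String>,
    pub actor_type: String,
    pub failure_error_message: Option<String>,
}

pub fn decode(attrs: &ToyAttrs) -> Result<ToyActorView, ToyViewError> {
    // Required keys: absence is an error (AV-2).
    let status = attrs
        .get("status")
        .cloned()
        .ok_or(ToyViewError::MissingKey("status"))?;
    let actor_type = attrs
        .get("actor_type")
        .cloned()
        .ok_or(ToyViewError::MissingKey("actor_type"))?;
    // Optional keys decode to `None` when absent.
    let status_reason = attrs.get("status_reason").cloned();
    let failure_error_message = attrs.get("failure_error_message").cloned();

    // IA-3 (one-sided): a reason is only allowed on terminal status.
    let is_terminal = status == "stopped" || status == "failed";
    if status_reason.is_some() && !is_terminal {
        return Err(ToyViewError::Invariant("IA-3"));
    }
    // IA-4: failure attrs are present iff status is "failed".
    if (status == "failed") != failure_error_message.is_some() {
        return Err(ToyViewError::Invariant("IA-4"));
    }
    // Keys not named above are never read, so they cannot change the
    // outcome (AV-3).
    Ok(ToyActorView { status, status_reason, actor_type, failure_error_message })
}

pub fn encode(view: &ToyActorView) -> ToyAttrs {
    let mut attrs = ToyAttrs::new();
    attrs.insert("status".to_string(), view.status.clone());
    attrs.insert("actor_type".to_string(), view.actor_type.clone());
    // Optional fields are written only when present, so that
    // decode(encode(v)) reproduces v (AV-1).
    if let Some(reason) = &view.status_reason {
        attrs.insert("status_reason".to_string(), reason.clone());
    }
    if let Some(msg) = &view.failure_error_message {
        attrs.insert("failure_error_message".to_string(), msg.clone());
    }
    attrs
}
```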

/// Internal introspection result. Carries attrs as a JSON string.
/// The mesh layer constructs the API-facing `NodePayload` (with
/// `properties`) from this via `derive_properties`.
///
/// This is the internal wire type -- it travels over handler ports
/// via `IntrospectMessage`. The presentation-layer `NodePayload`
/// (with `NodeProperties`) lives in `hyperactor_mesh::introspect`.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Named)]
pub struct IntrospectResult {
    /// Addr identifying this node.
    pub identity: IntrospectRef,
    /// JSON-serialized `Attrs` bag containing introspection attributes.
    pub attrs: String,
    /// Child references the client can follow to descend the tree.
    pub children: Vec<IntrospectRef>,
    /// Parent reference for upward navigation.
    pub parent: Option<IntrospectRef>,
    /// When this data was captured.
    pub as_of: SystemTime,
}
wirevalue::register_type!(IntrospectResult);
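
// Illustration only, not part of this crate: a minimal, self-contained
// sketch of how a client might descend the tree by following
// `children` references. `ToyNode` and `walk` are hypothetical; the
// `fetch` closure stands in for sending an introspection query and
// waiting for the reply, and plain strings stand in for
// `IntrospectRef`. The sketch assumes the references form a tree (no
// cycles).

```rust
/// Hypothetical stand-in for `IntrospectResult`: an identity plus the
/// references of its children.
#[derive(Debug, Clone)]
pub struct ToyNode {
    pub identity: String,
    pub children: Vec<String>,
}

/// Walk the tree depth-first from `root` and return the identities of
/// every node that resolved, in visit order.
pub fn walk<F>(root: &str, mut fetch: F) -> Vec<String>
where
    F: FnMut(&str) -> Option<ToyNode>,
{
    let mut visited = Vec::new();
    let mut stack = vec![root.to_string()];
    while let Some(next) = stack.pop() {
        // A reference may no longer resolve (for example, the node
        // went away between two queries); skip it and keep walking.
        if let Some(node) = fetch(&next) {
            visited.push(node.identity);
            // Push in reverse so children are visited in listed order.
            stack.extend(node.children.into_iter().rev());
        }
    }
    visited
}
```

// Each node is fetched only when the walk reaches it, so a caller can
// stop early or prune subtrees without querying the whole topology.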

/// Context for introspection query - what aspect of the actor to
/// describe.
///
/// Infrastructure actors (e.g., ProcAgent, HostAgent)
/// have dual nature: they manage entities (Proc, Host) while also
/// being actors themselves. IntrospectView allows callers to
/// specify which aspect to query.
// TODO(monarch-introspection): IntrospectView currently uses
// Entity/Actor naming. Consider renaming to runtime-neutral query
// modes (e.g. Published/Runtime) to avoid mesh-domain wording in
// hyperactor while preserving behavior and wire compatibility.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Named)]
pub enum IntrospectView {
    /// Return managed-entity properties (Proc, Host, etc.) for
    /// infrastructure actors.
    Entity,
    /// Return standard actor properties (status, messages_processed,
    /// flight_recorder).
    Actor,
}
wirevalue::register_type!(IntrospectView);

/// Introspection query sent to any actor.
///
/// `Query` asks the actor to describe itself. `QueryChild` asks the
/// actor to describe one of its non-addressable children -- an entity
/// that appears in the navigation tree but has no mailbox of its own
/// (e.g. a system proc owned by a host). The parent actor answers on
/// the child's behalf.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Named)]
pub enum IntrospectMessage {
    /// "Describe yourself."
    Query {
        /// View context - Entity or Actor.
        view: IntrospectView,
        /// Reply port receiving the actor's self-description.
        reply: OncePortRef<IntrospectResult>,
    },
    /// "Describe one of your children."
    QueryChild {
        /// Addr identifying the child to describe.
        child_ref: Addr,
        /// Reply port receiving the child's description.
        reply: OncePortRef<IntrospectResult>,
    },
}
wirevalue::register_type!(IntrospectMessage);

/// Structured tracing event from the actor-local flight recorder.
///
/// Deserialization target for the `FLIGHT_RECORDER` attrs JSON string.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RecordedEvent {
    /// ISO 8601 timestamp of the event.
    pub timestamp: String,
    /// Monotonic sequence number for ordering.
    #[serde(default)]
    pub seq: usize,
    /// Event level (INFO, DEBUG, etc.).
    pub level: String,
    /// Event target (module path).
    #[serde(default)]
    pub target: String,
    /// Event name.
    pub name: String,
    /// Event fields as JSON.
    pub fields: serde_json::Value,
}

/// Format a [`SystemTime`] as an ISO 8601 timestamp with millisecond
/// precision.
pub fn format_timestamp(time: SystemTime) -> String {
    humantime::format_rfc3339_millis(time).to_string()
}

/// Failure fields extracted from a supervision event.
struct FailureSnapshot {
    error_message: String,
    root_cause_actor: ActorAddr,
    root_cause_name: Option<String>,
    occurred_at: SystemTime,
    is_propagated: bool,
}

/// Pre-computed actor state for building the attrs JSON string.
/// Avoids redundant InstanceCell reads -- `live_actor_payload`
/// computes these once and passes them in.
struct ActorSnapshot {
    status_str: String,
    is_system: bool,
    last_handler: Option<String>,
    flight_recorder: Option<String>,
    failure: Option<FailureSnapshot>,
}

/// Build a JSON-serialized `Attrs` string from values already
/// computed by `live_actor_payload`. Reuses the same data -- no
/// redundant reads from `InstanceCell`.
///
/// Populates actor-runtime keys (STATUS, ACTOR_TYPE, etc.),
/// decomposes the status prefix protocol into STATUS + STATUS_REASON,
/// and decomposes failure fields into individual FAILURE_* attrs.
///
/// Starts from a fresh `Attrs` bag -- published attrs (node_type,
/// addr, etc.) are NOT included. This ensures the Actor view
/// produces actor-only data; the Entity view handles published
/// attrs separately.
fn build_actor_attrs(cell: &crate::InstanceCell, snap: &ActorSnapshot) -> String {
716    // Actor view builds a clean attrs bag with only actor-runtime
717    // keys. Published attrs (node_type, addr, etc.) belong to the
718    // Entity view — they are NOT merged here. This ensures that
    // e.g. a HostMeshAgent resolved via Actor view produces Actor
    // properties, not Host properties.
    let mut attrs = hyperactor_config::Attrs::new();

    // IA-3: status_reason present iff status carries a reason.
    if let Some(reason) = snap.status_str.strip_prefix("stopped:") {
        attrs.set(STATUS, "stopped".to_string());
        attrs.set(STATUS_REASON, reason.trim().to_string());
    } else if let Some(reason) = snap.status_str.strip_prefix("failed:") {
        attrs.set(STATUS, "failed".to_string());
        attrs.set(STATUS_REASON, reason.trim().to_string());
    } else {
        attrs.set(STATUS, snap.status_str.clone());
        // IA-3: no status_reason when the status carries no reason --
        // guaranteed by the fresh Attrs bag.
    }

    attrs.set(ACTOR_TYPE, cell.actor_type_name().to_string());
    attrs.set(MESSAGES_PROCESSED, cell.num_processed_messages());
    attrs.set(CREATED_AT, cell.created_at());
    attrs.set(TOTAL_PROCESSING_TIME_US, cell.total_processing_time_us());
    attrs.set(IS_SYSTEM, snap.is_system);

    if let Some(handler) = &snap.last_handler {
        attrs.set(LAST_HANDLER, handler.clone());
    }
    if let Some(fr) = &snap.flight_recorder {
        attrs.set(FLIGHT_RECORDER, fr.clone());
    }

    // IA-4 / FI-A1: failure attrs present iff status == "failed".
    if let Some(fi) = &snap.failure {
        attrs.set(FAILURE_ERROR_MESSAGE, fi.error_message.clone());
        attrs.set(FAILURE_ROOT_CAUSE_ACTOR, fi.root_cause_actor.clone());
        if let Some(name) = &fi.root_cause_name {
            attrs.set(FAILURE_ROOT_CAUSE_NAME, name.clone());
        }
        attrs.set(FAILURE_OCCURRED_AT, fi.occurred_at);
        attrs.set(FAILURE_IS_PROPAGATED, fi.is_propagated);
    }
    // IA-4: failure attrs absent when not failed -- guaranteed by
    // starting from a fresh Attrs bag (no stale keys possible).

    serde_json::to_string(&attrs).unwrap_or_else(|_| "{}".to_string())
}
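
// Sketch (not part of this module): the IA-3 branch above splits a
// rendered status such as "failed: disk full" into a bare status and
// an optional reason. The standalone helper below illustrates that
// rule, assuming only the "stopped:" and "failed:" prefixes carry a
// reason. `split_status` is a hypothetical name and does not exist in
// hyperactor.

```rust
/// Split a rendered status into `(status, optional reason)`.
/// Hypothetical helper for illustration only.
fn split_status(status_str: &str) -> (String, Option<String>) {
    for terminal in ["stopped", "failed"] {
        // Only a "<terminal>:" prefix carries a reason.
        if let Some(rest) = status_str
            .strip_prefix(terminal)
            .and_then(|s| s.strip_prefix(':'))
        {
            return (terminal.to_string(), Some(rest.trim().to_string()));
        }
    }
    // Everything else is reported verbatim, with no reason attached.
    (status_str.to_string(), None)
}
```

// A bare "stopped" (no colon) falls through to the last line, so a
// terminal status without a reason produces no reason entry at all.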

/// Build an [`IntrospectResult`] from live [`InstanceCell`] state.
///
/// Reads the current live status and last handler directly from
/// the cell. Used by the introspect task (which runs outside
/// the actor's message loop) and by `Instance::introspect_payload`.
pub fn live_actor_payload(cell: &InstanceCell) -> IntrospectResult {
    let actor_id = cell.actor_addr();
    let status = cell.status().borrow().clone();
    let last_handler = cell.last_message_handler();

    let children: Vec<IntrospectRef> = cell
        .child_actor_ids()
        .into_iter()
        .map(IntrospectRef::Actor)
        .collect();

    let events = cell.recording().tail();
    let flight_recorder_events: Vec<RecordedEvent> = events
        .into_iter()
        .map(|event| RecordedEvent {
            timestamp: format_timestamp(event.time),
            seq: event.seq,
            level: event.metadata.level().to_string(),
            target: event.metadata.target().to_string(),
            name: event.metadata.name().to_string(),
            fields: event.json_value(),
        })
        .collect();

    let flight_recorder = if flight_recorder_events.is_empty() {
        None
    } else {
        serde_json::to_string(&flight_recorder_events).ok()
    };

    let supervisor = cell
        .parent()
        .map(|p| IntrospectRef::Actor(p.actor_addr().clone()));

    // FI-3: failure_info is computed from the same status value as
    // actor_status, ensuring they agree on whether the actor failed.
    let failure = if status.is_failed() {
        cell.supervision_event().and_then(|event| {
            let root = event.actually_failing_actor()?;
            Some(FailureSnapshot {
                error_message: event.actor_status.to_string(),
                root_cause_actor: root.actor_id.clone(),
                root_cause_name: root.display_name.clone(),
                occurred_at: event.occurred_at,
                is_propagated: root.actor_id != actor_id.clone(),
            })
        })
    } else {
        None
    };

    let snap = ActorSnapshot {
        status_str: status.to_string(),
        is_system: cell.is_system(),
        last_handler: last_handler.map(|info| info.to_string()),
        flight_recorder,
        failure,
    };

    let attrs = build_actor_attrs(cell, &snap);

    IntrospectResult {
        identity: IntrospectRef::Actor(actor_id.clone()),
        attrs,
        children,
        parent: supervisor,
        as_of: SystemTime::now(),
    }
}
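
// Sketch (not part of this module): `live_actor_payload` marks a
// failure as propagated when the root-cause actor differs from the
// actor being inspected. The toy model below illustrates that
// comparison. `Status`, `Event`, `root_cause` and `is_propagated` are
// hypothetical stand-ins, and the recursive walk is an assumption
// about how `actually_failing_actor()` behaves; the source only shows
// the one-level parent/child case.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Stopped(String),
    Failed(String),
    // Failure caused by an unhandled supervision event from a child.
    FailedByChild(Box<Event>),
}

#[derive(Debug, Clone, PartialEq)]
struct Event {
    actor: String,
    status: Status,
}

impl Event {
    /// Walk nested events down to the actor that originally failed.
    fn root_cause(&self) -> &Event {
        match &self.status {
            Status::FailedByChild(child) => child.root_cause(),
            _ => self,
        }
    }
}

/// A failure is propagated when its root cause is some other actor.
fn is_propagated(own_actor: &str, event: &Event) -> bool {
    event.root_cause().actor != own_actor
}
```

// With a parent whose status wraps a stopped child's event, the root
// cause is the child and the parent's failure counts as propagated.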

/// Introspect task: runs on a dedicated tokio task per actor,
/// handling [`IntrospectMessage`] by reading [`InstanceCell`]
/// directly and replying through the owning [`Proc`](crate::Proc).
///
/// The actor's message loop never sees these messages.
///
/// # Invariants exercised
///
/// Exercises S1, S2, S4, S5, S6, S11 (see module doc).
pub(crate) async fn serve_introspect(
    cell: InstanceCell,
    mut receiver: crate::mailbox::PortReceiver<IntrospectMessage>,
) {
    use crate::actor::ActorStatus;
    use crate::mailbox::PortSender as _;

    // Watch for terminal status so we can break the reference cycle:
    // InstanceCellState → Ports → introspect sender → keeps receiver
    // open → this task holds InstanceCell → InstanceCellState.
    // Without this, a stopped actor's InstanceCellState is never
    // dropped and the actor lingers in the proc's instances map.
    let mut status = cell.status().clone();

    loop {
        let msg = tokio::select! {
            msg = receiver.recv() => {
                match msg {
                    Ok(msg) => msg,
                    Err(_) => {
                        // Channel closed. If the actor reached a
                        // terminal state, snapshot it before exiting
                        // so it remains queryable post-mortem.
                        if cell.status().borrow().is_terminal() {
                            let snapshot = live_actor_payload(&cell);
                            cell.store_terminated_snapshot(snapshot);
                        }
                        break;
                    }
                }
            }
            status_ref = status.wait_for(ActorStatus::is_terminal) => {
                // Explicitly drop the Ref before calling live_actor_payload.
                // wait_for returns a Ref that holds a read lock on the watch
                // channel's RwLock<ActorStatus>. tokio select! uses a match
                // internally, so the scrutinee (and its read lock) stays alive
                // through the arm body. live_actor_payload also calls borrow(),
                // and parking_lot's write-preferring RwLock blocks new readers
                // once a writer is queued -- causing a deadlock if
                // InstanceState::drop tries to write between wait_for and
                // live_actor_payload.
                drop(status_ref);
                let snapshot = live_actor_payload(&cell);
                cell.store_terminated_snapshot(snapshot);
                break;
            }
        };

        let result = match msg {
            IntrospectMessage::Query { view, reply } => {
                let payload = match view {
                    IntrospectView::Entity => match cell.published_attrs() {
                        Some(published) => {
                            let attrs_json =
                                serde_json::to_string(&published).unwrap_or_else(|_| "{}".into());
                            let children: Vec<IntrospectRef> =
                                published.get(CHILDREN).cloned().unwrap_or_default();
                            IntrospectResult {
                                identity: IntrospectRef::Actor(cell.actor_addr().clone()),
                                attrs: attrs_json,
                                children,
                                parent: cell
                                    .parent()
                                    .map(|p| IntrospectRef::Actor(p.actor_addr().clone())),
                                as_of: SystemTime::now(),
                            }
                        }
                        None => live_actor_payload(&cell),
                    },
                    IntrospectView::Actor => live_actor_payload(&cell),
                };
                cell.proc().serialize_and_send_once(
                    reply,
                    payload,
                    crate::mailbox::monitored_return_handle(),
                )
            }
            IntrospectMessage::QueryChild { child_ref, reply } => {
                let child_ref_: Addr = child_ref.clone();
                let payload = cell.query_child(&child_ref_).unwrap_or_else(|| {
                    let mut error_attrs = hyperactor_config::Attrs::new();
                    error_attrs.set(ERROR_CODE, "not_found".to_string());
                    error_attrs.set(
                        ERROR_MESSAGE,
                        format!("child {} not found (no callback registered)", child_ref),
                    );
                    // Use the queried child_ref as identity for the error node.
                    let identity = match &child_ref {
                        Addr::Proc(id) => IntrospectRef::Proc(id.clone()),
                        Addr::Actor(id) => IntrospectRef::Actor(id.clone()),
                        Addr::Port(id) => IntrospectRef::Actor(id.actor_addr()),
                    };
                    IntrospectResult {
                        identity,
                        attrs: serde_json::to_string(&error_attrs)
                            .unwrap_or_else(|_| "{}".to_string()),
                        children: Vec::new(),
                        parent: None,
                        as_of: SystemTime::now(),
                    }
                });
                cell.proc().serialize_and_send_once(
                    reply,
                    payload,
                    crate::mailbox::monitored_return_handle(),
                )
            }
        };
        if let Err(e) = result {
            tracing::debug!("introspect reply failed: {e}");
        }
    }
    tracing::debug!(
        actor_id = %cell.actor_addr(),
        "introspect task exiting"
    );
}
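
// Sketch (not part of this module): the `drop(status_ref)` in
// `serve_introspect` releases a read guard before more work is done
// on the same lock. The example below shows only the underlying
// rule, using `std::sync::RwLock` as a stand-in for the watch
// channel's lock: a writer cannot get in while a read guard is
// alive. `guard_blocks_writer` is a hypothetical name, and the
// single-threaded `try_write` probe is a simplification of the
// queued-writer deadlock described in the comment.

```rust
use std::sync::RwLock;

/// Returns `(writer_blocked_while_guard_held, writer_ok_after_drop)`.
fn guard_blocks_writer() -> (bool, bool) {
    let status = RwLock::new("running".to_string());

    // A live read guard, standing in for the `Ref` from `wait_for`.
    let guard = status.read().unwrap();
    // While the guard is alive, a writer cannot acquire the lock.
    let blocked = status.try_write().is_err();

    // Release the read lock before doing anything that might have to
    // wait behind a writer.
    drop(guard);
    let writable = status.try_write().is_ok();

    (blocked, writable)
}
```

// Dropping the guard explicitly, rather than relying on scope exit,
// keeps the release point visible at the call site.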

#[cfg(test)]
mod tests {
    use super::*;
    use crate::ActorAddr;
    use crate::ProcAddr;
    use crate::actor::ActorErrorKind;
    use crate::actor::ActorStatus;
    use crate::channel::ChannelAddr;
    use crate::supervision::ActorSupervisionEvent;

    /// Exercises IK-1 (see module doc).
    #[test]
    fn test_introspect_keys_are_tagged() {
        let cases = vec![
            ("status", STATUS.attrs()),
            ("status_reason", STATUS_REASON.attrs()),
            ("actor_type", ACTOR_TYPE.attrs()),
            ("messages_processed", MESSAGES_PROCESSED.attrs()),
            ("created_at", CREATED_AT.attrs()),
            ("last_handler", LAST_HANDLER.attrs()),
            ("total_processing_time_us", TOTAL_PROCESSING_TIME_US.attrs()),
            ("flight_recorder", FLIGHT_RECORDER.attrs()),
            ("is_system", IS_SYSTEM.attrs()),
            ("children", CHILDREN.attrs()),
            ("error_code", ERROR_CODE.attrs()),
            ("error_message", ERROR_MESSAGE.attrs()),
            ("failure_error_message", FAILURE_ERROR_MESSAGE.attrs()),
            ("failure_root_cause_actor", FAILURE_ROOT_CAUSE_ACTOR.attrs()),
            ("failure_root_cause_name", FAILURE_ROOT_CAUSE_NAME.attrs()),
            ("failure_occurred_at", FAILURE_OCCURRED_AT.attrs()),
            ("failure_is_propagated", FAILURE_IS_PROPAGATED.attrs()),
        ];

        for (expected_name, meta) in &cases {
            // IK-1: see module doc.
            let introspect = meta
                .get(INTROSPECT)
                .unwrap_or_else(|| panic!("{expected_name}: missing INTROSPECT meta-attr"));
            assert_eq!(
                introspect.name, *expected_name,
                "short name mismatch for {expected_name}"
            );
            assert!(
                !introspect.desc.is_empty(),
                "{expected_name}: desc should not be empty"
            );
        }

        // Exhaustiveness: verify cases covers all INTROSPECT-tagged
        // keys declared in this module.
        use hyperactor_config::attrs::AttrKeyInfo;
        let registry_count = inventory::iter::<AttrKeyInfo>()
            .filter(|info| {
                info.name.starts_with("hyperactor::introspect::")
                    && info.meta.get(INTROSPECT).is_some()
            })
            .count();
        assert_eq!(
            cases.len(),
            registry_count,
            "test must cover all INTROSPECT-tagged keys in this module"
        );
    }

    /// Exercises IK-2 (see module doc).
    #[test]
    fn test_introspect_short_names_are_globally_unique() {
        use hyperactor_config::attrs::AttrKeyInfo;

        let mut seen = std::collections::HashMap::new();
        for info in inventory::iter::<AttrKeyInfo>() {
            let Some(introspect) = info.meta.get(INTROSPECT) else {
                continue;
            };
            // Metadata quality: every tagged key must have
            // non-empty name and desc.
            assert!(
                !introspect.name.is_empty(),
                "INTROSPECT key {:?} has empty name",
                info.name
            );
            assert!(
                !introspect.desc.is_empty(),
                "INTROSPECT key {:?} has empty desc",
                info.name
            );
            if let Some(prev_fq) = seen.insert(introspect.name.clone(), info.name) {
                panic!(
                    "IK-2 violation: duplicate short name {:?} declared by both {:?} and {:?}",
                    introspect.name, prev_fq, info.name
                );
            }
        }
    }

    // IA-1 tests require spawning actors and live in actor.rs
    // where #[hyperactor::export] and test infrastructure are
    // available. IA-3 and IA-4 are tested below at the view level.

    fn running_actor_attrs() -> Attrs {
        let mut attrs = Attrs::new();
        attrs.set(STATUS, "running".to_string());
        attrs.set(ACTOR_TYPE, "MyActor".to_string());
        attrs.set(MESSAGES_PROCESSED, 42u64);
        attrs.set(CREATED_AT, SystemTime::UNIX_EPOCH);
        attrs.set(IS_SYSTEM, false);
        attrs
    }

    fn test_actor_id(proc_name: &str, actor_name: &str) -> ActorAddr {
        ProcAddr::singleton(ChannelAddr::Local(0), proc_name).actor_addr(actor_name)
    }

    fn failed_actor_attrs() -> Attrs {
        let mut attrs = running_actor_attrs();
        attrs.set(STATUS, "failed".to_string());
        attrs.set(STATUS_REASON, "something broke".to_string());
        attrs.set(FAILURE_ERROR_MESSAGE, "boom".to_string());
        attrs.set(FAILURE_ROOT_CAUSE_ACTOR, test_actor_id("proc", "other"));
        attrs.set(FAILURE_ROOT_CAUSE_NAME, "OtherActor".to_string());
        attrs.set(FAILURE_OCCURRED_AT, SystemTime::UNIX_EPOCH);
        attrs.set(FAILURE_IS_PROPAGATED, true);
        attrs
    }

    /// AV-1: from_attrs(to_attrs(v)) == v.
    #[test]
    fn test_actor_view_round_trip_running() {
        let view = ActorAttrsView::from_attrs(&running_actor_attrs()).unwrap();
        assert_eq!(view.status, "running");
        assert_eq!(view.actor_type, "MyActor");
        assert_eq!(view.messages_processed, 42);
        assert!(view.failure.is_none());

        let round_tripped = ActorAttrsView::from_attrs(&view.to_attrs()).unwrap();
        assert_eq!(round_tripped, view);
    }

    /// AV-1: the round trip also holds for a failed actor.
    #[test]
    fn test_actor_view_round_trip_failed() {
        let view = ActorAttrsView::from_attrs(&failed_actor_attrs()).unwrap();
        assert_eq!(view.status, "failed");
        let fi = view.failure.as_ref().unwrap();
        assert_eq!(fi.error_message, "boom");
        assert!(fi.is_propagated);

        let round_tripped = ActorAttrsView::from_attrs(&view.to_attrs()).unwrap();
        assert_eq!(round_tripped, view);
    }

    /// AV-2: missing required key rejected.
    #[test]
    fn test_actor_view_missing_status() {
        let mut attrs = Attrs::new();
        attrs.set(ACTOR_TYPE, "X".to_string());
        let err = ActorAttrsView::from_attrs(&attrs).unwrap_err();
        assert_eq!(err, AttrsViewError::MissingKey { key: "status" });
    }

    /// AV-2: missing `actor_type` is rejected.
    #[test]
    fn test_actor_view_missing_actor_type() {
        let mut attrs = Attrs::new();
        attrs.set(STATUS, "running".to_string());
        let err = ActorAttrsView::from_attrs(&attrs).unwrap_err();
        assert_eq!(err, AttrsViewError::MissingKey { key: "actor_type" });
    }

    /// IA-3: a `status_reason` on a non-terminal status is rejected.
    #[test]
    fn test_actor_view_ia3_rejects_reason_on_running() {
        let mut attrs = running_actor_attrs();
        attrs.set(STATUS_REASON, "should not be here".to_string());
        let err = ActorAttrsView::from_attrs(&attrs).unwrap_err();
        assert!(matches!(
            err,
            AttrsViewError::InvariantViolation { label: "IA-3", .. }
        ));
    }

    /// IA-3: a terminal status may omit `status_reason`.
    #[test]
    fn test_actor_view_ia3_allows_terminal_without_reason() {
        let mut attrs = running_actor_attrs();
        attrs.set(STATUS, "stopped".to_string());
        // No status_reason; should be fine.
        let view = ActorAttrsView::from_attrs(&attrs).unwrap();
        assert_eq!(view.status, "stopped");
        assert!(view.status_reason.is_none());
    }

    /// IA-4: a `failed` status without any `failure_*` keys is rejected.
    #[test]
    fn test_actor_view_ia4_rejects_failed_without_failure_attrs() {
        let mut attrs = running_actor_attrs();
        attrs.set(STATUS, "failed".to_string());
        // No failure_* keys.
        let err = ActorAttrsView::from_attrs(&attrs).unwrap_err();
        assert!(matches!(
            err,
            AttrsViewError::InvariantViolation { label: "IA-4", .. }
        ));
    }

    /// IA-4: `failure_*` keys on a non-failed status are rejected.
    #[test]
    fn test_actor_view_ia4_rejects_failure_attrs_on_running() {
        let mut attrs = running_actor_attrs();
        attrs.set(FAILURE_ERROR_MESSAGE, "boom".to_string());
        attrs.set(FAILURE_ROOT_CAUSE_ACTOR, test_actor_id("proc", "x"));
        attrs.set(FAILURE_OCCURRED_AT, SystemTime::UNIX_EPOCH);
        let err = ActorAttrsView::from_attrs(&attrs).unwrap_err();
        assert!(matches!(
            err,
            AttrsViewError::InvariantViolation { label: "IA-4", .. }
        ));
    }

    /// AV-2: partial failure set → missing key.
    #[test]
    fn test_actor_view_partial_failure_attrs_rejected() {
        let mut attrs = running_actor_attrs();
        attrs.set(STATUS, "failed".to_string());
        // Only one of the three required failure keys.
        attrs.set(FAILURE_ERROR_MESSAGE, "boom".to_string());
        let err = ActorAttrsView::from_attrs(&attrs).unwrap_err();
        assert_eq!(
            err,
            AttrsViewError::MissingKey {
                key: "failure_root_cause_actor"
            }
        );
    }

    /// Exercises FI-7 and FI-8 (see module doc): when a parent fails
    /// due to an unhandled Stopped child event, structured failure
    /// attrs must name the stopped child as
    /// `failure_root_cause_actor` (FI-7) and report
    /// `failure_is_propagated == true` (FI-8).
    ///
    /// Partially white-box: re-creates `FailureSnapshot` construction
    /// from `live_actor_payload` because that function requires an
    /// `InstanceCell`. This test will fail if
    /// `actually_failing_actor()` regresses, because that helper is
    /// the shared decision point for root-cause attribution. See
    /// `test_propagated_failure_info` in `proc.rs` for end-to-end
    /// integration coverage.
    #[test]
    fn test_fi7_fi8_propagated_stopped_child() {
        let proc_id = ProcAddr::singleton(ChannelAddr::Local(0), "test_proc");
        let child_id = proc_id.actor_addr("proc_agent");
        let parent_id = proc_id.actor_addr("mesh_actor");

        let child_event = ActorSupervisionEvent::new(
            child_id.clone(),
            Some("proc_agent".into()),
            ActorStatus::Stopped("host died".into()),
            None,
        );
        let parent_event = ActorSupervisionEvent::new(
            parent_id.clone(),
            Some("mesh_actor".into()),
            ActorStatus::Failed(ActorErrorKind::UnhandledSupervisionEvent(Box::new(
                child_event,
            ))),
            None,
        );

        // -- reproduce FailureSnapshot construction (same logic as
        // live_actor_payload) --
        let root = parent_event
            .actually_failing_actor()
            .expect("parent_event is a failure");
        let snap = FailureSnapshot {
            error_message: parent_event.actor_status.to_string(),
            root_cause_actor: root.actor_id.clone(),
            root_cause_name: root.display_name.clone(),
            occurred_at: parent_event.occurred_at,
            is_propagated: root.actor_id != parent_id,
        };

        // FI-7: failure_root_cause_actor is the stopped child.
        assert_eq!(snap.root_cause_actor, child_id);
        // FI-8: failure_is_propagated is true.
        assert!(snap.is_propagated);
        // root_cause_name pinned before round-trip.
        assert_eq!(snap.root_cause_name.as_deref(), Some("proc_agent"));

        // -- attrs round-trip through ActorAttrsView --
        let mut attrs = failed_actor_attrs();
        attrs.set(FAILURE_ERROR_MESSAGE, snap.error_message);
        attrs.set(FAILURE_ROOT_CAUSE_ACTOR, snap.root_cause_actor.clone());
        if let Some(name) = &snap.root_cause_name {
            attrs.set(FAILURE_ROOT_CAUSE_NAME, name.clone());
        }
        attrs.set(FAILURE_OCCURRED_AT, snap.occurred_at);
        attrs.set(FAILURE_IS_PROPAGATED, snap.is_propagated);

        let view = ActorAttrsView::from_attrs(&attrs).unwrap();
        assert_eq!(view.status, "failed");
        let fi = view.failure.as_ref().expect("failure_info must be present");
        // FI-7: failure_root_cause_actor survives attrs round-trip.
        assert_eq!(fi.root_cause_actor, child_id);
        // FI-8: failure_is_propagated survives attrs round-trip.
        assert!(fi.is_propagated);
        // root_cause_name also survives.
        assert_eq!(fi.root_cause_name.as_deref(), Some("proc_agent"));
    }
}
1271}