Debugging Python Actors with pdb in Monarch#
Monarch supports pdb debugging for Python actor meshes. This guide shows
how to debug distributed actors with Python’s built-in debugger: setting breakpoints,
stepping through code, and running interactive debugging sessions. It covers:
Setting up actors with breakpoints for debugging
Accessing the Monarch debugger and listing active breakpoints
Attaching to specific actors and using pdb commands
Broadcasting commands to multiple actors
Post-mortem debugging capabilities
Defining an Actor with Breakpoints#
To debug an actor, define your Python actor as usual and call Python’s built-in breakpoint()
in the endpoint that you want to debug.
Note: There is a known bug where breakpoints will not work if they are defined inside actors spawned on a proc mesh that was allocated from inside a different proc mesh. This will be resolved in the near future.
from monarch.actor import Actor, current_rank, endpoint, this_host


def _bad_rank():
    raise ValueError("bad rank")


def _debugee_actor_internal(rank):
    if rank % 4 == 0:
        breakpoint()  # noqa
        rank += 1
        return rank
    elif rank % 4 == 1:
        breakpoint()  # noqa
        rank += 2
        return rank
    elif rank % 4 == 2:
        breakpoint()  # noqa
        rank += 3
        _bad_rank()
    elif rank % 4 == 3:
        breakpoint()  # noqa
        rank += 4
        return rank


class DebugeeActor(Actor):
    @endpoint
    async def to_debug(self):
        rank = current_rank().rank
        return _debugee_actor_internal(rank)
Setting Up the Debug Session#
The Monarch debug server listens for breakpoints at a TCP address determined by these environment variables:
MONARCH_DEBUG_SERVER_HOST (default localhost; must be on the root client host where you run your Monarch program)
MONARCH_DEBUG_SERVER_PORT (default 27000)
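As a rough illustration of the defaults listed above, the lookup can be sketched in plain Python. This is not Monarch's implementation; the helper name debug_server_address is hypothetical, and only the two environment variable names and their defaults come from this guide.

```python
import os

# Defaults as documented in this guide.
DEFAULT_HOST = "localhost"
DEFAULT_PORT = 27000


def debug_server_address(env=None):
    """Return (host, port) the way the defaults above describe.

    Hypothetical helper for illustration only; Monarch resolves
    the address internally.
    """
    env = os.environ if env is None else env
    host = env.get("MONARCH_DEBUG_SERVER_HOST", DEFAULT_HOST)
    port = int(env.get("MONARCH_DEBUG_SERVER_PORT", DEFAULT_PORT))
    return host, port


if __name__ == "__main__":
    print(debug_server_address())
```

Passing a dictionary instead of reading os.environ directly keeps the sketch easy to try with different values.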
Simply run your monarch program as usual with the desired values for host and port.
if __name__ == "__main__":
    # Create a mesh with 4 "hosts" and 4 gpus per "host"
    process_mesh = this_host().spawn_procs(per_host={"host": 4, "gpu": 4})

    # Spawn the actor you want to debug on the mesh
    debugee_mesh = process_mesh.spawn("debugee", DebugeeActor)

    # Call the endpoint you want to debug
    print(debugee_mesh.to_debug.call().get())
Using the Monarch Debugger#
To access the debugger, from a separate terminal with conda activated and monarch installed, run:
monarch debug
There are two optional flags:
--host: the value of MONARCH_DEBUG_SERVER_HOST in your program (same default as before)
--port: the value of MONARCH_DEBUG_SERVER_PORT in your program (same default as before)
You should then see this:
************************ MONARCH DEBUGGER ************************
Enter 'help' for a list of commands.
Enter 'list' to show all active breakpoints.
monarch_dbg>
Enter list, and you should see a table showing all actors in your system
that are currently stopped at a breakpoint, along with basic information
about each breakpoint including actor name, rank, coordinates, hostname,
function, and line number.
Attaching to a Specific Actor#
From the monarch_dbg> prompt, you can dive into a specific actor/breakpoint
using the attach command, specifying the name and rank of the actor:
monarch_dbg> attach debugee 13
Attached to debug session for rank 13 (your.host.com)
> /path/to/debugging.py(16)to_debug()
-> rank = _debugee_actor_internal(rank)
(Pdb)
From here, you can send arbitrary pdb commands to the attached actor:
(Pdb) s
--Call--
> /path/to/debugging.py(20)_debugee_actor_internal()
-> def _debugee_actor_internal(rank):
(Pdb) n
> /path/to/debugging.py(21)_debugee_actor_internal()
-> if rank % 4 == 0:
(Pdb) rank
13
The debugger will automatically detach when the endpoint completes, but you
can detach early using the detach command.
Casting Commands to Multiple Actors#
You can send pdb commands to multiple actors on the same actor mesh at once
using the cast command. The usage is:
monarch_dbg> cast <actor_name> ranks(<ranks>) <pdb_command>
There are several ways to specify ranks:
ranks(<rank>): sends a command to a single rank without attaching
ranks(<r1>,<r2>,<r3>): sends to a comma-separated list of ranks
ranks(<r_start>:<r_stop>:<r_step>): sends to a range of ranks, like Python slice syntax
ranks(<dim1>=<...>, <dim2>=<...>): sends to the specified coordinates
Example commands:
monarch_dbg> cast debugee ranks(0,1) n # casts `n` to ranks 0 and 1
monarch_dbg> cast debugee ranks(2:7:2) s # casts `s` to ranks 2, 4 and 6
monarch_dbg> cast debugee ranks(host=2:4, gpu=1:3) c # casts `c` to ranks where host dimension is 2 or 3 and gpu dimension is 1 or 2
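To make the slice and coordinate forms concrete, here is a small Python sketch of how such selections expand on the 4 x 4 mesh from the example program. It is not the debugger's parser. The helper names are hypothetical, and two things are assumptions rather than documented behavior: that ranks are numbered in row-major order over (host, gpu), and that a dimension left out of the selection matches every index.

```python
from itertools import product

# Mesh shape taken from the example program: 4 "hosts" x 4 "gpus".
MESH_SHAPE = {"host": 4, "gpu": 4}


def ranks_from_slice(start, stop, step, total):
    """Ranks picked by ranks(<r_start>:<r_stop>:<r_step>) out of `total` ranks."""
    return list(range(total)[start:stop:step])


def ranks_from_coords(shape, **dims):
    """Ranks picked by ranks(<dim1>=<...>, ...).

    Assumes row-major rank numbering and that unspecified
    dimensions select all indices (both are assumptions).
    """
    names = list(shape)
    choices = []
    for name in names:
        selection = dims.get(name, slice(None))
        if isinstance(selection, slice):
            choices.append(list(range(shape[name])[selection]))
        else:
            choices.append([selection])

    selected = []
    for coord in product(*choices):
        rank = 0
        for name, index in zip(names, coord):
            # Row-major flattening: earlier dimensions vary slowest.
            rank = rank * shape[name] + index
        selected.append(rank)
    return selected


if __name__ == "__main__":
    print(ranks_from_slice(2, 7, 2, total=16))
    print(ranks_from_coords(MESH_SHAPE, host=slice(2, 4), gpu=slice(1, 3)))
```

Under those assumptions, the coordinate selection with host in 2:4 and gpu in 1:3 covers four ranks, one for each (host, gpu) pair.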
Post-Mortem Debugging#
If an actor endpoint raises an error after a breakpoint has been hit, execution stops where the error was raised so that you can debug post-mortem. This behavior is currently enabled by default, but it applies only to endpoints that have already hit a breakpoint.
In the example above, every rank where rank % 4 == 2 (rank 2, for instance) calls
the _bad_rank() function, which raises a ValueError, allowing you to inspect
the state at the point of failure.
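To see which ranks end up in a post-mortem session, the branching in _debugee_actor_internal can be replayed without Monarch or any breakpoints. The sketch below is a breakpoint-free copy of that logic with hypothetical helper names, run over the 16 ranks of the example mesh.

```python
def bad_rank():
    raise ValueError("bad rank")


def branch_outcome(rank):
    """Same branching as _debugee_actor_internal, minus the breakpoint() calls."""
    if rank % 4 == 0:
        return rank + 1
    elif rank % 4 == 1:
        return rank + 2
    elif rank % 4 == 2:
        rank += 3
        bad_rank()  # always raises, so this branch never returns
    else:
        return rank + 4


def outcome_per_rank(num_ranks=16):
    """Map each rank to its return value, or to the error message it raises."""
    outcomes = {}
    for rank in range(num_ranks):
        try:
            outcomes[rank] = branch_outcome(rank)
        except ValueError as exc:
            outcomes[rank] = f"ValueError: {exc}"
    return outcomes


if __name__ == "__main__":
    for rank, outcome in outcome_per_rank().items():
        print(rank, outcome)
```

Ranks 2, 6, 10, and 14 take the failing branch; all other ranks return an incremented rank.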
Continuing Execution#
To allow execution to continue, from the monarch_dbg> prompt,
simply enter c or continue. This will clear any non-hardcoded
breakpoints and cast the “continue” pdb command to all ranks
currently stopped at a breakpoint.