# torchft
This repository implements primitives and end-to-end solutions for per-step fault tolerance, so training can continue when errors occur instead of the entire training job being interrupted.
Getting started? See Install and Usage in the README.
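The per-step fault-tolerance idea above can be illustrated with a toy simulation: each step first determines which workers are alive, then averages gradients only across that group, so a crash shrinks the group for the next step rather than aborting the job. This is a minimal sketch in plain Python; all names and structures here are illustrative assumptions, not the torchft API.

```python
# Toy sketch of per-step fault tolerance (illustrative only, NOT torchft's API):
# every step forms a quorum of live workers, averages gradients across it,
# and applies a plain SGD update on the survivors.

def form_quorum(workers):
    """Return the workers that are still alive for this step."""
    return [w for w in workers if w["alive"]]

def train_step(workers, grads):
    """Average gradients across the current quorum and apply an SGD update."""
    quorum = form_quorum(workers)
    if not quorum:
        raise RuntimeError("no live workers: cannot make progress")
    avg = sum(grads[w["rank"]] for w in quorum) / len(quorum)
    for w in quorum:
        w["weight"] -= 0.1 * avg  # lr = 0.1, applied only on survivors
    return len(quorum)

workers = [{"rank": r, "alive": True, "weight": 1.0} for r in range(3)]
grads = {0: 3.0, 1: 6.0, 2: 9.0}

# Step 1: all three workers participate.
assert train_step(workers, grads) == 3

# Worker 2 crashes; step 2 continues with a quorum of 2 instead of aborting.
workers[2]["alive"] = False
assert train_step(workers, grads) == 2
```

The point of the sketch is the per-step quorum check: membership is re-evaluated every step, so a failure only affects the steps after it rather than requiring a full job restart.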
## API Reference

- Process Groups
  - ErrorSwallowingProcessGroupWrapper
  - FakeProcessGroupWrapper
  - ManagedProcessGroup
  - ProcessGroup
  - ProcessGroupBaby
  - ProcessGroupBabyGloo
  - ProcessGroupBabyNCCL
  - ProcessGroupBabyXCCL
  - ProcessGroupDummy
  - ProcessGroupGloo
  - ProcessGroupNCCL
  - ProcessGroupWrapper
  - ProcessGroupXCCL
  - create_store_client()
  - trigger_nccl_fr_trace_through_pipe()
- Manager
  - ExceptionWithTraceback
  - Manager
  - WorldSizeMode
  - extract_trailing_digits()
  - get_timeout()
- Optimizers
  - OptimizerWrapper
- Distributed Data Parallel
  - DistributedDataParallel
  - PureDistributedDataParallel
- LocalSGD
  - DiLoCo
  - LocalSGD
  - extract_local_tensor()
- Data
  - DistributedSampler
- Checkpointing
  - CheckpointTransport
  - HTTPTransport
- Parameter Servers
  - ParameterServer
- Coordination (Low Level API)
  - LighthouseClient
  - LighthouseServer
  - ManagerClient
  - ManagerServer
  - Quorum
  - QuorumMember
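As a rough mental model for the coordination layer listed above (a lighthouse server tracking members and computing a quorum), here is a hedged, self-contained sketch: a coordinator collects heartbeats and declares the recently-seen members the quorum, bumping an epoch whenever membership changes. The class names and fields below are simplified assumptions for illustration, not torchft's actual protocol or API.

```python
from dataclasses import dataclass

# Simplified model of quorum formation (illustrative, NOT torchft's protocol):
# a lighthouse-style coordinator collects heartbeats and treats members seen
# within `timeout` seconds as live, bumping quorum_id on membership changes.

@dataclass
class Member:
    replica_id: str
    last_heartbeat: float

class Lighthouse:
    def __init__(self, timeout: float = 1.0):
        self.timeout = timeout
        self.members: dict[str, Member] = {}
        self.quorum_id = 0
        self._last_quorum: frozenset = frozenset()

    def heartbeat(self, replica_id: str, now: float) -> None:
        """Record that a replica checked in at time `now`."""
        self.members[replica_id] = Member(replica_id, now)

    def quorum(self, now: float) -> tuple[int, list[str]]:
        """Return (quorum_id, sorted live members); bump id if membership changed."""
        live = frozenset(
            m.replica_id
            for m in self.members.values()
            if now - m.last_heartbeat <= self.timeout
        )
        if live != self._last_quorum:
            self.quorum_id += 1
            self._last_quorum = live
        return self.quorum_id, sorted(live)

lh = Lighthouse(timeout=1.0)
lh.heartbeat("replica_0", now=0.0)
lh.heartbeat("replica_1", now=0.0)
qid, members = lh.quorum(now=0.5)    # both replicas heartbeated recently
assert members == ["replica_0", "replica_1"]

lh.heartbeat("replica_0", now=2.0)   # replica_1 stops heartbeating
qid2, members2 = lh.quorum(now=2.5)  # quorum shrinks, epoch bumps
assert qid2 == qid + 1 and members2 == ["replica_0"]
```

The epoch bump on every membership change is the key design idea: collectives tagged with a stale quorum id can be detected and retried against the new membership instead of hanging on a dead peer.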