torchcomms#

torchcomms is an experimental, lightweight communication API for PyTorch Distributed (PTD). It provides a simplified, object-oriented interface for distributed collective operations and ships with multiple backends out of the box, including Meta’s production-tested NCCLX backend, which powers Meta’s generative AI services.


Browse the Documentation and Examples#

🚀 Quick Start

New to torchcomms? Start here to learn how to install and use torchcomms for distributed communication.

Getting Started
📚 API Reference

Complete API documentation for all torchcomms classes, functions, and backends.

API Reference
💻 Code Examples

Explore practical examples showing how to use torchcomms in real-world distributed applications.

https://github.com/meta-pytorch/torchcomms/tree/main/comms/torchcomms/examples
🐛 Report Issues

Found a bug or have a feature request? Let us know on GitHub.

https://github.com/meta-pytorch/torchcomms/issues

Why torchcomms?#

torchcomms addresses several key challenges in distributed PyTorch training:

  • Simplified API: Clean, object-oriented interface that abstracts away low-level communication details

  • Backend Flexibility: Easily switch between different communication backends (NCCLX, NCCL, Gloo, RCCL) without changing your code (see the sketch after this list)

  • Production-Ready: The NCCLX backend is battle-tested in Meta’s production environments, powering large-scale AI workloads

  • Type Safety: Full type hints and validation for a better development experience

  • Performance: Optimized implementations with support for GPU-accelerated communication
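
As a minimal sketch of what backend-agnostic usage can look like, the snippet below picks a backend by name and runs the same collective regardless of which one was chosen. The new_comm entry point, all_reduce method, and ReduceOp enum are modeled on the project’s examples but should be treated as assumptions, and the TORCHCOMMS_BACKEND environment variable is purely illustrative; see the API Reference for exact signatures.

    # Minimal sketch of backend-agnostic usage. The new_comm entry point,
    # all_reduce method, and ReduceOp enum are assumptions modeled on the
    # project's examples -- verify the names against the API Reference.
    import os

    import torch
    import torchcomms

    # Choose the backend by name; everything below stays the same.
    # (Use a CPU device with the "gloo" backend.)
    backend = os.environ.get("TORCHCOMMS_BACKEND", "ncclx")  # hypothetical env var
    device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", "0")))

    comm = torchcomms.new_comm(backend, device=device)

    # The same collective call works regardless of the backend chosen above.
    t = torch.ones(8, device=device)
    comm.all_reduce(t, torchcomms.ReduceOp.SUM)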

Key Features#

Multiple Backends#

torchcomms supports several communication backends out of the box:

  • NCCLX: Meta’s enhanced NCCL implementation with additional optimizations

  • NCCL: NVIDIA’s Collective Communications Library for multi-GPU communication

  • Gloo: Facebook’s collective communications library for both CPU and GPU

  • RCCL: AMD’s ROCm Communication Collectives Library for AMD GPUs

Comprehensive Collective Operations#

All standard distributed operations are supported:

  • AllReduce, ReduceScatter, AllGather

  • Broadcast, Reduce, Scatter, Gather

  • Send, Recv for point-to-point communication

  • Support for both synchronous and asynchronous operations (sketched below)
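
The synchronous/asynchronous distinction can be sketched as follows, assuming asynchronous collectives take an async_op flag and return a work handle with a wait() method, in the style of torch.distributed. Treat these names as assumptions and confirm them in the API Reference.

    # Sketch: synchronous vs. asynchronous collectives. The async_op flag and
    # the wait()-able work handle mirror torch.distributed conventions and are
    # assumptions here, not confirmed torchcomms API.
    import torch
    import torchcomms

    device = torch.device("cuda", 0)
    comm = torchcomms.new_comm("ncclx", device=device)

    x = torch.randn(1024, device=device)
    other = torch.randn(1024, device=device)

    # Synchronous: the collective completes before the call returns.
    comm.all_reduce(x, torchcomms.ReduceOp.SUM)

    # Asynchronous: launch, overlap independent compute, then wait.
    work = comm.all_reduce(x, torchcomms.ReduceOp.SUM, async_op=True)
    z = other.sum()  # independent work that can overlap with communication
    work.wait()      # x is safe to read once wait() returns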

Flexible Group Management#

Create and manage process groups with ease:

  • Initialize groups with different backends

  • Support for sub-groups and hierarchical communication patterns (see the sketch after this list)

  • Automatic resource management and cleanup
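
A sketch of hierarchical grouping follows: a world-wide communicator is split into one sub-communicator per node, for example as the intra-node stage of a hierarchical all-reduce. The split method and the rank accessor are assumptions made for illustration; the actual group-management calls are documented in the API Reference.

    # Sketch: splitting a world communicator into per-node sub-groups. The
    # split(ranks) method and the rank accessor are assumptions for
    # illustration; check the API Reference for the real calls.
    import os

    import torch
    import torchcomms

    device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", "0")))
    world = torchcomms.new_comm("ncclx", device=device)

    gpus_per_node = 8
    node = world.rank // gpus_per_node  # hypothetical rank accessor

    # Ranks on the same node form one sub-communicator, e.g. for the
    # intra-node stage of a hierarchical all-reduce.
    local_ranks = [node * gpus_per_node + i for i in range(gpus_per_node)]
    local_comm = world.split(local_ranks)  # hypothetical split call

    t = torch.ones(4, device=device)
    local_comm.all_reduce(t, torchcomms.ReduceOp.SUM)  # intra-node only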