:github_url: https://github.com/meta-pytorch/torchcomms torchcomms ========== **torchcomms** is an experimental, lightweight communication API for PyTorch Distributed (PTD). It provides a simplified, object-oriented interface for distributed collective operations with multiple out-of-the-box backends, including Meta's production-tested **NCCLX** backend that powers all generative AI services. .. raw:: html
---- Browse the documentation and Examples ------------------------------------- .. grid:: 2 :gutter: 3 .. grid-item-card:: 🚀 Quick Start :link: getting_started :link-type: doc New to torchcomms? Start here to learn how to install and use torchcomms for distributed communication. .. grid-item-card:: 📚 API Reference :link: api :link-type: doc Complete API documentation for all torchcomms classes, functions, and backends. .. grid-item-card:: 🔗 Hooks :link: hooks :link-type: doc Learn how to use hooks to monitor and debug collective operations with FlightRecorderHook. .. grid-item-card:: 💻 Code Examples :link: https://github.com/meta-pytorch/torchcomms/tree/main/comms/torchcomms/examples Explore practical examples showing how to use torchcomms in real-world distributed applications. .. grid-item-card:: 🐛 Report Issues :link: https://github.com/meta-pytorch/torchcomms/issues Found a bug or have a feature request? Let us know on GitHub. ---- .. raw:: html Why torchcomms? --------------- torchcomms addresses several key challenges in distributed PyTorch training: * **Simplified API**: Clean, object-oriented interface that abstracts away low-level communication details * **Backend Flexibility**: Easily switch between different communication backends (NCCLX, RCCLX, NCCL, RCCL, Gloo) without changing your code * **Production-Ready**: NCCLX backend is battle-tested in Meta's production environments powering large-scale AI workloads * **Type Safety**: Full type hints and validation for better development experience * **Performance**: Optimized implementations with support for GPU-accelerated communication Key Features ------------ Multiple Backends ^^^^^^^^^^^^^^^^^ torchcomms supports several communication backends out of the box: * **NCCLX**: Meta's enhanced NCCL implementation with additional optimizations * **RCCLX**: Meta's enhanced RCCL implementation with additional optimizations * **NCCL**: NVIDIA's Collective Communications Library for multi-GPU communication * **RCCL**: AMD ROCm Collective Communications Library for AMD GPUs * **XCCL**: Intel's Collective Communications Library for Intel GPUs * **Gloo**: Facebook's collective communications library for both CPU and GPU Comprehensive Collective Operations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ All standard distributed operations are supported: * AllReduce, ReduceScatter, AllGather * Broadcast, Reduce, Scatter, Gather * Send, Recv for point-to-point communication * Support for both synchronous and asynchronous operations Flexible Group Management ^^^^^^^^^^^^^^^^^^^^^^^^^^ Create and manage process groups with ease: * Initialize groups with different backends * Support for sub-groups and hierarchical communication patterns * Automatic resource management and cleanup .. raw:: html ---- .. toctree:: :maxdepth: 2 :caption: Contents :hidden: getting_started api hooks