RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems

Ottino, Alessandro, Benjamin, Joshua, Zervas, Georgios

Feb-24-2023–arXiv.org Artificial Intelligence

Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth, and over-subscription, all of which affect the completion time of communication and collective operations. We introduce RAMP, a near-exascale, full-bisection-bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration that supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver 1.3-16× and 7.8-58× reductions in Megatron and DLRM training time, respectively, while offering 42-53× and 3.3-12.4× improvements in energy consumption and cost, respectively.
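As a quick check on the scale quoted in the abstract, the per-node bandwidth and the maximum node count can be multiplied out. This is only a back-of-the-envelope sketch: it assumes that "near-exascale" refers to the sum of per-node bandwidth over all nodes, which is an interpretation and not something the abstract states. The names `PER_NODE_GBPS`, `MAX_NODES`, and `aggregate_capacity_bps` are hypothetical and chosen for illustration.

```python
# Back-of-the-envelope aggregate bandwidth from the abstract's figures.
# Assumption: "near-exascale" means per-node bandwidth summed over all nodes.

PER_NODE_GBPS = 12_800   # 12.8 Tbps per node, expressed in Gbps to stay in integers
MAX_NODES = 65_536       # maximum node count quoted in the abstract


def aggregate_capacity_bps(per_node_gbps: int, nodes: int) -> int:
    """Return the summed per-node bandwidth in bits per second."""
    return per_node_gbps * 10**9 * nodes


total_bps = aggregate_capacity_bps(PER_NODE_GBPS, MAX_NODES)

print(f"{total_bps / 1e15:.1f} Pbps")  # petabits per second
print(f"{total_bps / 1e18:.2f} Ebps")  # exabits per second
```

Under that assumption the total comes to roughly 839 Pbps, or about 0.84 Ebps, which is consistent with the "near-exascale" description.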

artificial intelligence, machine learning, operation

Country:
- North America > United States
  - Washington > King County
    - Renton (0.04)
  - New York > New York County
    - New York City (0.04)
- Europe
  - United Kingdom (0.14)
  - Switzerland > Basel-City
    - Basel (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - China (0.04)

Genre:
- Workflow (1.00)
- Research Report (0.64)

Industry:
- Information Technology (1.00)
- Energy (0.87)
- Telecommunications > Networks (0.84)

Technology:
- Information Technology
  - Communications > Networks (1.00)
  - Artificial Intelligence > Machine Learning
    - Neural Networks > Deep Learning (1.00)
