Accelerating Frontier MoE Training with 3D Integrated Optics
Mikhail Bernadskiy, Peter Carson, Thomas Graham, Taylor Groves, Ho John Lee, Eric Yeh
–arXiv.org Artificial Intelligence
The unabated growth in AI workload demands is driving the need for concerted advances in compute, memory, and interconnect performance. As traditional semiconductor scaling slows, high-speed interconnects have emerged as the new scaling engine, enabling the creation of larger logical GPUs by linking many GPUs into a single, low-latency, high-bandwidth compute domain. While initial scale-up fabrics leveraged copper interconnects for their power and cost advantages, the maximum reach of passive electrical interconnects (approximately 1 meter) effectively limits the scale-up domain to a single rack. The advent of 3D-stacked optics and logic offers a transformative, power-efficient scale-up solution for connecting hundreds of GPU packages (thousands of GPUs) across multiple data center racks. This work explores the design tradeoffs of scale-up technologies and demonstrates how frontier LLMs necessitate novel photonic solutions to achieve aggressive power and performance targets. We model the benefits of 3D co-packaged optics (CPO; Passage) enabled GPUs and switches within the scale-up domain when training frontier Mixture of Experts (MoE) models exceeding one trillion parameters. Our results show that the substantial increases in bandwidth and radix enabled by 3D CPO allow for an 8X increase in scale-up capability.

The race to build larger, more sophisticated AI models is pushing the limits of existing infrastructure. At the chip and package level, GPUs are constrained by shoreline, yield, and power. These challenges have led to the development of large, high-bandwidth, low-latency scale-up pods, which effectively combine hundreds of GPUs into a single logical GPU to support a variety of parallelism strategies. Approaches like Mixture of Experts (MoE) [1] have pushed scale-up networks to their limits because copper reach (approximately 1 meter) constrains the number of GPUs that can be connected within a single network hop. In an MoE model, an ensemble of specialized sub-networks (experts) works together through sparse activation to increase model capacity without significantly increasing computational cost: a gating function selects a small subset of experts for each input, and the outputs of the selected experts are combined to produce the final result.
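To make the sparse routing described above concrete, the minimal NumPy sketch below shows a single top-k MoE layer: a gating projection picks a few experts per token, and their outputs are combined with the renormalized gate weights. This is purely illustrative; the gating matrix W_gate, the per-expert MLPs, and all sizes are assumptions for the example, not the architecture or implementation modeled in the paper.

```python
# Illustrative top-k MoE layer (toy sizes and parameters; not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

# Toy parameters: one gating matrix and a small two-layer MLP per expert.
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route each token to its top-k experts and combine their outputs."""
    logits = x @ W_gate                                   # [tokens, n_experts]
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # indices of the selected experts
    gate = np.take_along_axis(logits, top, axis=-1)
    gate = np.exp(gate - gate.max(axis=-1, keepdims=True))
    gate /= gate.sum(axis=-1, keepdims=True)              # renormalized gate weights

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # sparse: only top_k experts run per token
        for slot in range(top_k):
            w_in, w_out = experts[top[t, slot]]
            h = np.maximum(x[t] @ w_in, 0.0)              # expert MLP with ReLU
            out[t] += gate[t, slot] * (h @ w_out)
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)                            # (4, 64)
```

In a frontier-scale deployment the experts are sharded across many GPUs, so routed tokens must traverse the scale-up fabric, which is why MoE training stresses the bandwidth and radix of the scale-up domain discussed in this work.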
Oct-21-2025