Li, Karen
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
Prabhakar, Raghu, Sivaramakrishnan, Ram, Gandhi, Darshan, Du, Yun, Wang, Mingran, Song, Xiangyu, Zhang, Kejie, Gao, Tianren, Wang, Angela, Li, Karen, Sheng, Yongning, Brot, Joshua, Sokolov, Denis, Vivek, Apurv, Leung, Calvin, Sabnis, Arjun, Bai, Jiayu, Zhao, Tuowen, Gottscho, Mark, Jackson, David, Luttrell, Mark, Shah, Manish K., Chen, Edison, Liang, Kaizhao, Jain, Swayambhoo, Thakker, Urmish, Huang, Dawei, Jairath, Sumti, Brown, Kevin J., Olukotun, Kunle
Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.
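The operational-intensity point can be made concrete with back-of-the-envelope roofline arithmetic (a sketch with illustrative numbers, not figures from the paper): a GEMM of shape (m, k) x (k, n) performs 2mnk FLOPs while moving roughly m·k + k·n + m·n elements, so small-batch inference against a large weight matrix is memory-bound unless operators are fused to keep intermediates on chip.

```python
def matmul_operational_intensity(m: int, n: int, k: int,
                                 bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m, k) x (k, n) GEMM, assuming each operand
    and the result cross the memory hierarchy exactly once (fp16 default).
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Batch-1 decoding against a 4096x4096 weight matrix yields roughly
# 1 FLOP/byte (memory-bound), while a large-batch GEMM approaches the
# compute-bound regime by three orders of magnitude.
decode = matmul_operational_intensity(1, 4096, 4096)
batched = matmul_operational_intensity(4096, 4096, 4096)
```

The gap between the two numbers is why unfused small-expert kernels struggle to utilize a compute-heavy accelerator.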
A Novel Low-Cost, Recyclable, Easy-to-Build Robot Blimp For Transporting Supplies in Hard-to-Reach Locations
Li, Karen, Hou, Shuhang, Negash, Matyas, Xu, Jiawei, Jeffs, Edward, D'Antonio, Diego S., Saldaña, David
Rural communities in remote areas often encounter significant challenges in accessing emergency healthcare services and essential supplies due to a lack of adequate transportation infrastructure. The situation is further exacerbated by poorly maintained, damaged, or flooded roads, making it arduous for rural residents to obtain the necessary aid in critical situations. Limited budgets and technological constraints pose additional obstacles, hindering the prompt response of local rescue teams during emergencies. The transportation of crucial resources, such as medical supplies and food, plays a vital role in saving lives in these situations. In light of these obstacles, our objective is to improve accessibility and alleviate the suffering of vulnerable populations by automating transportation tasks using low-cost robotic systems. We propose a low-cost, easy-to-build blimp robot, a type of unmanned aerial vehicle (UAV), that can significantly enhance the efficiency and effectiveness of local emergency responses.
Land Use Prediction using Electro-Optical to SAR Few-Shot Transfer Learning
Hussing, Marcel, Li, Karen, Eaton, Eric
Satellite image analysis has important implications for land use, urbanization, and ecosystem monitoring. Deep learning methods can facilitate the analysis of different satellite modalities, such as electro-optical (EO) and synthetic aperture radar (SAR) imagery, by supporting knowledge transfer between the modalities to compensate for individual shortcomings. Recent progress has shown how distributional alignment of neural network embeddings can produce powerful transfer learning models by employing a sliced Wasserstein distance (SWD) loss. We analyze how this method can be applied to Sentinel-1 and -2 satellite imagery and develop several extensions toward making it effective in practice. In an application to few-shot Local Climate Zone (LCZ) prediction, we show that these networks outperform multiple common baselines on datasets with a large number of classes. Further, we provide evidence that instance normalization can significantly stabilize the training process and that explicitly shaping the embedding space using supervised contrastive learning can lead to improved performance.
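The sliced Wasserstein distance at the core of the alignment loss can be sketched in a few lines of NumPy (a minimal illustration of the metric itself, not the authors' training code; the function and parameter names here are my own): project both embedding sets onto random unit directions, then average the 1-D Wasserstein distances, each of which reduces to comparing sorted projections.

```python
import numpy as np

def sliced_wasserstein_distance(x, y, n_projections=128, seed=0):
    """Approximate SWD between two point clouds x, y of shape (n, d).

    Assumes x and y contain the same number of points, so each 1-D
    Wasserstein distance is the mean absolute difference of the sorted
    projections along a random unit direction.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    theta = rng.normal(size=(d, n_projections))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # unit directions
    proj_x = np.sort(x @ theta, axis=0)  # (n, n_projections)
    proj_y = np.sort(y @ theta, axis=0)
    return float(np.mean(np.abs(proj_x - proj_y)))
```

In a transfer-learning setting such as the one described above, the two point clouds would be batches of EO and SAR embeddings, and the distance would be minimized as a differentiable loss in the training framework rather than computed post hoc in NumPy.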