
PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences

Huo, Pingyi, Devulapally, Anusha, Maruf, Hasan Al, Park, Minseo, Nair, Krishnakumar, Arunachalam, Meena, Akbulut, Gulsum Gudukbay, Kandemir, Mahmut Taylan, Narayanan, Vijaykrishnan

arXiv.org Artificial Intelligence

Deep Learning Recommendation Models (DLRMs) have become increasingly popular and prevalent in today's datacenters, consuming most of the AI inference cycles. The performance of DLRMs is heavily influenced by available bandwidth due to their large vector sizes in embedding tables and concurrent accesses. To achieve substantial improvements over existing solutions, novel approaches to DLRM optimization are needed, especially in the context of emerging interconnect technologies like CXL. This study explores CXL-enabled systems and implements a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems, identifying the predominant bottlenecks in existing CXL systems. We therefore propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through the downstream ports of the fabric switch. PIFS-Rec achieves a latency that is 3.89x lower than Pond, an industry-standard CXL-based system, and also outperforms BEACON, a state-of-the-art scheme, by 2.03x.
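The bandwidth pressure the abstract describes comes from DLRM embedding lookups: a sparse gather-and-pool that, on a conventional host, pulls every gathered row across the interconnect. A minimal sketch of why near-data processing helps (an illustrative model, not the PIFS-Rec implementation; function names are hypothetical):

```python
import numpy as np

def host_side_pool(table: np.ndarray, indices: list[int]) -> np.ndarray:
    """Baseline: transfer every gathered row to the host, then pool there."""
    rows = table[indices]            # len(indices) * dim values cross the link
    return rows.sum(axis=0)

def near_data_pool(table: np.ndarray, indices: list[int]) -> np.ndarray:
    """Near-data: pool where the table lives; only one vector crosses the link."""
    acc = np.zeros(table.shape[1], dtype=table.dtype)
    for i in indices:
        acc += table[i]              # accumulation happens near the memory
    return acc                       # only dim values cross the fabric

def bytes_moved(n_indices: int, dim: int, itemsize: int = 4) -> tuple[int, int]:
    """Link traffic for (baseline, near-data) pooling of one lookup batch."""
    return n_indices * dim * itemsize, dim * itemsize
```

Both functions compute the same pooled embedding; the traffic ratio is `n_indices` to 1, which is why moving the pooling into the fabric switch's downstream ports attacks the bandwidth bottleneck directly.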


How Flexible Is CXL's Memory Protection?

Communications of the ACM

Samuel W. Stark is a Ph.D. student and Harding Scholar in the Department of Computer Science and Technology at the University of Cambridge, U.K., where he is studying the wider applications of capabilities for shared-memory systems with Simon Moore. A. Theodore Markettos is a senior research associate in the Department of Computer Science and Technology at the University of Cambridge, U.K., where he co-leads the CAPcelerate project, which is researching the use of capabilities for securing distributed distrustful accelerators. Simon W. Moore is a professor of computer engineering in the Department of Computer Science and Technology at the University of Cambridge, U.K., where he conducts research and teaching in the general area of computer architecture, with particular interests in secure and rigorously engineered processors and subsystems.


Failure Tolerant Training with Persistent Memory Disaggregation over CXL

Kwon, Miryeong, Jang, Junhyeok, Choi, Hanjin, Lee, Sangwon, Jung, Myoungsoo

arXiv.org Artificial Intelligence

This paper proposes TRAININGCXL, which can efficiently process large-scale recommendation datasets in a pool of disaggregated memory while making training fault tolerant with low overhead. To this end, i) we integrate persistent memory (PMEM) and the GPU into a cache-coherent domain as a CXL Type-2 device. Enabling CXL allows PMEM to be placed directly in the GPU's memory hierarchy, so that the GPU can access PMEM without software intervention. TRAININGCXL introduces computing and checkpointing logic near the CXL controller, thereby processing training data and managing persistence in an active manner. Considering PMEM's vulnerability, ii) we utilize the unique characteristics of recommendation models and take the checkpointing overhead off the critical path of their training. Lastly, iii) TRAININGCXL employs an advanced checkpointing technique that relaxes the updating sequence of model parameters and embeddings across training batches. The evaluation shows that TRAININGCXL achieves a 5.2x training performance improvement and 76% energy savings compared to modern PMEM-based recommendation systems.
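The key idea in point ii) — taking checkpointing off the training critical path — can be sketched as an asynchronous writer: the trainer captures a cheap snapshot, and a background thread persists it while the next batch runs. This is a minimal illustration under assumed names (the `AsyncCheckpointer` class is hypothetical; a list stands in for the PMEM-backed store), not the TRAININGCXL mechanism itself:

```python
import copy
import threading

class AsyncCheckpointer:
    """Persist checkpoints in the background, off the training hot path."""

    def __init__(self):
        self._thread = None
        self.persisted = []                       # stand-in for PMEM storage

    def checkpoint(self, step: int, state: dict) -> None:
        snapshot = (step, copy.deepcopy(state))   # cheap capture on the hot path
        self.wait()                               # at most one write in flight
        self._thread = threading.Thread(target=self._persist, args=(snapshot,))
        self._thread.start()                      # write overlaps the next batch

    def _persist(self, snapshot) -> None:
        self.persisted.append(snapshot)           # would be a PMEM flush/commit

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()

# Toy training loop: updates proceed while earlier checkpoints are written.
ckpt = AsyncCheckpointer()
state = {"dense": [0.0], "embedding": [0.0]}
for step in range(3):
    state["dense"][0] += 1.0                      # simulated parameter update
    ckpt.checkpoint(step, state)
ckpt.wait()
```

Relaxing the update sequence across batches (point iii) goes further than this sketch: dense parameters and embeddings need not be persisted in lockstep, which shrinks the window in which the trainer must stall for persistence.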