PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences
Pingyi Huo, Anusha Devulapally, Hasan Al Maruf, Minseo Park, Krishnakumar Nair, Meena Arunachalam, Gulsum Gudukbay Akbulut, Mahmut Taylan Kandemir, Vijaykrishnan Narayanan
Deep Learning Recommendation Models (DLRMs) have become increasingly popular and prevalent in today's datacenters, consuming most of the AI inference cycles. The performance of DLRMs is heavily influenced by available bandwidth due to their large embedding-table vector sizes and concurrent accesses. To achieve substantial improvements over existing solutions, novel approaches to DLRM optimization are needed, especially in the context of emerging interconnect technologies like CXL. This study explores CXL-enabled systems and implements a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems, identifying the predominant bottlenecks in existing CXL systems. We therefore propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through the downstream ports of the fabric switch. PIFS-Rec achieves 3.89x lower latency than Pond, an industry-standard CXL-based system, and also outperforms BEACON, a state-of-the-art scheme, by 2.03x.
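The bandwidth argument behind near-data processing for DLRM embedding lookups can be illustrated with a toy sketch. This is not PIFS-Rec's actual implementation; it is a hypothetical model contrasting host-side pooling (every gathered row crosses the CXL link) with pooling performed behind a fabric-switch downstream port (only the reduced vector crosses the link).

```python
# Hypothetical sketch (not PIFS-Rec's code): sum-pooling of embedding rows
# done on the host vs. near the memory behind a fabric-switch port.

def host_side_pooling(table, indices, dim):
    """Baseline: host fetches every row, then pools. Traffic ~ len(indices) * dim."""
    rows = [table[i] for i in indices]          # each row crosses the CXL link
    pooled = [sum(col) for col in zip(*rows)]   # sum-pooling on the host
    return pooled, len(indices) * dim           # (result, words moved over the link)

def switch_side_pooling(table, indices, dim):
    """Near-data: pooling runs at the downstream port. Traffic ~ dim."""
    pooled = [0.0] * dim
    for i in indices:                            # rows never leave the switch side
        for d in range(dim):
            pooled[d] += table[i][d]
    return pooled, dim                           # only the pooled vector moves

# Toy embedding table: 8 rows of dimension 4; one sparse lookup of 3 rows.
table = [[float(r + d) for d in range(4)] for r in range(8)]
lookup = [1, 3, 5]

host_res, host_traffic = host_side_pooling(table, lookup, 4)
nd_res, nd_traffic = switch_side_pooling(table, lookup, 4)
assert host_res == nd_res        # identical result either way
print(host_traffic, nd_traffic)  # 12 vs 4 words over the link
```

With real pooling factors (tens to hundreds of rows per lookup) and concurrent queries, this traffic gap is what near-data schemes like PIFS-Rec exploit.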
How Flexible Is CXL's Memory Protection?
Samuel W. Stark is a Ph.D. student and Harding Scholar in the Department of Computer Science and Technology at the University of Cambridge, U.K., where he is studying the wider applications of capabilities for shared-memory systems with Simon Moore. A. Theodore Markettos is a senior research associate in the Department of Computer Science and Technology at the University of Cambridge, U.K., where he co-leads the CAPcelerate project, which is researching the use of capabilities for securing distributed distrustful accelerators. Simon W. Moore is a professor of computer engineering in the Department of Computer Science and Technology at the University of Cambridge, U.K., where he conducts research and teaching in the general area of computer architecture, with particular interests in secure and rigorously engineered processors and subsystems.
Failure Tolerant Training with Persistent Memory Disaggregation over CXL
Miryeong Kwon, Junhyeok Jang, Hanjin Choi, Sangwon Lee, Myoungsoo Jung
This paper proposes TRAININGCXL, which can efficiently process large-scale recommendation datasets in a pool of disaggregated memory while making training fault tolerant with low overhead. To this end, i) we integrate persistent memory (PMEM) and the GPU into a cache-coherent domain as a CXL Type 2 device. Enabling CXL allows PMEM to be placed directly in the GPU's memory hierarchy, such that the GPU can access PMEM without software intervention. TRAININGCXL introduces computing and checkpointing logic near the CXL controller, thereby performing training on data and managing persistence in an active manner. Considering PMEM's vulnerability, ii) we utilize the unique characteristics of recommendation models and take the checkpointing overhead off the critical path of their training. Lastly, iii) TRAININGCXL employs an advanced checkpointing technique that relaxes the updating sequence of model parameters and embeddings across training batches. The evaluation shows that TRAININGCXL achieves a 5.2x training performance improvement and 76% energy savings compared to modern PMEM-based recommendation systems.
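The intuition behind taking checkpointing off the critical path is that recommendation-model updates are sparse: each batch touches only a few embedding rows, so only those rows need re-persisting, and their flush can lag the dense-parameter checkpoint. The sketch below is a hypothetical illustration of that idea, not TRAININGCXL's code; `RelaxedCheckpointer` and its dictionary stand-in for PMEM are invented for this example.

```python
# Hypothetical sketch (not TRAININGCXL's implementation): persist only the
# embedding rows touched since the last flush, instead of a full snapshot.

class RelaxedCheckpointer:
    def __init__(self, num_rows, dim):
        self.embeddings = [[0.0] * dim for _ in range(num_rows)]
        self.dirty = set()       # rows updated since the last flush
        self.persisted = {}      # stand-in for PMEM-resident row copies

    def update_row(self, row, grad):
        # Sparse embedding update: only the touched row becomes dirty.
        self.embeddings[row] = [w - 0.1 * g
                                for w, g in zip(self.embeddings[row], grad)]
        self.dirty.add(row)

    def flush(self):
        # Persist only dirty rows -- far cheaper than snapshotting the whole
        # table, and it may trail the parameter checkpoint by a few batches.
        for row in self.dirty:
            self.persisted[row] = list(self.embeddings[row])
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed

ckpt = RelaxedCheckpointer(num_rows=100, dim=4)
ckpt.update_row(7, [1.0, 1.0, 1.0, 1.0])
ckpt.update_row(42, [2.0, 0.0, 0.0, 0.0])
n = ckpt.flush()
print(n)   # 2 rows persisted instead of all 100
```

Relaxing the ordering between the embedding flush and the parameter checkpoint is what lets the persistence work overlap with subsequent training batches.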