MTrainS: Improving DLRM training efficiency using heterogeneous memories
Kassa, Hiwot Tadese, Johnson, Paul, Akers, Jason, Ghosh, Mrinmoy, Tulloch, Andrew, Mudigere, Dheevatsa, Park, Jongsoo, Liu, Xing, Dreslinski, Ronald, Ardestani, Ehsan K.
Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, the model size and complexity grow over time, which requires additional training data to avoid overfitting. This model growth demands substantial data-center resources, so training efficiency is becoming considerably more important to keep data-center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirements and locality of embedding tables in real-world deployed models. We observe that the bandwidth requirement is not uniform across different tables and that embedding tables show high temporal locality. We then design MTrainS, which leverages heterogeneous memory, including byte- and block-addressable Storage Class Memory, hierarchically for DLRM. MTrainS allows for higher memory capacity per node and increases training efficiency by lowering the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of training nodes by 4-8X, saving power and cost while meeting our target training performance.
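To make the tiered-placement idea concrete, below is a minimal, illustrative sketch (not the MTrainS implementation) of how embedding tables might be assigned to a memory hierarchy by bandwidth demand, keeping the most frequently accessed tables in the fastest tier. The tier names, capacities, and per-table statistics are assumptions for the example only.

```python
# Illustrative sketch only: greedily place embedding tables into a memory
# hierarchy by lookup rate (a proxy for bandwidth demand). Tier names,
# capacities, and table statistics below are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    size_gb: float           # embedding table capacity
    accesses_per_sec: float  # observed lookup rate

@dataclass
class Tier:
    name: str
    capacity_gb: float

def place_tables(tables, tiers):
    """Assign each table to the fastest tier that still has room,
    visiting tables in decreasing order of bandwidth demand.
    `tiers` must be ordered fastest -> slowest (e.g. HBM, DDR, SCM)."""
    remaining = {t.name: t.capacity_gb for t in tiers}
    placement = {}
    for tbl in sorted(tables, key=lambda t: t.accesses_per_sec, reverse=True):
        for tier in tiers:
            if tbl.size_gb <= remaining[tier.name]:
                remaining[tier.name] -= tbl.size_gb
                placement[tbl.name] = tier.name
                break
        else:
            raise RuntimeError(f"table {tbl.name} does not fit in any tier")
    return placement

if __name__ == "__main__":
    tiers = [Tier("HBM", 32), Tier("DDR", 256), Tier("SCM", 1024)]
    tables = [
        Table("user_id", 200, 5e6),
        Table("item_id", 400, 4e6),
        Table("rare_feature", 600, 1e4),
    ]
    print(place_tables(tables, tiers))
    # -> hot tables land in the fastest tier they fit in; cold ones fall to SCM
```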
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
Mudigere, Dheevatsa, Hao, Yuchen, Huang, Jianyu, Tulloch, Andrew, Sridharan, Srinivas, Liu, Xing, Ozdal, Mustafa, Nie, Jade, Park, Jongsoo, Luo, Liang, Yang, Jie Amy, Gao, Leon, Ivchenko, Dmytro, Basant, Aarti, Hu, Yuxi, Yang, Jiyan, Ardestani, Ehsan K., Wang, Xiaodong, Komuravelli, Rakesh, Chu, Ching-Hsiang, Yilmaz, Serhat, Li, Huayu, Qian, Jiyuan, Feng, Zhuobo, Ma, Yinbin, Yang, Junjie, Wen, Ellie, Li, Hong, Yang, Lin, Sun, Chonglin, Zhao, Whitney, Melts, Dimitry, Dhulipala, Krishna, Kishore, KR, Graf, Tyler, Eisenman, Assaf, Matam, Kiran Kumar, Gangidi, Adi, Chen, Guoqiang Jerry, Krishnan, Manoj, Nayak, Avinash, Nair, Krishnakumar, Muthiah, Bharath, Khorashadi, Mahmoud, Bhattacharya, Pallab, Lapukhov, Petr, Naumov, Maxim, Qiao, Lin, Smelyanskiy, Mikhail, Jia, Bill, Rao, Vijay
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper, we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance, scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology, and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchically partitioning the embedding tables along the row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; and (v) leveraging reduced-precision communications, a multi-level memory hierarchy (HBM+DDR+SSD), and pipelining. Furthermore, we develop and briefly comment on the distributed data ingestion and other supporting services required for robust and efficient end-to-end training in production environments.
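As a rough illustration of the sharding idea described in the abstract, the following sketch splits an embedding table into contiguous row ranges and greedily balances whole tables across workers by an assumed per-table cost. It is not the paper's sharding algorithm; the table names and costs are hypothetical.

```python
# Illustrative sketch only: row-wise partitioning of one embedding table and
# a greedy load balancer that assigns whole tables to workers by cost.
# Table names and costs are hypothetical examples.

def shard_row_wise(num_rows, num_shards):
    """Split [0, num_rows) into near-equal contiguous row ranges,
    one range per shard (worker)."""
    base, extra = divmod(num_rows, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def balance_tables(table_costs, num_workers):
    """Greedy bin packing: assign each table (largest cost first)
    to the currently least-loaded worker."""
    loads = [0.0] * num_workers
    assignment = {w: [] for w in range(num_workers)}
    for name, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        w = min(range(num_workers), key=lambda i: loads[i])
        loads[w] += cost
        assignment[w].append(name)
    return assignment

if __name__ == "__main__":
    # One large table split row-wise across 4 workers.
    print(shard_row_wise(num_rows=10_000_000, num_shards=4))
    # Several tables balanced across 2 workers by an assumed cost metric.
    print(balance_tables(
        {"user_id": 9.0, "item_id": 7.5, "ads_id": 3.0, "page_id": 1.0}, 2))
```

In practice a sharder would combine both dimensions (splitting very large or very hot tables and packing the resulting shards), but the two helpers above capture the basic row-wise split and load-balancing steps in isolation.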