Pillai, Natesh
Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications
Behdin, Kayhan, Dai, Yun, Fatahibaarzi, Ata, Gupta, Aman, Song, Qingquan, Tang, Shao, Sang, Hejian, Dexter, Gregory, Zhu, Sirou, Zhu, Siyu, Dharamsi, Tejas, Sanjabi, Maziar, Kothapalli, Vignesh, Firooz, Hamed, Fu, Zhoutong, Cao, Yihan, Hsu, Pin-Lun, Borisyuk, Fedor, Wang, Zhipeng, Mazumder, Rahul, Pillai, Natesh, Simon, Luke
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendations to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present methods and insights for training small language models (SLMs) that deliver high performance and efficiency in deployment. We focus on two key techniques: (1) knowledge distillation and (2) model compression via quantization and pruning. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training, serving costs, and latency. We detail the impact of these techniques on a variety of use cases at a large professional social network platform and share deployment lessons - including hardware optimization strategies that enhance speed and throughput for both predictive and reasoning-based applications.
AlphaPO -- Reward shape matters for LLM alignment
Gupta, Aman, Tang, Shao, Song, Qingquan, Zhu, Sirou, Hong, Jiwoo, Saha, Ankan, Gupta, Viral, Lee, Noah, Kim, Eunki, Zhu, Jason, Pillai, Natesh, Keerthi, S. Sathiya
Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Examples include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs the reward (function) shape matters. We introduce AlphaPO, a new DAA method that leverages an $\alpha$-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B. The analysis and results presented highlight the importance of the reward shape, and how one can systematically change it to affect training dynamics, as well as improve alignment performance.