Fan, Ying
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
Zeng, Thomas, Zhang, Shuibai, Wu, Shutong, Classen, Christian, Chae, Daewon, Ewer, Ethan, Lee, Minjae, Kim, Heeju, Kang, Wonjun, Kunde, Jackson, Fan, Ying, Kim, Jungtaek, Koo, Hyung Il, Ramchandran, Kannan, Papailiopoulos, Dimitris, Lee, Kangwook
In particular, Outcome Reward Models (ORMs) are Process Reward Models (PRMs) have proven used to provide supervision based solely on the correctness effective at enhancing mathematical reasoning of the final outcome. However, ORMs fail to address errors for Large Language Models (LLMs) by leveraging in intermediate steps, limiting their effectiveness for increased inference-time computation. However, complex, multi-step reasoning tasks (Luo et al., 2024; Lightman they are predominantly trained on mathematical et al., 2024; Sun et al., 2024). Because ORMs suffer data and their generalizability to nonmathematical from this limitation, Process Reward Models (PRMs) have domains has not been rigorously been proposed to offer fine-grained, step-by-step feedback studied. In response, this work first shows that on the correctness of each reasoning step (Lightman et al., current PRMs have poor performance in other 2024; Uesato et al., 2022). PRMs have proven highly effective domains. To address this limitation, we introduce during inference, improving the reranking of generated VersaPRM, a multi-domain PRM trained solutions and guiding LLMs through search-based on synthetic reasoning data generated using our algorithms (Wan et al., 2024; Wang et al., 2024a).
Looped Transformers for Length Generalization
Fan, Ying, Du, Yilun, Ramchandran, Kannan, Lee, Kangwook
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
How to Solve Contextual Goal-Oriented Problems with Offline Datasets?
Fan, Ying, Li, Jingling, Swaminathan, Adith, Modi, Aditya, Cheng, Ching-An
We present a novel method, Contextual goal-Oriented Data Augmentation (CODA), which uses commonly available unlabeled trajectories and context-goal pairs to solve Contextual Goal-Oriented (CGO) problems. By carefully constructing an action-augmented MDP that is equivalent to the original MDP, CODA creates a fully labeled transition dataset under training contexts without additional approximation error. We conduct a novel theoretical analysis to demonstrate CODA's capability to solve CGO problems in the offline data setup. Empirical results also showcase the effectiveness of CODA, which outperforms other baseline methods across various context-goal relationships of CGO problem. This approach offers a promising direction to solving CGO problems using offline datasets.
Domain Generalization via Nuclear Norm Regularization
Shi, Zhenmei, Ming, Yifei, Fan, Ying, Sala, Frederic, Liang, Yingyu
The ability to generalize to unseen domains is crucial for machine learning systems deployed in the real world, especially when we only have data from limited training domains. In this paper, we propose a simple and effective regularization method based on the nuclear norm of the learned features for domain generalization. Intuitively, the proposed regularizer mitigates the impacts of environmental features and encourages learning domain-invariant features. Theoretically, we provide insights into why nuclear norm regularization is more effective compared to ERM and alternative regularization methods. Empirically, we conduct extensive experiments on both synthetic and real datasets. We show nuclear norm regularization achieves strong performance compared to baselines in a wide range of domain generalization tasks. Moreover, our regularizer is broadly applicable with various methods such as ERM and SWAD with consistently improved performance, e.g., 1.7% and 0.9% test accuracy improvements respectively on the DomainBed benchmark.
DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
Fan, Ying, Watkins, Olivia, Du, Yuqing, Liu, Hao, Ryu, Moonkyung, Boutilier, Craig, Abbeel, Pieter, Ghavamzadeh, Mohammad, Lee, Kangwook, Lee, Kimin
Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigated, fine-tuning text-to-image models with the reward function remains challenging. In this work, we propose using online reinforcement learning (RL) to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem, and updating the pre-trained text-to-image diffusion models using policy gradient to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization. We conduct an analysis of KL regularization for both RL fine-tuning and supervised fine-tuning. In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality. Our code is available at https://github.com/google-research/google-research/tree/master/dpok.
Optimizing DDPM Sampling with Shortcut Fine-Tuning
Fan, Ying, Lee, Kangwook
In this study, we propose Shortcut Fine-Tuning (SFT), a new approach for addressing the challenge of fast sampling of pretrained Denoising Diffusion Probabilistic Models (DDPMs). SFT advocates for the fine-tuning of DDPM samplers through the direct minimization of Integral Probability Metrics (IPM), instead of learning the backward diffusion process. This enables samplers to discover an alternative and more efficient sampling shortcut, deviating from the backward diffusion process. Inspired by a control perspective, we propose a new algorithm SFT-PG: Shortcut Fine-Tuning with Policy Gradient, and prove that under certain assumptions, gradient descent of diffusion models with respect to IPM is equivalent to performing policy gradient. To our best knowledge, this is the first attempt to utilize reinforcement learning (RL) methods to train diffusion models. Through empirical evaluation, we demonstrate that our fine-tuning method can further enhance existing fast DDPM samplers, resulting in sample quality comparable to or even surpassing that of the full-step model across various datasets.
Score-based Generative Modeling Secretly Minimizes the Wasserstein Distance
Kwon, Dohyun, Fan, Ying, Lee, Kangwook
Score-based generative models are shown to achieve remarkable empirical performances in various applications such as image generation and audio synthesis. However, a theoretical understanding of score-based diffusion models is still incomplete. Recently, Song et al. showed that the training objective of score-based generative models is equivalent to minimizing the Kullback-Leibler divergence of the generated distribution from the data distribution. In this work, we show that score-based models also minimize the Wasserstein distance between them under suitable assumptions on the model. Specifically, we prove that the Wasserstein distance is upper bounded by the square root of the objective function up to multiplicative constants and a fixed constant offset. Our proof is based on a novel application of the theory of optimal transport, which can be of independent interest to the society. Our numerical experiments support our findings. By analyzing our upper bounds, we provide a few techniques to obtain tighter upper bounds.
Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction
Qi, Pi, Zhu, Xiaoqiang, Zhou, Guorui, Zhang, Yujing, Wang, Zhe, Ren, Lejian, Fan, Ying, Gai, Kun
Rich user behavior data has been proven to be of great value for click-through rate prediction tasks, especially in industrial applications such as recommender systems and online advertising. Both industry and academy have paid much attention to this topic and propose different approaches to modeling with long sequential user behavior data. Among them, memory network based model MIMN proposed by Alibaba, achieves SOTA with the co-design of both learning algorithm and serving system. MIMN is the first industrial solution that can model sequential user behavior data with length scaling up to 1000. However, MIMN fails to precisely capture user interests given a specific candidate item when the length of user behavior sequence increases further, say, by 10 times or more. This challenge exists widely in previously proposed approaches. In this paper, we tackle this problem by designing a new modeling paradigm, which we name as Search-based Interest Model (SIM). SIM extracts user interests with two cascaded search units: (i) General Search Unit acts as a general search from the raw and arbitrary long sequential behavior data, with query information from candidate item, and gets a Sub user Behavior Sequence which is relevant to candidate item; (ii) Exact Search Unit models the precise relationship between candidate item and SBS. This cascaded search paradigm enables SIM with a better ability to model lifelong sequential behavior data in both scalability and accuracy. Apart from the learning algorithm, we also introduce our hands-on experience on how to implement SIM in large scale industrial systems. Since 2019, SIM has been deployed in the display advertising system in Alibaba, bringing 7.1\% CTR and 4.4\% RPM lift, which is significant to the business. Serving the main traffic in our real system now, SIM models user behavior data with maximum length reaching up to 54000, pushing SOTA to 54x.
Efficient Model-Free Reinforcement Learning Using Gaussian Process
Fan, Ying, Chen, Letian, Wang, Yizhou
Efficient Reinforcement Learning usually takes advantage of demonstration or good exploration strategy. By applying posterior sampling in model-free RL under the hypothesis of GP, we propose Gaussian Process Posterior Sampling Reinforcement Learning(GPPSTD) algorithm in continuous state space, giving theoretical justifications and empirical results. We also provide theoretical and empirical results that various demonstration could lower expected uncertainty and benefit posterior sampling exploration. In this way, we combined the demonstration and exploration process together to achieve a more efficient reinforcement learning.
Deep Interest Evolution Network for Click-Through Rate Prediction
Zhou, Guorui, Mou, Na, Fan, Ying, Pi, Qi, Bian, Weijie, Zhou, Chang, Zhu, Xiaoqiang, Gai, Kun
Click-through rate~(CTR) prediction, whose goal is to estimate the probability of the user clicks, has become one of the core tasks in advertising systems. For CTR prediction model, it is necessary to capture the latent user interest behind the user behavior data. Besides, considering the changing of the external environment and the internal cognition, user interest evolves over time dynamically. There are several CTR prediction methods for interest modeling, while most of them regard the representation of behavior as the interest directly, and lack specially modeling for latent interest behind the concrete behavior. Moreover, few work consider the changing trend of interest. In this paper, we propose a novel model, named Deep Interest Evolution Network~(DIEN), for CTR prediction. Specifically, we design interest extractor layer to capture temporal interests from history behavior sequence. At this layer, we introduce an auxiliary loss to supervise interest extracting at each step. As user interests are diverse, especially in the e-commerce system, we propose interest evolving layer to capture interest evolving process that is relative to the target item. At interest evolving layer, attention mechanism is embedded into the sequential structure novelly, and the effects of relative interests are strengthened during interest evolution. In the experiments on both public and industrial datasets, DIEN significantly outperforms the state-of-the-art solutions. Notably, DIEN has been deployed in the display advertisement system of Taobao, and obtained 20.7\% improvement on CTR.