Chen, Yuxin
Using LLMs for Automated Privacy Policy Analysis: Prompt Engineering, Fine-Tuning and Explainability
Chen, Yuxin, Tang, Peng, Qiu, Weidong, Li, Shujun
Privacy policies are widely used by digital services and often required for legal purposes. Many machine learning based classifiers have been developed to automate detection of different concepts in a given privacy policy, which can help facilitate other automated tasks such as producing a more reader-friendly summary and detecting legal compliance issues. Despite the successful applications of large language models (LLMs) to many NLP tasks in various domains, there is very little work studying the use of LLMs for automated privacy policy analysis, so whether and how LLMs can help automate privacy policy analysis remains under-explored. To fill this research gap, we conducted a comprehensive evaluation of LLM-based privacy policy concept classifiers, employing both prompt engineering and LoRA (low-rank adaptation) fine-tuning, on four state-of-the-art (SOTA) privacy policy corpora and taxonomies. Our experimental results demonstrated that combining prompt engineering and fine-tuning can make LLM-based classifiers outperform other SOTA methods, \emph{significantly} and \emph{consistently} across privacy policy corpora/taxonomies and concepts. Furthermore, we evaluated the explainability of the LLM-based classifiers using three metrics: completeness, logicality, and comprehensibility. For all three metrics, a score exceeding 91.1\% was observed in our evaluation, indicating that LLMs are useful not only for improving classification performance, but also for enhancing the explainability of detection results.
Residual Policy Gradient: A Reward View of KL-regularized Objective
Wang, Pengcheng, Zhu, Xinghao, Chen, Yuxin, Xu, Chenfeng, Tomizuka, Masayoshi, Li, Chenran
Reinforcement Learning and Imitation Learning have achieved widespread success in many domains but remain constrained during real-world deployment. One of the main issues is that deployment often imposes additional requirements that were not considered during training. To address this challenge, policy customization has been introduced, aiming to adapt a prior policy while preserving its inherent properties and meeting new task-specific requirements. A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms. However, RQL has not yet been applied to policy gradient methods, which restricts its applicability, especially in tasks where policy gradient has already proven more effective. In this work, we first derive a concise form of Soft Policy Gradient as a preliminary. Building on this, we introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods, allowing policy customization in gradient-based RL settings. With the view of RPG, we rethink the KL-regularized objective widely used in RL fine-tuning. We show that under certain assumptions, the KL-regularized objective leads to a maximum-entropy policy that balances the inherent properties and task-specific requirements at the reward level. Our experiments in MuJoCo demonstrate the effectiveness of Soft Policy Gradient and Residual Policy Gradient.
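The reward-level reading of a KL-regularized objective can be illustrated with a standard identity. This is only a sketch, not the paper's derivation: the single-step setting and the symbols $r$ (task reward), $\beta$ (regularization weight) and $\pi_0$ (prior policy) are assumptions introduced here for brevity.

```latex
\begin{align}
J(\pi)
  &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ r(s,a) \big]
     - \beta \, \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_0(\cdot \mid s) \big) \\
  &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ r(s,a) + \beta \log \pi_0(a \mid s) \big]
     + \beta \, \mathcal{H}\big( \pi(\cdot \mid s) \big), \\
\pi^{*}(a \mid s)
  &\propto \pi_0(a \mid s) \, \exp\!\big( r(s,a) / \beta \big).
\end{align}
```

Under these assumptions, the second line is a maximum-entropy objective whose reward is the task reward plus $\beta \log \pi_0$, so the prior policy enters as an extra reward term rather than as a separate constraint.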
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation
Liang, Mingfu, Liu, Xi, Jin, Rong, Liu, Boyang, Suo, Qiuling, Zhou, Qinghai, Zhou, Song, Chen, Laming, Zheng, Hua, Li, Zhiyuan, Jiang, Shali, Yang, Jiyan, Xia, Xiaozhen, Yang, Fan, Badr, Yasmine, Wen, Ellie, Xu, Shuyu, Chen, Hansey, Zhang, Zhengyu, Nie, Jade, Yang, Chunzhi, Zeng, Zhichen, Zhang, Weilin, Huang, Xingliang, Li, Qianru, Wang, Shiquan, Lyu, Evelyn, Lu, Wenjing, Zhang, Rui, Wang, Wenjun, Rudy, Jason, Hang, Mengyue, Wang, Kai, Ma, Yinbin, Wang, Shuaiwen, Zeng, Sihan, Tang, Tongyi, Wei, Xiaohan, Jin, Longhao, Zhang, Jamey, Chen, Marcus, Zhang, Jiayi, Huang, Angie, Zhang, Chi, Zhao, Zhengli, Yang, Jared, Jin, Qiang, Chen, Xian, Amlesahwaram, Amit Anand, Song, Lexi, Luo, Liang, Hao, Yuchen, Xiao, Nan, Yetim, Yavuz, Pan, Luoshang, Liu, Gaoxiang, Hu, Yuxi, Huang, Yuzhen, Xu, Jackie, Zhu, Rich, Zhang, Xin, Liu, Yiqun, Yin, Hang, Chen, Yuxin, Zhang, Buyun, Liu, Xiaoyi, Wang, Xingyuan, Mao, Wenguang, Li, Zhijing, Huang, Qin, Sun, Chonglin, Yu, Nancy, Gu, Shuo, Mao, Shupin, Au, Benjamin, Qin, Jingzheng, Yao, Peggy, Choi, Jae-Woo, Gao, Bin, Wang, Ernest, Zhang, Lei, Chen, Wen-Yen, Lee, Ted, Zha, Jay, Meng, Yi, Gong, Alex, Gao, Edison, Vahdatpour, Alireza, Han, Yiping, Yao, Yantao, Kureha, Toshinari, Chang, Shuo, Sultan, Musharaf, Bocharov, John, Chordia, Sagar, Gan, Xiaorui, Sun, Peng, Liu, Rocky, Long, Bo, Chen, Wenlin, Kolay, Santanu, Li, Huayu
Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up the recommendation model and improving its design can bring significant performance improvements. However, as model scale grows, such studies diverge increasingly from industrial practice, because they often neglect two fundamental challenges in industrial-scale applications. First, the model to be served has restricted training and inference budgets; exceeding them may increase latency and impair user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address these overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher as a foundation model (FM) that can serve multiple students as vertical models (VMs) to amortize its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gains by ExFM.
Physics-Aware Robotic Palletization with Online Masking Inference
Zhang, Tianqi, Wu, Zheng, Chen, Yuxin, Wang, Yixiao, Liang, Boyuan, Moura, Scott, Tomizuka, Masayoshi, Ding, Mingyu, Zhan, Wei
The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic and physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy toward valid actions. Unlike previous methods that rely on heuristic stability assessments, which are difficult to carry out in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-art methods. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings.
I. INTRODUCTION
In modern warehouse and logistics management, stacking boxes continues to be a common challenge. In the past, due to the smaller scale of trade and lower efficiency requirements, workers could rely on their experience to decide how each box should be placed. However, with the globalization of trade, there is a growing need for fast and stable box stacking, and a good solution for this is robotic palletization [1] [2].
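Action space masking, as mentioned in the abstract above, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the function name `masked_softmax` is hypothetical, and the mask is simply given as input here, whereas the paper trains it online.

```python
import math

def masked_softmax(logits, valid_mask):
    """Turn policy logits into probabilities, giving zero mass to invalid actions.

    logits: list of floats, one per candidate placement (hypothetical action set).
    valid_mask: list of bools, True where the action is considered valid.
    """
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(l for l, ok in zip(logits, valid_mask) if ok)
    weights = [math.exp(l - top) if ok else 0.0
               for l, ok in zip(logits, valid_mask)]
    total = sum(weights)
    return [w / total for w in weights]

# Four candidate placements; the second and fourth are masked out.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, False, True, False])
```

Masked actions receive exactly zero probability, so the policy can only sample placements the mask allows, and the remaining probabilities are renormalized over the valid actions.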
MixDec Sampling: A Soft Link-based Sampling Method of Graph Neural Network for Recommendation
Xie, Xiangjin, Chen, Yuxin, Wang, Ruipeng, Ouyang, Kai, Zhang, Zihan, Zheng, Hai-Tao, Qian, Buyue, Zheng, Hansen, Hu, Bo, Zhuo, Chengxiang, Li, Zang
Graph neural networks have been widely used in recent recommender systems, where negative sampling plays an important role. Existing negative sampling methods restrict the relationship between nodes to either hard positive pairs or hard negative pairs. This leads to a loss of structural information and provides no mechanism for generating positive pairs for nodes with few neighbors. To overcome these limitations, we propose a novel soft link-based sampling method, namely MixDec Sampling, which consists of a Mixup Sampling module and a Decay Sampling module. The Mixup Sampling module augments node features by synthesizing new nodes and soft links, which provides a sufficient number of samples for nodes with few neighbors. The Decay Sampling module strengthens the digestion of graph structure information by generating soft links for node embedding learning. To the best of our knowledge, we are the first to model sampling relationships between nodes by soft links in GNN-based recommender systems. Extensive experiments demonstrate that the proposed MixDec Sampling can significantly and consistently improve the recommendation performance of several representative GNN-based models on various recommendation benchmarks.
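The idea of synthesizing a new node with a soft link can be illustrated with generic mixup interpolation. This is a sketch under assumptions, not the MixDec algorithm itself: the function `mixup_pair`, the choice to mix a positive with a negative item, and the fixed coefficient `lam` are all hypothetical.

```python
def mixup_pair(feat_a, feat_b, label_a, label_b, lam):
    """Interpolate two node feature vectors and their link labels.

    lam in [0, 1] is the mixing coefficient; the returned label is a soft
    link strength between 0 (hard negative) and 1 (hard positive).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_feat, mixed_label

# Mix a positive item (label 1.0) with a negative item (label 0.0).
feat, label = mixup_pair([1.0, 0.0], [0.0, 1.0], 1.0, 0.0, lam=0.75)
# feat == [0.75, 0.25], label == 0.75
```

The synthesized node sits between the two originals in feature space, and its label of 0.75 acts as a soft link rather than a hard positive or hard negative pair.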
Active Advantage-Aligned Online Reinforcement Learning with Offline Data
Liu, Xuefeng, Le, Hung T. C., Chen, Siyu, Stevens, Rick, Yang, Zhuoran, Walter, Matthew R., Chen, Yuxin
Online reinforcement learning (RL) enhances policies through direct interactions with the environment, but faces challenges related to sample efficiency. In contrast, offline RL leverages extensive pre-collected data to learn policies, but often produces suboptimal results due to limited data coverage. Recent efforts have sought to integrate offline and online RL in order to harness the advantages of both approaches. However, effectively combining online and offline RL remains challenging due to issues that include catastrophic forgetting, lack of robustness and sample efficiency. In an effort to address these challenges, we introduce A3RL, a novel method that actively selects data from combined online and offline sources to optimize policy improvement. We provide a theoretical guarantee that validates the effectiveness of our active sampling strategy and conduct thorough empirical experiments showing that our method outperforms existing state-of-the-art online RL techniques that utilize offline data. Our code will be publicly available at: https://github.com/xuefeng-cs/A3RL.
DrugImproverGPT: A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization
Liu, Xuefeng, Jiang, Songhao, Chen, Siyu, Yang, Zhuoran, Chen, Yuxin, Foster, Ian, Stevens, Rick
Finetuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduces a novel reinforcement learning algorithm to finetune a drug optimization LLM-based generative model, enhancing the original drug across target objectives while retaining the beneficial chemical properties of the original drug. This work comprises two primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. It includes an LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective for fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from the SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: https://github.com/xuefeng-cs/DrugImproverGPT.
Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization
Fan, Jiajun, Shen, Shuaike, Cheng, Chaoran, Chen, Yuxin, Liang, Chumeng, Liu, Ge
Recent advancements in reinforcement learning (RL) have achieved great success in fine-tuning diffusion-based generative models. However, fine-tuning continuous flow-based generative models to align with arbitrary user-defined reward functions remains challenging, particularly due to issues such as policy collapse from overoptimization and the prohibitively high computational cost of likelihoods in continuous-time flows. In this paper, we propose an easy-to-use and theoretically sound RL fine-tuning method, which we term Online Reward-Weighted Conditional Flow Matching with Wasserstein-2 Regularization (ORW-CFM-W2). Our method integrates RL into the flow matching framework to fine-tune generative models with arbitrary reward functions, without relying on gradients of rewards or filtered datasets. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. To prevent policy collapse and maintain diversity, we incorporate Wasserstein-2 (W2) distance regularization into our method and derive a tractable upper bound for it in flow matching, effectively balancing exploration and exploitation of policy optimization. We provide theoretical analyses to demonstrate the convergence properties and induced data distributions of our method, establishing connections with traditional RL algorithms featuring Kullback-Leibler (KL) regularization and offering a more comprehensive understanding of the underlying mechanisms and learning behavior of our approach. Extensive experiments on tasks including target image generation, image compression, and text-image alignment demonstrate the effectiveness of our method, where our method achieves optimal policy convergence while allowing controllable trade-offs between reward maximization and diversity preservation.
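A reward-weighting step of the kind described above can be sketched as follows. This is an illustration under assumptions rather than the ORW-CFM-W2 method: the exponential weighting form, the `temperature` parameter, and the function names are hypothetical, and the per-sample losses stand in for whatever flow matching loss is used. The Wasserstein-2 regularizer is omitted.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Normalized weights proportional to exp(reward / temperature)."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    top = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def reward_weighted_loss(per_sample_losses, rewards, temperature=1.0):
    """Weighted average of per-sample losses, emphasizing high-reward samples."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, per_sample_losses))

# Three samples drawn from the current model, scored by a reward function.
weights = reward_weights([0.0, 1.0, 2.0], temperature=1.0)
loss = reward_weighted_loss([0.5, 0.4, 0.3], [0.0, 1.0, 2.0], temperature=1.0)
```

Only reward values are needed, not reward gradients, which matches the abstract's description. In this sketch a lower temperature concentrates the weight on the highest-reward samples, which is the regime where a diversity-preserving regularizer becomes important.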
Low-dimensional adaptation of diffusion models: Convergence in total variation
Liang, Jiadong, Huang, Zhihan, Chen, Yuxin
This paper investigates how diffusion generative models leverage (unknown) low-dimensional structure to accelerate sampling. Focusing on two mainstream samplers -- the denoising diffusion implicit model (DDIM) and the denoising diffusion probabilistic model (DDPM) -- and assuming accurate score estimates, we prove that their iteration complexities are no greater than the order of $k/\varepsilon$ (up to some log factor), where $\varepsilon$ is the precision in total variation distance and $k$ is some intrinsic dimension of the target distribution. Our results are applicable to a broad family of target distributions without requiring smoothness or log-concavity assumptions. Further, we develop a lower bound that suggests the (near) necessity of the coefficients introduced by Ho et al. (2020) and Song et al. (2020) in facilitating low-dimensional adaptation. Our findings provide the first rigorous evidence for the adaptivity of the DDIM-type samplers to unknown low-dimensional structure, and improve over the state-of-the-art DDPM theory regarding total variation convergence.
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Wang, Chaoqi, Zhao, Zhuokai, Jiang, Yibo, Chen, Zhaorun, Zhu, Chen, Chen, Yuxin, Liu, Jiayi, Zhang, Lizhu, Fan, Xiangjun, Ma, Hao, Wang, Sinong
Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities in generating coherent, contextually appropriate responses across a wide range of tasks (Brown et al., 2020). A key approach to further refine these models is Reinforcement Learning from Human Feedback (RLHF), which leverages human evaluations to guide the training process and align model outputs more closely with human preferences (Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022; Wang et al., 2024). RLHF typically involves training a reward model to capture human preferences, which is then used to fine-tune LLMs via reinforcement learning (RL) (Schulman et al., 2017; Chen et al., 2024b,f). Despite the success of RLHF, reward modeling is inherently prone to spurious correlations, which are associations in the training data that do not reflect true causal relationships (Veitch et al., 2021), and can lead to unintended biases and induce reward hacking (McMilin, 2022). Reward hacking occurs when RL agents exploit flaws or ambiguities in the reward function to maximize rewards without genuinely improving alignment with desired behaviors or completing designed tasks (Amodei et al., 2016; Weng, 2024). Consequently, this leads to misaligned models that exhibit biases such as favoring longer outputs (length bias) (Zheng et al., 2023), agreeing with a user's incorrect assertions (sycophancy bias) (Perez et al., 2022), developing unintended shortcuts when making predictions (concept bias) (Zhou et al., 2023), and implicitly developing discrimination against certain demographic groups (discrimination bias) (Tamkin et al., 2023; Chen et al., 2024c). These biases, rooted in spurious correlations and reward hacking rather than true causal relationships, undermine the reliability and trustworthiness of LLMs, posing significant challenges for their safe and responsible deployment in real-world applications (Anwar et al., 2024; Qi et al., 2024).
To understand and mitigate these issues, it is essential to consider the sources of error in reward modeling.