Wang, Mengdi
Regularized DeepIV with Model Selection
Li, Zihao, Lan, Hui, Syrgkanis, Vasilis, Wang, Mengdi, Uehara, Masatoshi
In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring a minimax computation oracle, which is highly unstable in practice; (3) lacking a model selection procedure. We present the first method and analysis that avoid all three limitations while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method, Regularized DeepIV (RDIV) regression, that converges to the least-norm IV solution. Our method consists of two stages: we first learn the conditional distribution of covariates, and then, using the learned distribution, learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method admits a model selection procedure that achieves oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov-regularized variant of the popular DeepIV method with a nonparametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization, which was absent from the original work.
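A minimal sketch of the two-stage recipe described above, assuming a Gaussian MLE first stage and a linear-in-features, ridge-penalized second stage as a stand-in for general function approximation; the simulated data, feature map, and variable names are illustrative and not taken from the paper.

```python
# Illustrative two-stage RDIV-style estimator on synthetic data (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

# Simulated IV problem: instrument Z, confounder U, structural function h0(x) = sin(x).
n = 5000
Z = rng.normal(size=n)
U = rng.normal(size=n)
X = Z + U + 0.1 * rng.normal(size=n)
Y = np.sin(X) + U + 0.1 * rng.normal(size=n)

def features(x, K=10):
    """Fixed Fourier features, standing in for a flexible function class."""
    omegas = np.linspace(0.5, 5.0, K)
    return np.column_stack([np.sin(o * x) for o in omegas] +
                           [np.cos(o * x) for o in omegas])

# Stage 1: MLE of the conditional law of X given Z (here a Gaussian linear model).
a, b = np.polyfit(Z, X, 1)
sigma = np.std(X - (a * Z + b))

# Monte Carlo estimate of E[phi(X) | Z_i] under the fitted conditional law.
M = 200
X_draws = (a * Z + b)[:, None] + sigma * rng.normal(size=(n, M))
Phi_bar = np.mean([features(X_draws[:, m]) for m in range(M)], axis=0)

# Stage 2: Tikhonov-regularized projected least squares (ridge in this linear sieve).
lam = 1e-2  # regularization weight; the paper selects it via model selection
d = Phi_bar.shape[1]
theta = np.linalg.solve(Phi_bar.T @ Phi_bar / n + lam * np.eye(d), Phi_bar.T @ Y / n)

h_hat = lambda x: features(x) @ theta
print("estimate at x=1:", h_hat(np.array([1.0]))[0], "vs true sin(1) =", np.sin(1.0))
```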
Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models
Wu, Yuchen, Chen, Minshuo, Li, Zihao, Wang, Mengdi, Wei, Yuting
Diffusion models benefit from the instillation of task-specific information into the score function to steer sample generation towards desired properties. Such information is coined as guidance. For example, in text-to-image synthesis, text input is encoded as guidance to generate semantically aligned images. Proper guidance inputs are closely tied to the performance of diffusion models. A common observation is that strong guidance promotes a tight alignment to the task-specific information while reducing the diversity of the generated samples. In this paper, we provide the first theoretical study of the influence of guidance on diffusion models in the context of Gaussian mixture models. Under mild conditions, we prove that incorporating diffusion guidance not only boosts classification confidence but also diminishes distribution diversity, leading to a reduction in the differential entropy of the output distribution. Our analysis covers the widely adopted sampling schemes, including DDPM and DDIM, and leverages comparison inequalities for differential equations as well as the Fokker-Planck equation that characterizes the evolution of the probability density function, which may be of independent theoretical interest.
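As a toy numerical illustration of this diversity effect (a simplified stand-in, not the paper's DDPM/DDIM analysis), the sketch below uses Langevin dynamics with classifier guidance on a symmetric one-dimensional two-component Gaussian mixture; the component parameters, step size, and the choice of sampler are assumptions made only for illustration.

```python
# Classifier-guided Langevin sampling from a 1-D Gaussian mixture (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.0  # components at +mu and -mu, each with standard deviation sigma

def score_mixture(x):
    """Score of p(x) = 0.5 N(mu, sigma^2) + 0.5 N(-mu, sigma^2)."""
    w_pos = 1.0 / (1.0 + np.exp(-2.0 * mu * x / sigma**2))  # P(y = +1 | x)
    mean_post = w_pos * mu + (1.0 - w_pos) * (-mu)
    return (mean_post - x) / sigma**2

def score_classifier(x):
    """Score of P(y = +1 | x), i.e. the guidance term toward the '+' component."""
    w_pos = 1.0 / (1.0 + np.exp(-2.0 * mu * x / sigma**2))
    return (1.0 - w_pos) * 2.0 * mu / sigma**2

def sample(guidance, n=5000, steps=1000, eta=1e-2):
    x = rng.normal(size=n)
    for _ in range(steps):
        drift = score_mixture(x) + guidance * score_classifier(x)
        x = x + eta * drift + np.sqrt(2 * eta) * rng.normal(size=n)
    return x

for w in [0.0, 1.0, 4.0]:
    s = sample(w)
    print(f"guidance={w}: mean={s.mean():+.2f}, std={s.std():.2f}")
```

With guidance weight 0 the samples cover both modes; larger weights concentrate them on a single component and shrink the sample spread, qualitatively mirroring the entropy reduction established in the paper.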
Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning
Li, Zihao, Liu, Boyi, Yang, Zhuoran, Wang, Zhaoran, Wang, Mengdi
We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure subject to a convex constraint. Designing algorithms for a constrained convex MDP faces several challenges, including (1) handling the large state space, (2) managing the exploration/exploitation tradeoff, and (3) solving the constrained optimization problem in which the objective and the constraint are both nonlinear functions of the visitation measure. In this work, we present a model-based algorithm, Variational Primal-Dual Policy Optimization (VPDPO), in which Lagrangian and Fenchel duality are used to reformulate the original constrained problem as an unconstrained primal-dual optimization. The primal variables are updated by model-based value iteration following the principle of Optimism in the Face of Uncertainty (OFU), while the dual variables are updated by gradient ascent. Moreover, by embedding the visitation measure into a finite-dimensional space, we can handle large state spaces by incorporating function approximation. Two notable examples are (1) Kernelized Nonlinear Regulators and (2) Low-rank MDPs. We prove that, with an optimistic planning oracle, our algorithm achieves sublinear regret and constraint violation in both cases and can attain the globally optimal policy of the original constrained problem.
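Schematically, the two dualizations described above can be written as follows (notation simplified and not taken from the paper; f and g are the convex objective and constraint, with convex conjugates f^* and g^*). After both steps the objective is linear in the visitation measure, so the primal update reduces to planning with a scalarized reward.

```latex
% Schematic Lagrangian + Fenchel reformulation of the constrained convex MDP.
\begin{aligned}
\min_{\mu}\; f(\mu)\ \ \text{s.t.}\ \ g(\mu) \le 0
  &= \min_{\mu}\ \max_{\lambda \ge 0}\; f(\mu) + \lambda\, g(\mu) \\
  &= \min_{\mu}\ \max_{\lambda \ge 0}\ \max_{\theta_f,\,\theta_g}\;
     \big\langle \theta_f + \lambda\,\theta_g,\ \mu \big\rangle
     - f^{*}(\theta_f) - \lambda\, g^{*}(\theta_g).
\end{aligned}
```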
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
Chakraborty, Souradip, Qiu, Jiahao, Yuan, Hui, Koppel, Alec, Huang, Furong, Manocha, Dinesh, Bedi, Amrit Singh, Wang, Mengdi
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale language models (with Tulu2-7B) and show the efficacy of the proposed approach in the presence of diversity among human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. We remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.
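Schematically, the MaxMin objective can be written as below, where r_1, ..., r_K are the reward models recovered by the EM step for K preference subpopulations; the KL regularizer toward a reference policy is the usual RLHF term and is included here as an assumption, not a quote of the paper's notation.

```latex
% Schematic MaxMin alignment objective over K subpopulation reward models.
\max_{\pi}\ \min_{k \in [K]}\;
\mathbb{E}_{x \sim \mathcal{D}}\Big[\,
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r_k(x, y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\Big]
```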
TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients
Wang, Mengdi, Bodonhelyi, Anna, Bozkir, Efe, Kasneci, Enkelejda
Federated learning is a distributed collaborative machine learning paradigm that has gained strong momentum in recent years. In federated learning, a central server periodically coordinates models with clients and aggregates the models trained locally by clients without requiring access to local data. Despite its potential, the implementation of federated learning continues to encounter several challenges, predominantly slow convergence, which is largely due to data heterogeneity. Slow convergence becomes particularly problematic in cross-device federated learning scenarios where clients may be strongly limited by computing power and storage space, and hence counteracting methods that induce additional computation or memory cost on the client side, such as auxiliary objective terms and more local training iterations, can be impractical. In this paper, we propose a novel federated aggregation strategy, TurboSVM-FL, that poses no additional computation burden on the client side and can significantly accelerate convergence for federated classification tasks, especially when clients are "lazy" and train their models for only a few epochs before the next global aggregation. TurboSVM-FL extensively utilizes support vector machines to conduct selective aggregation and max-margin spread-out regularization on class embeddings. We evaluate TurboSVM-FL on multiple datasets, including FEMNIST, CelebA, and Shakespeare, using user-independent validation with non-iid data distributions. Our results show that TurboSVM-FL significantly outperforms existing popular algorithms in convergence rate and reduces communication rounds while delivering better test metrics, including accuracy, F1 score, and MCC.
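A rough server-side sketch of the selective-aggregation idea, assuming each client contributes per-class embedding vectors (e.g. the rows of its classification head); the pooling scheme, the use of scikit-learn's SVC, and the mean fallback are simplifications of my own, and the max-margin spread-out regularization is omitted.

```python
# Hedged sketch of SVM-based selective aggregation of class embeddings on the server.
import numpy as np
from sklearn.svm import SVC

def aggregate_class_embeddings(client_embeddings):
    """client_embeddings: list over clients of arrays with shape (n_classes, dim)."""
    n_classes, dim = client_embeddings[0].shape
    X = np.concatenate(client_embeddings, axis=0)               # all class vectors
    y = np.tile(np.arange(n_classes), len(client_embeddings))   # class label of each vector

    # Fit a linear SVM on the pooled class embeddings; its support vectors are the
    # client contributions that sit near the class decision boundaries.
    svm = SVC(kernel="linear", C=1.0).fit(X, y)

    aggregated = np.zeros((n_classes, dim))
    for c in range(n_classes):
        support_c = svm.support_[y[svm.support_] == c]          # support vectors of class c
        members = support_c if len(support_c) > 0 else np.where(y == c)[0]
        aggregated[c] = X[members].mean(axis=0)                 # selective aggregation
    return aggregated

# Toy usage: 5 clients, 3 classes, 16-dimensional class embeddings.
rng = np.random.default_rng(0)
clients = [rng.normal(size=(3, 16)) + np.arange(3)[:, None] for _ in range(5)]
print(aggregate_class_embeddings(clients).shape)   # (3, 16)
```

Only the class vectors selected as support vectors, i.e. those near the decision boundaries, contribute to each aggregated class embedding in this sketch.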
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Wei, Boyi, Huang, Kaixuan, Huang, Yangsibo, Xie, Tinghao, Qi, Xiangyu, Xia, Mengzhou, Mittal, Prateek, Wang, Mengdi, Henderson, Peter
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about 3% at the parameter level and 2.5% at the rank level.
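The region-isolation step can be pictured roughly as follows: score each parameter's importance on a safety dataset and on a utility dataset, then keep the parameters that are important for safety but not for utility. The scoring rule, the gamma-distributed toy scores, and the simple top-k set difference below are assumptions for illustration, not the authors' procedure.

```python
# Hedged sketch of isolating safety-critical, utility-disentangled parameters.
import numpy as np

def isolate_safety_region(safety_scores, utility_scores, top_frac=0.03):
    """Both inputs: 1-D arrays of per-parameter importance (e.g. |weight * grad|)."""
    k = int(top_frac * safety_scores.size)
    top_safety  = set(np.argsort(-safety_scores)[:k])
    top_utility = set(np.argsort(-utility_scores)[:k])
    return np.array(sorted(top_safety - top_utility))  # important for safety, not utility

rng = np.random.default_rng(0)
n_params = 100_000
safety, utility = rng.gamma(2.0, size=n_params), rng.gamma(2.0, size=n_params)
region = isolate_safety_region(safety, utility)
print(f"isolated {region.size / n_params:.2%} of parameters")
```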
Embedding Large Language Models into Extended Reality: Opportunities and Challenges for Inclusion, Engagement, and Privacy
Bozkir, Efe, Özdel, Süleyman, Lau, Ka Hei Carrie, Wang, Mengdi, Gao, Hong, Kasneci, Enkelejda
Recent developments in computer graphics, hardware, artificial intelligence (AI), and human-computer interaction are likely to make extended reality (XR) devices and setups more pervasive. While these devices and setups provide users with interactive, engaging, and immersive experiences through different sensing modalities, such as eye and hand trackers, many non-player characters are utilized in a pre-scripted way or through conventional AI techniques. In this paper, we argue for using large language models (LLMs) in XR by embedding them in virtual avatars or as narratives to facilitate more inclusive experiences through prompt engineering according to user profiles and fine-tuning the LLMs for particular purposes. We argue that such inclusion will foster diversity in XR use. In addition, we believe that the versatile conversational capabilities of LLMs will lead users to engage more with XR environments, which might help XR become more widely used in everyday life. Lastly, we speculate that combining the information users provide to LLM-powered environments with the biometric data obtained through the sensors might lead to novel privacy invasions. While studying such possible privacy invasions, user privacy concerns and preferences should also be investigated. In summary, despite some challenges, embedding LLMs into XR is a promising and novel research area with several opportunities.
Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules
Kim, Joseph C., Bloore, David, Kapoor, Karan, Feng, Jun, Hao, Ming-Hong, Wang, Mengdi
The Boltzmann distribution of a protein provides a roadmap to all of its functional states. Normalizing flows are a promising tool for modeling this distribution, but current methods become computationally intractable for typical pharmacological targets due to the size of the system, the heterogeneity of the intramolecular potential energy, and long-range interactions. To remedy these issues, we present a novel flow architecture that utilizes split channels and gated attention to efficiently learn the conformational distribution of proteins defined by internal coordinates. We show that by utilizing a 2-Wasserstein loss, one can smooth the transition from maximum likelihood training to energy-based training, enabling the training of Boltzmann Generators for macromolecules. We evaluate our model and training strategy on villin headpiece HP35(nle-nle), a 35-residue subdomain, and protein G, a 56-residue protein. We demonstrate that standard architectures and training strategies, such as maximum likelihood alone, fail, while our novel architecture and multi-stage training strategy are able to model the conformational distributions of protein G and HP35.

The structural ensemble of a protein determines its functions. The probabilities of the ground and metastable states of a protein at equilibrium for a given temperature determine the interactions of the protein with other proteins, effectors, and drugs, which are key for pharmaceutical development. However, enumeration of the equilibrium conformations and their probabilities is infeasible. Since complete knowledge is inaccessible, we must adopt a sampling approach. Conventional approaches to sampling the equilibrium ensemble rely on Markov-chain Monte Carlo or molecular dynamics (MD). These approaches explore the local energy landscape adjacent to a starting point; however, they are limited by their inability to penetrate high energy barriers. In addition, MD simulations are expensive and scale poorly with system size.
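For reference, the basic ingredient of the 2-Wasserstein term has a closed form in one dimension (a sorted-sample comparison), sketched below; how the paper combines it with maximum likelihood and energy-based objectives across training stages is more involved, so treat this only as a minimal building block.

```python
# Empirical 2-Wasserstein distance between two equal-size 1-D samples:
# sort both samples and compare order statistics.
import numpy as np

def wasserstein2_1d(x, y):
    """W2 between the empirical distributions of 1-D samples x and y (equal length)."""
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x_sorted - y_sorted) ** 2))

rng = np.random.default_rng(0)
model_samples = rng.normal(0.0, 1.5, size=10_000)   # e.g. samples from the flow
data_samples  = rng.normal(0.5, 1.0, size=10_000)   # e.g. MD training data
print(f"W2 = {wasserstein2_1d(model_samples, data_samples):.3f}")
```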
Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization
Qiu, Jiahao, Yuan, Hui, Zhang, Jinghong, Chen, Wentao, Wang, Huazheng, Wang, Mengdi
Advances in biotechnology have demonstrated humans' unprecedented capabilities to engineer proteins. They make it possible to directly design the amino acid sequences that encode proteins for desired functions, towards improving biochemical or enzymatic properties such as stability, binding affinity, or catalytic activity. Directed evolution (DE), for example, is a method for exploring new protein designs with properties of interest and maximal utility by mimicking the natural evolution process. The development of DE was honored in 2018 with the awarding of the Nobel Prize in Chemistry to Frances Arnold for the directed evolution of enzymes, and to George Smith and Gregory Winter for the development of phage display [3, 41, 48].

Even with the best and largest pre-trained protein language models, such as ESM-1b [33] and ProGen2 [29], one often needs to explore an almost unknown domain and learn a new function map in order to discover new drugs. This is especially true in antibody engineering. Antibodies have highly diverse complementarity-determining region (CDR) sequences that can be altered, resulting in a huge sequence space to explore for optimal properties. The binding of antibodies to their targets is an extrinsic property, and it is difficult to accurately model the sequence-binding relationship from sequences alone. Further, most of the exploration strategies used in practice lack theoretical guarantees.
Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning?
Zhao, Lei, Wang, Mengdi, Bai, Yu
Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an \emph{expert policy} -- plays a critical role in developing intelligent systems, such as those that understand and imitate human behavior. While widely used in applications, the theoretical understanding of IRL presents unique challenges and remains less developed than standard RL theory. For example, it remains open how to perform IRL efficiently in standard \emph{offline} settings with pre-collected data, where states are obtained from a \emph{behavior policy} (which could be the expert policy itself) and actions are sampled from the expert policy. This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime. We first design a new IRL algorithm for the offline setting, Reward Learning with Pessimism (RLP), and show that it achieves polynomial sample complexity in terms of the size of the MDP, a concentrability coefficient between the behavior policy and the expert policy, and the desired accuracy. Building on RLP, we further design an algorithm, Reward Learning with Exploration (RLE), which operates in a natural online setting where the learner can both actively explore the environment and query the expert policy, and we show that it obtains a stronger notion of IRL guarantee from polynomial samples. We establish sample complexity lower bounds for both settings, showing that RLP and RLE are nearly optimal. Finally, as an application, we show that the learned reward functions can \emph{transfer} to another target MDP with suitable guarantees when the target MDP satisfies certain similarity assumptions with the original (source) MDP.
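For context, a common formalization in this line of work (paraphrased; not necessarily the paper's exact notation) identifies the IRL target with the set of rewards under which the expert policy is optimal, i.e. its advantage function is nonpositive everywhere:

```latex
% Feasible reward set for an expert policy \pi^E (schematic).
\mathcal{R}_{\mathrm{feasible}}
  = \big\{\, r \;:\; A_r^{\pi^E}(s,a) \le 0 \ \ \text{for all } (s,a) \,\big\},
\qquad
A_r^{\pi^E}(s,a) := Q_r^{\pi^E}(s,a) - V_r^{\pi^E}(s).
```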