Cisneros-Velarde, Pedro
Optimization for Neural Operators can Benefit from Width
Cisneros-Velarde, Pedro, Shrimali, Bhavesh, Banerjee, Arindam
Neural Operators that directly learn mappings between function spaces, such as Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs), have received considerable attention. Despite the universal approximation guarantees for DONs and FNOs, there is currently no optimization convergence guarantee for learning such networks using gradient descent (GD). In this paper, we address this open problem by presenting a unified framework for optimization based on GD and applying it to establish convergence guarantees for both DONs and FNOs. In particular, we show that the losses associated with both of these neural operators satisfy two conditions -- restricted strong convexity (RSC) and smoothness -- that guarantee a decrease in their loss values under GD. Remarkably, these two conditions are satisfied by each neural operator for different reasons tied to the architectural differences of the respective models. One takeaway that emerges from the theory is that wider networks should lead to better optimization convergence for both DONs and FNOs. We present empirical results on canonical operator learning problems to support our theoretical results.
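As a schematic of the kind of guarantee involved (generic constants; the paper's precise RSC and smoothness conditions for DONs and FNOs differ), $\beta$-smoothness gives the standard descent lemma for a GD step with step size $\eta \le 1/\beta$, and an RSC-type lower bound on the gradient norm turns it into geometric convergence:
\[
\mathcal{L}(\theta^{+}) \le \mathcal{L}(\theta) - \tfrac{\eta}{2}\,\|\nabla \mathcal{L}(\theta)\|^{2},
\qquad
\|\nabla \mathcal{L}(\theta)\|^{2} \ge 2\alpha\,\bigl(\mathcal{L}(\theta) - \mathcal{L}^{*}\bigr)
\;\Longrightarrow\;
\mathcal{L}(\theta^{+}) - \mathcal{L}^{*} \le (1 - \eta\alpha)\,\bigl(\mathcal{L}(\theta) - \mathcal{L}^{*}\bigr),
\]
where $\theta^{+} = \theta - \eta \nabla \mathcal{L}(\theta)$ and $\mathcal{L}^{*}$ is the minimal loss value.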
Large Language Models can Achieve Social Balance
Cisneros-Velarde, Pedro
Social balance is a concept in sociology which states that if every triad of individuals in a population achieves certain structures of positive or negative interactions, then the whole population ends up in one faction of positive interactions or divided between two or more antagonistic factions. In this paper, we consider a group of interacting large language models (LLMs) and study how, after continuous interactions, they can achieve social balance. Across three different LLM models, we found that social balance depends on (i) whether interactions are updated based on "relationships", "appraisals", or "opinions"; (ii) whether agents update their interactions based on homophily or influence from their peers; and (iii) the number of simultaneous interactions the LLMs consider. When social balance is achieved, its particular structure of positive or negative interactions depends on these three conditions and differs across LLM models and sizes. The stability of interactions and the justification for their update also vary across models. Thus, social balance is driven by the pre-training and alignment particular to each LLM model.
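As a minimal illustration of the classical notion (not the paper's LLM protocol): a signed interaction matrix is structurally balanced when every triad has a positive sign product, so the whole population splits into at most two internally positive factions. The example matrix below is purely hypothetical.

import numpy as np

def is_socially_balanced(S):
    """Check classical structural balance of a signed interaction matrix.

    S[i, j] in {+1, -1} encodes a positive or negative interaction between
    agents i and j. A triad (i, j, k) is balanced when the product of its
    three signs is positive; the population is balanced when every triad is.
    """
    n = S.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if S[i, j] * S[j, k] * S[i, k] < 0:
                    return False
    return True

# Hypothetical example: two antagonistic factions {0, 1} and {2, 3}.
S = np.array([[ 1,  1, -1, -1],
              [ 1,  1, -1, -1],
              [-1, -1,  1,  1],
              [-1, -1,  1,  1]])
print(is_socially_balanced(S))  # True: every triad has a positive sign product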
On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models
Cisneros-Velarde, Pedro
We study the evolution of opinions inside a population of interacting large language models (LLMs). Each LLM must decide how much funding to allocate to an item, with three initial possibilities: full, partial, or no funding. We identify biases that drive the exchange of opinions based on the LLM's tendency to (i) find consensus with the other LLM's opinion, (ii) display caution when specifying funding, and (iii) consider ethical concerns in its opinion. We find that these biases are affected by the perceived absence of compelling reasons for opinion change, the perceived willingness to engage in discussion, and the distribution of allocation values. Moreover, tensions among biases can lead to the survival of funding for items with negative connotations. We also find that the final distribution of full, partial, and no funding opinions is more diverse when an LLM freely forms its opinion after an interaction than when its opinion is a multiple-choice selection among the three allocation options. In the latter case, consensus or polarization is generally attained. When agents are aware of past opinions, they seek to maintain consistency with them, and more diverse updating rules emerge. Our study is performed using a Llama 3 LLM.
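A heavily simplified sketch of the interaction loop described above; query_llm, the prompt text, and the constrained/free-form distinction are hypothetical stand-ins, not the paper's actual prompts or protocol.

# Hypothetical sketch of the opinion-exchange loop; query_llm, the prompts,
# and the parsing are stand-ins, not the paper's actual protocol.
import random

OPTIONS = ["full", "partial", "no"]  # funding allocation opinions

def query_llm(prompt):
    # Placeholder for a call to a Llama 3 model (e.g., via a local inference API).
    raise NotImplementedError

def interact(opinions, multiple_choice=True, rounds=10):
    """Pairwise interactions; each round, one agent revises its funding opinion."""
    for _ in range(rounds):
        i, j = random.sample(range(len(opinions)), 2)
        prompt = (
            f"Your current opinion on funding the item: {opinions[i]}. "
            f"Your peer's opinion: {opinions[j]}. "
        )
        if multiple_choice:
            # Constrained update: pick one of the three allocation options.
            prompt += "Reply with exactly one of: full, partial, no."
        else:
            # Free-form update: the agent states its revised opinion in prose.
            prompt += "State your revised opinion on the funding allocation."
        opinions[i] = query_llm(prompt)  # parsing of the reply omitted
    return opinions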
One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning
Cisneros-Velarde, Pedro, Lyu, Boxiang, Koyejo, Sanmi, Kolar, Mladen
Although parallelism has been extensively used in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases over the fully sequential counterpart. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.
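A minimal sketch of the scheme (the environment interface and names are assumptions, not from the paper): every parallel agent rolls out the same exploration policy, and the collected transitions are pooled into one dataset.

# Sketch of single-policy parallel exploration (interface hypothetical).
from concurrent.futures import ThreadPoolExecutor

def rollout(env, policy, horizon):
    """Collect one trajectory of transitions by following `policy`."""
    trajectory, state = [], env.reset()
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

def parallel_explore(envs, policy, horizon):
    """All agents share ONE exploration policy; their data is pooled.

    The paper's point: this simple scheme already yields an almost-linear
    speedup, without coordinating agents to explore diverse policies.
    """
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        trajectories = pool.map(lambda env: rollout(env, policy, horizon), envs)
    return [t for traj in trajectories for t in traj]  # pooled dataset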
Finite-sample Guarantees for Nash Q-learning with Linear Function Approximation
Cisneros-Velarde, Pedro, Koyejo, Sanmi
Nash Q-learning is arguably one of the first and best-known algorithms in multi-agent reinforcement learning (MARL) for learning policies that constitute a Nash equilibrium of an underlying general-sum Markov game. Its original analysis provided asymptotic guarantees and was for the tabular case. Recently, finite-sample guarantees have been provided using more modern RL techniques for the tabular case. Our work analyzes Nash Q-learning using linear function approximation -- a representation regime adopted when the state space is large or continuous -- and provides finite-sample guarantees that indicate its sample efficiency. We find that the obtained performance nearly matches an existing efficient result for single-agent RL under the same representation and has a polynomial gap when compared to the best-known result for the tabular case.
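For reference, the classical tabular Nash Q-learning update (schematic, two-player case); the paper's regime replaces the table with linear features, $Q^{i}(s,a,b) \approx \phi(s,a,b)^{\top}\theta^{i}$:
\[
Q^{i}_{t+1}(s, a, b) = (1 - \alpha_t)\, Q^{i}_{t}(s, a, b) + \alpha_t \Bigl[ r^{i}_t + \gamma\, \mathrm{Nash}^{i}\bigl(Q^{1}_{t}(s', \cdot, \cdot),\, Q^{2}_{t}(s', \cdot, \cdot)\bigr) \Bigr],
\]
where $\alpha_t$ is the learning rate, $\gamma$ the discount factor, and $\mathrm{Nash}^{i}$ denotes agent $i$'s payoff at a Nash equilibrium of the stage game formed by the Q-values at the next state $s'$.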
Restricted Strong Convexity of Deep Learning Models with Smooth Activations
Banerjee, Arindam, Cisneros-Velarde, Pedro, Zhu, Libin, Belkin, Mikhail
We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the ``near initialization'' perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish a $O(\frac{\text{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC based analysis does not need the ``near initialization'' perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models, thus becoming an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.
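For concreteness, one schematic way to state RSC over a restricted set $Q$ of parameter pairs (the paper's exact restricted set and constants differ; here $\mathcal{L}$ denotes the loss, to avoid clashing with the layer count $L$):
\[
\mathcal{L}(\theta') \ge \mathcal{L}(\theta) + \langle \nabla \mathcal{L}(\theta),\, \theta' - \theta \rangle + \frac{\alpha}{2}\, \|\theta' - \theta\|_2^{2} \qquad \text{for all } (\theta, \theta') \in Q,
\]
i.e., strong convexity is required only over the restricted set relevant to the GD iterates, which, together with the smoothness (Hessian spectral-norm) bound, yields geometric convergence.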
A Contraction Theory Approach to Optimization Algorithms from Acceleration Flows
Cisneros-Velarde, Pedro, Bullo, Francesco
There has been recent interest in studying systems of ODEs that solve an optimization problem -- also known as optimization flows -- with the understanding that their study can lead to the analysis and design of discrete-time solvers of optimization problems -- also known as optimization algorithms. This interest is motivated by the fact that analyzing a system of ODEs can be much simpler than analyzing a discrete system. Indeed, the ambitious goal of this research area is to find a "general theory mapping properties of ODEs into corresponding properties for discrete updates" -- as quoted from the seminal work [20]. Our paper aims to provide a solution to this problem. Ideally, the desired pipeline is to first design an optimization flow -- using all the machinery of dynamical systems analysis -- with good stability and convergence properties, and then formulate a principled way of guaranteeing that these good properties translate to the associated optimization algorithm through discretization. A first problem in the literature is that the analysis of an optimization algorithm is commonly done separately or independently from the analysis of its associated optimization flow (e.g., see [24, 25, 17, 26, 18, 12]), instead of the former analysis following directly as a consequence of the latter. For example, separate Lyapunov analyses have been made for optimization flows and their associated algorithms. This gap undermines one of the very first motivations for analyzing a system of ODEs, namely, that its analysis should directly establish properties of its associated discretization.
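As a schematic of the kind of property transfer sought here (Euclidean norm, generic constants; not the paper's theorem): if the flow $\dot{x} = f(x)$ is contracting with rate $c$ and $f$ is $\ell$-Lipschitz, the forward-Euler map inherits contractivity for small enough step sizes:
\[
\langle f(x) - f(y),\, x - y \rangle \le -c\,\|x - y\|^{2}
\;\Longrightarrow\;
\|x^{+} - y^{+}\|^{2} \le \bigl(1 - 2ch + h^{2}\ell^{2}\bigr)\,\|x - y\|^{2},
\]
where $x^{+} = x + h f(x)$; the right-hand factor is below $1$ whenever $0 < h < 2c/\ell^{2}$, so the discrete update is a contraction.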
Distributionally Robust Formulation and Model Selection for the Graphical Lasso
Cisneros-Velarde, Pedro, Oh, Sang-Yun, Petersen, Alexander
Building on a recent framework for distributionally robust optimization in machine learning, we develop a similar framework for estimation of the inverse covariance matrix for multivariate data. We provide a novel notion of a Wasserstein ambiguity set specifically tailored to this estimation problem, from which we obtain a representation for a tractable class of regularized estimators. Special cases include penalized likelihood estimators for Gaussian data, specifically the graphical lasso estimator. As a consequence of this formulation, a natural relationship arises between the radius of the Wasserstein ambiguity set and the regularization parameter in the estimation problem. Using this relationship, one can directly control the level of robustness of the estimation procedure by specifying a desired level of confidence with which the ambiguity set contains a distribution with the true population covariance. Furthermore, a unique feature of our formulation is that the radius can be expressed in closed-form as a function of the ordinary sample covariance matrix. Taking advantage of this finding, we develop a simple algorithm to determine a regularization parameter for graphical lasso, using only the bootstrapped sample covariance matrices, meaning that computationally expensive repeated evaluation of the graphical lasso algorithm is not necessary. Alternatively, the distributionally robust formulation can also quantify the robustness of the corresponding estimator if one uses an off-the-shelf method such as cross-validation. Finally, we numerically study the obtained regularization criterion and analyze the robustness of other automated tuning procedures used in practice.
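A sketch of how such a criterion could be wired up: the map from bootstrapped sample covariances to the penalty, radius_to_lambda, is a hypothetical stub standing in for the paper's closed-form criterion; sklearn's GraphicalLasso is then fit once with the selected penalty, with no repeated graphical lasso evaluations during tuning.

# Sketch: choosing the graphical lasso penalty from bootstrapped sample
# covariances alone. `radius_to_lambda` is a hypothetical placeholder for
# the paper's closed-form criterion, not the actual formula.
import numpy as np
from sklearn.covariance import GraphicalLasso

def bootstrap_covariances(X, n_boot=200, seed=0):
    """Sample covariance matrices of bootstrap resamples of the data X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    covs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        covs.append(np.cov(X[idx], rowvar=False))
    return covs

def radius_to_lambda(covs, confidence=0.95):
    # Placeholder for the paper's closed-form criterion mapping the
    # Wasserstein-ambiguity-set radius (at the desired confidence level)
    # to the regularization parameter.
    raise NotImplementedError

# X = ...  # (n_samples, n_features) data matrix
# lam = radius_to_lambda(bootstrap_covariances(X))
# model = GraphicalLasso(alpha=lam).fit(X)  # single fit with the chosen penalty
# precision = model.precision_              # estimated inverse covariance matrix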