Closed-Form Last Layer Optimization

Galashov, Alexandre, Da Costa, Nathaël, Xu, Liyuan, Hennig, Philipp, Gretton, Arthur

arXiv.org Machine Learning

Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal linear last-layer weights are known in closed form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters and optimizing solely for those parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method to the setting of stochastic gradient descent by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks -- both regression and classification -- including Fourier Neural Operators and Instrumental Variable Regression.
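A minimal sketch of the closed-form step the abstract refers to: under a squared loss, the last-layer weights solve a linear least-squares problem in the backbone features. The ridge term and all names here are our assumptions for numerical stability, not details from the paper.

```python
import numpy as np

def closed_form_last_layer(Phi, Y, reg=1e-6):
    """Ridge-regularized least squares: argmin_W ||Phi @ W - Y||^2 + reg * ||W||^2."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ Y)

# Toy check: features from a fixed "backbone", noiseless linear targets.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))     # backbone features phi(x)
W_true = rng.normal(size=(5, 2))
Y = Phi @ W_true
W = closed_form_last_layer(Phi, Y)  # recovers W_true up to the tiny ridge bias
```

In the alternating scheme the abstract describes, a solve like this would replace the gradient step on the last layer after each backbone update.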



An Accelerated Bregman Algorithm for ReLU-based Symmetric Matrix Decomposition

Wang, Qingsong

arXiv.org Artificial Intelligence

Symmetric matrix decomposition is an active research area in machine learning. This paper focuses on exploiting the low-rank structure of non-negative and sparse symmetric matrices via the rectified linear unit (ReLU) activation function. We propose the ReLU-based nonlinear symmetric matrix decomposition (ReLU-NSMD) model, introduce an accelerated alternating partial Bregman (AAPB) method for its solution, and present the algorithm's convergence results. Our algorithm leverages the Bregman proximal gradient framework to overcome the challenge of estimating the global $L$-smooth constant in the classic proximal gradient algorithm. Numerical experiments on synthetic and real datasets validate the effectiveness of our model and algorithm.
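To make the model concrete, a hedged sketch of the ReLU-NSMD objective (our notation; the paper's exact formulation may include additional regularization): a nonnegative, sparse, symmetric target $A$ is approximated by $\mathrm{ReLU}(WW^\top)$ for a low-rank factor $W$.

```python
import numpy as np

def relu(X):
    return np.maximum(X, 0.0)

def nsmd_objective(A, W):
    """F(W) = 0.5 * ||A - ReLU(W @ W.T)||_F^2 for a symmetric nonnegative target A."""
    return 0.5 * np.linalg.norm(A - relu(W @ W.T)) ** 2

# A rank-3 factor yields a nonnegative, sparse, symmetric matrix under ReLU,
# and the objective vanishes at the ground-truth factor.
rng = np.random.default_rng(1)
W_true = rng.normal(size=(20, 3))
A = relu(W_true @ W_true.T)
```

The nonsmoothness of the ReLU inside the Frobenius norm is what rules out a global $L$-smooth constant and motivates the Bregman proximal gradient framework.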


An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition

Wang, Qingsong, Qu, Yunfei, Cui, Chunfeng, Han, Deren

arXiv.org Artificial Intelligence

Despite the remarkable success of low-rank estimation in data mining, its effectiveness diminishes when applied to data that inherently lacks low-rank structure. To address this limitation, in this paper, we focus on non-negative sparse matrices and aim to investigate the intrinsic low-rank characteristics of the rectified linear unit (ReLU) activation function. We first propose a novel nonlinear matrix decomposition framework incorporating a comprehensive regularization term designed to simultaneously promote useful structures in clustering and compression tasks, such as low-rankness, sparsity, and non-negativity in the resulting factors. This formulation presents significant computational challenges due to its multi-block structure, non-convexity, non-smoothness, and the absence of global gradient Lipschitz continuity. To address these challenges, we develop an accelerated alternating partial Bregman proximal gradient method (AAPB), whose distinctive feature lies in its capability to enable simultaneous updates of multiple variables. Under mild and theoretically justified assumptions, we establish both sublinear and global convergence properties of the proposed algorithm. Through careful selection of kernel generating distances tailored to various regularization terms, we derive corresponding closed-form solutions while ensuring that the $L$-smooth adaptable property holds for any $L\ge 1$. Numerical experiments on graph-regularized clustering and sparse NMF basis compression confirm the effectiveness of our model and algorithm.
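One ingredient behind such closed-form updates is that common regularizers admit explicit proximal maps. As a generic sketch (not the paper's exact operator), the prox of an $\ell_1$ penalty combined with a nonnegativity constraint is a shifted ReLU:

```python
import numpy as np

def prox_nonneg_l1(V, lam, step):
    """Closed-form prox of step * (lam * ||.||_1 + indicator{V >= 0}): a shifted ReLU."""
    return np.maximum(V - step * lam, 0.0)

# Entries below the threshold are zeroed (sparsity); negatives are clipped (nonnegativity).
V = np.array([2.0, -1.0, 0.5])
```

Choosing a kernel generating distance adapted to each block is what keeps updates like this closed-form inside the Bregman framework.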


Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flows

Gu, Hyemin, Katsoulakis, Markos A., Rey-Bellet, Luc, Zhang, Benjamin J.

arXiv.org Machine Learning

We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by adding an optimal transport cost, i.e., a kinetic energy penalization. Via mean-field game theory, we show that the combination of the two proximals is critical for formulating well-posed generative flows. Generative flows can be analyzed through optimality conditions of a mean-field game (MFG): a system of partial differential equations (PDEs) coupling a backward Hamilton-Jacobi (HJ) equation with a forward continuity equation, whose solution characterizes the optimal generative flow. For learning distributions that are supported on low-dimensional manifolds, the MFG theory shows that the Wasserstein-1 proximal, which addresses the HJ terminal condition, and the Wasserstein-2 proximal, which addresses the HJ dynamics, are both necessary for the corresponding backward-forward PDE system to be well-defined and have a unique solution with provably linear flow trajectories. This implies that the corresponding generative flow is also unique and can therefore be learned in a robust manner, even for high-dimensional distributions supported on low-dimensional manifolds. The generative flows are learned through adversarial training of continuous-time flows, which bypasses the need for reverse simulation. We demonstrate the efficacy of our approach for generating high-dimensional images without the need to resort to autoencoders or specialized architectures.
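As a hedged sketch of how the two proximals enter (our simplified notation, not necessarily the paper's): the Wasserstein-1 proximal is an inf-convolution that lets the $f$-divergence compare singular distributions,

$$ D_f^{W_1}(P \,\|\, Q) \;=\; \inf_{R}\; \big\{ D_f(R \,\|\, Q) + \lambda\, W_1(P, R) \big\}, $$

while the Wasserstein-2 proximal penalizes the kinetic energy of the flow $(\rho_t, v_t)$ satisfying the continuity equation $\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0$:

$$ \inf_{\rho, v}\; D_f^{W_1}(\rho_1 \,\|\, Q) \;+\; \frac{1}{2} \int_0^1 \!\!\int \|v_t(x)\|^2 \, \rho_t(dx)\, dt. $$

The optimality conditions of this variational problem are precisely the backward HJ / forward continuity system described above.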


Learning heavy-tailed distributions with Wasserstein-proximal-regularized $\alpha$-divergences

Chen, Ziyu, Gu, Hyemin, Katsoulakis, Markos A., Rey-Bellet, Luc, Zhu, Wei

arXiv.org Machine Learning

Heavy tails are ubiquitous, emerging in various fields such as extreme events in ocean waves [9], floods [21], social sciences [27, 16], human activities [17, 35], biology [18] and computer sciences [29]. Learning to generate heavy-tailed target distributions has been explored using GANs through tail estimation [10, 15, 1]. While estimating the tail behavior of a heavy-tailed distribution is important, selecting objectives that measure discrepancies between these distributions and facilitate stable learning is equally crucial. In generative modeling, the goal is to generate samples that mimic those from an underlying data distribution, typically by designing algorithms that minimize a probability divergence between the generated and target distributions. Thus, it is crucial to choose a divergence that flexibly and accurately respects the behavior of the data distribution.


LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages

Coleman, Jared, Krishnamachari, Bhaskar, Iskarous, Khalil, Rosales, Ruben

arXiv.org Artificial Intelligence

We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.


Learning to Control an Unstable System with Forward Modeling

Neural Information Processing Systems

The forward modeling approach is a methodology for learning control when data is available in distal coordinate systems. We extend previous work by considering how this methodology can be applied to the optimization of quantities that are distal not only in space but also in time. In many learning control problems, the output variables of the controller are not the natural coordinates in which to specify tasks and evaluate performance. Tasks are generally more naturally specified in "distal" coordinate systems (e.g., endpoint coordinates for manipulator motion) than in the "proximal" coordinate system of the controller (e.g., joint angles or torques). Furthermore, the relationship between proximal coordinates and distal coordinates is often not known a priori and, if known, not easily inverted.
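A toy NumPy sketch of the two-stage idea (a hypothetical linear plant, not the paper's networks): first fit a forward model from proximal commands to distal outcomes, then optimize a command for a distal target by descending the error through the frozen model, with no need to invert the plant.

```python
import numpy as np

# Hypothetical plant mapping proximal commands u to distal outcomes y.
def plant(u):
    return np.array([u[0] + 0.5 * u[1], 0.5 * u[0] - u[1]])

# Stage 1: fit a linear forward model y ~ u @ M from observed command/outcome pairs.
rng = np.random.default_rng(2)
U = rng.normal(size=(50, 2))
Y = np.array([plant(u) for u in U])
M, *_ = np.linalg.lstsq(U, Y, rcond=None)

# Stage 2: optimize the proximal command for a distal target by gradient
# descent on the distal error, backpropagated through the (frozen) model.
target = np.array([1.0, 0.0])
u = np.zeros(2)
for _ in range(200):
    err = u @ M - target      # error in distal coordinates
    u -= 0.1 * (err @ M.T)    # chain rule through the forward model
```

The controller never needs the plant's inverse; the forward model supplies the gradient from distal error back to proximal commands.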