regularization problem
Reduced-Space Iteratively Reweighted Second-Order Methods for Nonconvex Sparse Regularization
Wang, Hao, Yang, Xiangyu, Zhu, Yichen
This paper explores a specific type of nonconvex sparsity-promoting regularization problems, namely those involving $\ell_p$-norm regularization, in conjunction with a twice continuously differentiable loss function. We propose a novel second-order algorithm designed to effectively address this class of challenging nonconvex and nonsmooth problems, showcasing several innovative features: (i) The use of an alternating strategy to solve a reweighted $\ell_1$ regularized subproblem and the subspace approximate Newton step. (ii) The reweighted $\ell_1$ regularized subproblem relies on a convex approximation to the nonconvex regularization term, enabling a closed-form solution characterized by the soft-thresholding operator. This feature allows our method to be applied to various nonconvex regularization problems. (iii) Our algorithm ensures that the iterates maintain their sign values and that nonzero components are kept away from 0 for a sufficient number of iterations, eventually transitioning to a perturbed Newton method. (iv) We provide theoretical guarantees of global convergence, local superlinear convergence in the presence of the Kurdyka-\L ojasiewicz (KL) property, and local quadratic convergence when employing the exact Newton step in our algorithm. We also showcase the effectiveness of our approach through experiments on a diverse set of model prediction problems.
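The closed-form soft-thresholding step mentioned in (ii) can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a plain proximal-gradient step on a reweighted $\ell_1$ model, and the weight formula $w_i = p(|x_i| + \epsilon)^{p-1}$, the function names, and the toy least-squares loss are all assumptions chosen for illustration.

```python
import math


def soft_threshold(z, tau):
    """Return sign(z) * max(|z| - tau, 0), the soft-thresholding operator."""
    return math.copysign(max(abs(z) - tau, 0.0), z)


def reweighted_l1_step(x, grad, lam, p, step, eps):
    """One proximal-gradient step on a reweighted l1 model of lam * sum |x_i|^p.

    Assumed weights: w_i = p * (|x_i| + eps)^(p - 1), with eps > 0 so that
    zero components do not produce an infinite weight.
    """
    new_x = []
    for xi, gi in zip(x, grad):
        w = p * (abs(xi) + eps) ** (p - 1.0)
        new_x.append(soft_threshold(xi - step * gi, step * lam * w))
    return new_x


# Hypothetical toy problem: loss 0.5 * ||x - b||^2, whose gradient is x - b.
b = [3.0, 0.05, -2.0]
x = [1.0, 1.0, -1.0]
for _ in range(50):
    grad = [xi - bi for xi, bi in zip(x, b)]
    x = reweighted_l1_step(x, grad, lam=0.1, p=0.5, step=1.0, eps=1e-3)
```

In this toy run the small middle component is driven to zero while the two larger components are only slightly shrunk, which is the sparsity-promoting behavior the reweighting is meant to produce.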
The representer theorem for Hilbert spaces: a necessary and sufficient condition
The representer theorem is a property that lies at the foundation of regularization theory and kernel methods. A class of regularization functionals is said to admit a linear representer theorem if every member of the class admits minimizers that lie in the finite-dimensional subspace spanned by the representers of the data. A recent characterization states that certain classes of regularization functionals with a differentiable regularization term admit a linear representer theorem for any choice of the data if and only if the regularization term is a radial nondecreasing function. In this paper, we extend this result by weakening the assumptions on the regularization term. In particular, the main result of this paper implies that, for a sufficiently large family of regularization functionals, radial nondecreasing functions are the only lower semicontinuous regularization terms that guarantee the existence of a representer theorem for any choice of the data.
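As a sketch of the statement in symbols (the notation $J$, $f$, $\Omega$, $h$, $w_i$, $c_i$, and $m$ is assumed here and does not come from the abstract), consider a regularization functional on a Hilbert space $\mathcal{H}$ that depends on $w$ only through its inner products with the data representers $w_1, \dots, w_m$ and through a regularization term $\Omega$:

$$J(w) = f\big(\langle w, w_1 \rangle, \dots, \langle w, w_m \rangle\big) + \Omega(w), \qquad w \in \mathcal{H}.$$

A linear representer theorem holds when some minimizer has the form

$$w^\star = \sum_{i=1}^{m} c_i\, w_i, \qquad c_i \in \mathbb{R},$$

and the characterization described above says that this holds for every choice of data exactly when the regularization term is radial and nondecreasing, that is,

$$\Omega(w) = h(\|w\|) \quad \text{with } h \colon [0, \infty) \to \mathbb{R} \cup \{+\infty\} \text{ nondecreasing}.$$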
A New Convex Relaxation for Tensor Completion
We study the problem of learning a tensor from a set of linear measurements. A prominent methodology for this problem is based on a generalization of trace norm regularization, which has been used extensively for learning low rank matrices, to the tensor setting. In this paper, we highlight some limitations of this approach and propose an alternative convex relaxation on the Euclidean ball. We then describe a technique to solve the associated regularization problem, which builds upon the alternating direction method of multipliers. Experiments on one synthetic dataset and two real datasets indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable.
Hypothesis Spaces for Deep Learning
Wang, Rui, Xu, Yuesheng, Yan, Mingsong
Deep learning has been a huge success in applications. Mathematically, its success is due to the use of deep neural networks (DNNs), neural networks of multiple layers, to describe decision functions. Various mathematical aspects of DNNs as an approximation tool were investigated recently in a number of studies [9, 11, 13, 16, 20, 27, 28, 31]. As pointed out in [8], learning processes do not take place in a vacuum. Classical learning methods took place in a reproducing kernel Hilbert space (RKHS) [1], which leads to representation of learning solutions in terms of a combination of a finite number of kernel sessions [19] of a universal kernel [17]. Reproducing kernel Hilbert spaces as appropriate hypothesis spaces for classical learning methods provide a foundation for mathematical analysis of the learning methods. A natural and imperative question is what are appropriate hypothesis spaces for deep learning. Although hypothesis spaces for learning with shallow neural networks (networks of one hidden layer) were investigated recently in a number of studies, (e.g.
Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces
Wang, Rui, Xu, Yuesheng, Yan, Mingsong
Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity for learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set by a linear combination of the extreme points of the subdifferential set of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS that can transform the explicit representation of the solutions to a sparse kernel representation having fewer terms than the number of the observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter on sparsity of the regularized solutions. We further show that two specific RKBSs, the sequence space $\ell_1(\mathbb{N})$ and the measure space, can have sparse representer theorems for both MNI and regularization models.
On Generalization and Regularization via Wasserstein Distributionally Robust Optimization
Wu, Qinyu, Li, Jonathan Yu-Meng, Mao, Tiantian
Wasserstein distributionally robust optimization (DRO) has found success in operations research and machine learning applications as a powerful means to obtain solutions with favourable out-of-sample performances. Two compelling explanations for the success are the generalization bounds derived from Wasserstein DRO and the equivalency between Wasserstein DRO and the regularization scheme commonly applied in machine learning. Existing results on generalization bounds and the equivalency to regularization are largely limited to the setting where the Wasserstein ball is of a certain type and the decision criterion takes certain forms of an expected function. In this paper, we show that by focusing on Wasserstein DRO problems with affine decision rules, it is possible to obtain generalization bounds and the equivalency to regularization in a significantly broader setting where the Wasserstein ball can be of a general type and the decision criterion can be a general measure of risk, i.e., nonlinear in distributions. This allows for accommodating many important classification, regression, and risk minimization applications that have not been addressed to date using Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a byproduct, our regularization results broaden considerably the class of Wasserstein DRO models that can be solved efficiently via regularization formulations.
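To make the "equivalency to regularization" concrete, here is a commonly cited special case from the earlier literature, not the general result of this paper. The notation is assumed: $\hat{P}_N$ is the empirical distribution, $W_1$ the type-1 Wasserstein distance with ground norm $\|\cdot\|$, $\|\cdot\|_*$ its dual norm, and $\ell$ a convex Lipschitz loss with Lipschitz constant $\mathrm{Lip}(\ell)$, with the data supported on all of $\mathbb{R}^d$. For an affine decision rule $\beta^\top \xi$, the worst-case expected loss over a Wasserstein ball of radius $\varepsilon$ then reduces to a norm-regularized empirical loss:

$$\sup_{Q \,:\, W_1(Q, \hat{P}_N) \le \varepsilon} \mathbb{E}_{Q}\big[\ell(\beta^\top \xi)\big] \;=\; \mathbb{E}_{\hat{P}_N}\big[\ell(\beta^\top \xi)\big] \;+\; \varepsilon\, \mathrm{Lip}(\ell)\, \|\beta\|_* .$$

The abstract's claim is that an exact equivalence of this kind extends to more general Wasserstein balls and to decision criteria that are general measures of risk rather than expectations.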
The Geometry of Adversarial Training in Binary Classification
Bungert, Leon, Trillos, Nicolás García, Murray, Ryan
We establish an equivalence between a family of adversarial training problems for non-parametric binary classification and a family of regularized risk minimization problems where the regularizer is a nonlocal perimeter functional. The resulting regularized risk minimization problems admit exact convex relaxations of the type $L^1+$ (nonlocal) $\operatorname{TV}$, a form frequently studied in image analysis and graph-based learning. A rich geometric structure is revealed by this reformulation which in turn allows us to establish a series of properties of optimal solutions of the original problem, including the existence of minimal and maximal solutions (interpreted in a suitable sense), and the existence of regular solutions (also interpreted in a suitable sense). In addition, we highlight how the connection between adversarial training and perimeter minimization problems provides a novel, directly interpretable, statistical motivation for a family of regularized risk minimization problems involving perimeter/total variation. The majority of our theoretical results are independent of the distance used to define adversarial attacks.
On implicit regularization: Morse functions and applications to matrix factorization
In this paper, we revisit implicit regularization from the ground up using notions from dynamical systems and invariant subspaces of Morse functions. The key contributions are a new criterion for implicit regularization---a leading contender to explain the generalization power of deep models such as neural networks---and a general blueprint to study it. We apply these techniques to settle a conjecture on implicit regularization in matrix factorization.
A brief introduction to the Grey Machine Learning
This paper presents a brief introduction to the key points of kernel-based Grey Machine Learning (GML). The general formulation of the grey system models is first summarized, and then the nonlinear extension of the grey models is developed, also in a general formulation. An implicit kernel mapping is used to estimate the nonlinear function of the GML model: by extending the nonparametric formulation of the LSSVM, the estimate of the nonlinear function of the GML model can also be expressed in terms of kernels. The advantages of this new framework over the existing grey models and the LSSVM are briefly discussed, and perspectives and future directions for the framework are also presented.