subgradient descent
Reviews: Training Deep Networks without Learning Rates Through Coin Betting
Summary: This paper builds on the notion (established in prior work) that subgradient descent can be interpreted as betting on a coin. It generalizes existing results to data-dependent bets and, consequently, obtains data-dependent convergence bounds that improve on prior ones. The algorithm is then adapted to the training of neural networks and evaluated experimentally on this task.
Quality: The derivations are mathematically sound; I verified the proof of Theorem 1. The changes made to adapt COCOB to the training of neural networks (Algorithm 1 to Algorithm 2) are sensible and intuitive.
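To make the coin-betting view concrete, below is a minimal per-coordinate sketch of a parameter-free coin-betting optimizer in the spirit of Orabona and Pal's KT bettor. It illustrates the reduction the review refers to; it is not the paper's COCOB-Backprop algorithm, and it assumes (sub)gradients bounded by 1 in each coordinate.

```python
import numpy as np

def coin_betting_minimize(grad, x0, steps=500, eps=1.0):
    """Parameter-free coin-betting optimizer (KT-bettor-style sketch).

    Each negative (sub)gradient is a coin outcome, and the iterate is
    the bettor's current stake around x0. Assumes |grad| <= 1 per
    coordinate; no learning rate is tuned anywhere.
    """
    x0 = np.asarray(x0, dtype=float)
    wealth = np.full_like(x0, eps)    # initial endowment per coordinate
    outcome_sum = np.zeros_like(x0)   # running sum of outcomes -g_s
    x = x0.copy()
    for t in range(1, steps + 1):
        beta = outcome_sum / t        # KT betting fraction
        x = x0 + beta * wealth        # stake the bet
        g = grad(x)                   # observe the (sub)gradient
        wealth += -g * (x - x0)       # bettor's gain or loss this round
        outcome_sum += -g
    return x

# Example: minimize f(x) = |x - 1| with no learning rate to tune.
x_star = coin_betting_minimize(lambda x: np.sign(x - 1.0), x0=[0.0])
```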
Primal Estimated Subgradient Solver for SVM for Imbalanced Classification
We aim to demonstrate in experiments that our cost-sensitive PEGASOS SVM achieves good performance on imbalanced data sets with majority-to-minority ratios ranging from 8.6:1 to 130:1, and to ascertain whether including the intercept (bias), regularization, and other parameters affects performance on our selection of datasets. Although many resort to SMOTE methods, we aim for a less computationally intensive method. We evaluate performance by examining the learning curves. These curves diagnose whether we overfit or underfit, or whether the random sample of data chosen during the process was not random or diverse enough in dependent-variable class for the algorithm to generalize to unseen examples. We also examine the behavior of the hyperparameters against the test and train error in validation curves. We benchmark our cost-sensitive PEGASOS SVM against the results of Ding's LINEAR SVM DECIDL method, which obtained an ROC-AUC of 0.5 on one dataset. Our work extends Ding's by incorporating kernels into the SVM. We use Python rather than MATLAB, as Python has dictionaries for storing mixed data types during multi-parameter cross-validation.
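As an illustration of the method this abstract describes, here is a minimal sketch of a PEGASOS-style update with class-dependent hinge costs. The cost weighting and the unregularized intercept are plausible choices for a cost-sensitive variant, not necessarily the authors' exact formulation.

```python
import numpy as np

def pegasos_cost_sensitive(X, y, lam=0.01, epochs=10,
                           cost_pos=1.0, cost_neg=1.0, seed=0):
    """PEGASOS-style solver with class-dependent hinge costs.

    y must be in {-1, +1}; cost_pos/cost_neg reweight the hinge loss of
    each class (e.g. cost_pos >> cost_neg for a rare positive class).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # PEGASOS step-size schedule
            cost = cost_pos if y[i] > 0 else cost_neg
            margin = y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * lam               # shrink from the regularizer
            if margin < 1.0:                   # hinge subgradient is active
                w += eta * cost * y[i] * X[i]
                b += eta * cost * y[i]         # unregularized intercept
    return w, b
```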
Learning Large Scale Sparse Models
Dhingra, Atul, Shen, Jie, Kleene, Nicholas
In this work, we consider learning sparse models in large-scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur in such a challenging scenario: (i) computational cost and (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization techniques. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner, where in each iteration only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size, and gradient evaluation for one sample is efficient. Perhaps surprisingly, we find that with the same parameter, the sparsity promoted by batch methods is not preserved in the online fashion. We analyze this interesting phenomenon and illustrate some effective variants, including mini-batch methods and a hard-thresholding-based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset and support our findings and algorithms.
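As a concrete instance of the online scheme sketched in this abstract, here is a minimal stochastic proximal-gradient (soft-thresholding) Lasso solver. The step size, batching, and thresholding schedule are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_lasso(X, y, lam=0.1, eta=0.01, batch=1, epochs=5, seed=0):
    """Online/mini-batch proximal SGD for the Lasso.

    Each step sees only `batch` samples: a gradient step on the squared
    loss, then soft thresholding. With batch=1 the iterate is typically
    far denser than the batch solution at the same lam, which is the
    phenomenon the abstract highlights; larger batches mitigate it.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch, 1)):
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # squared-loss gradient
            w = soft_threshold(w - eta * grad, eta * lam)
    return w
```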
On the Equivalence between Neural Network and Support Vector Machine
Chen, Yilan, Huang, Wei, Nguyen, Lam M., Weng, Tsui-Wei
Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) \citep{jacot2018neural}. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with the NTK \citep{arora2019exact}. However, the equivalence is currently known only for ridge regression \citep{arora2019harnessing}, while the equivalence between NNs and other kernel machines (KMs), e.g., support vector machines (SVMs), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically between the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalence between NN and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate that our theory enables three practical applications: (i) a \textit{non-vacuous} generalization bound for the NN via the corresponding KM; (ii) a \textit{non-trivial} robustness certificate for the infinite-width NN (where existing robustness verification methods would provide vacuous bounds); and (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at \url{https://github.com/leslie-CH/equiv-nn-svm}.
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Diego County > La Jolla (0.04)
- Oceania > Australia (0.04)
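To ground the SVM side of the equivalence in the Chen et al. abstract above, the following is a minimal sketch of training a soft-margin kernel SVM by subgradient descent, given a precomputed Gram matrix K (the NTK in the paper's setting; any PSD kernel works here). The objective form, step size, and coefficient parameterization are illustrative choices, not the authors' exact training setup.

```python
import numpy as np

def kernel_svm_subgradient(K, y, lam=0.1, eta=0.1, steps=200):
    """Soft-margin kernel SVM trained by subgradient descent.

    K is a precomputed PSD Gram matrix and y is in {-1, +1}. Minimizes
    over coefficients a:
        (lam/2) a^T K a + (1/n) sum_i max(0, 1 - y_i (K a)_i)
    so the learned function is f(x) = sum_j a_j k(x_j, x).
    """
    n = len(y)
    a = np.zeros(n)
    for t in range(1, steps + 1):
        f = K @ a                            # function values on the data
        active = y * f < 1.0                 # margin violators
        sub = lam * f - K[:, active] @ y[active] / n
        a -= (eta / np.sqrt(t)) * sub        # diminishing step size
    return a
```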
Minimizing approximately submodular functions
Halabi, Marwa El, Jegelka, Stefanie
The problem of minimizing a submodular function is well studied; several polynomial-time algorithms have been developed to solve it exactly or up to arbitrary accuracy. However, in many applications, the objective functions are not exactly submodular. In this paper, we show that a classical algorithm used for submodular minimization performs well even for a class of non-submodular functions, namely weakly DR-submodular functions. We provide the first approximation guarantee for non-submodular minimization. This broadly expands the range of applications of submodular minimization techniques.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- (2 more...)
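For context on the Halabi and Jegelka abstract above: submodular minimization is classically reduced to convex minimization of the Lovasz extension. Below is a minimal sketch of that reduction using a plain projected subgradient method; the paper analyzes a specific classical algorithm rather than this generic scheme, and the boolean-mask interface for the set function F and the rounding step are illustrative assumptions.

```python
import numpy as np

def lovasz_subgradient(F, x):
    """Subgradient of the Lovasz extension of F at x, via the classical
    greedy rule: sort coordinates in decreasing order of x and take
    marginal gains. F maps a boolean mask to a float, with F(empty) = 0.
    """
    order = np.argsort(-x)
    s = np.zeros(len(x))
    mask = np.zeros(len(x), dtype=bool)
    prev = 0.0
    for k in order:
        mask[k] = True
        val = F(mask)
        s[k] = val - prev              # marginal gain F(S_k) - F(S_{k-1})
        prev = val
    return s

def minimize_set_function(F, n, steps=300, eta=0.1):
    """Projected subgradient descent on the Lovasz extension over
    [0,1]^n, followed by rounding to the best superlevel set."""
    x = np.full(n, 0.5)
    for t in range(1, steps + 1):
        x -= (eta / np.sqrt(t)) * lovasz_subgradient(F, x)
        x = np.clip(x, 0.0, 1.0)       # project back onto the box
    best = np.zeros(n, dtype=bool)     # consider the empty set too
    for thr in np.unique(x):
        S = x >= thr
        if F(S) < F(best):
            best = S
    return best
```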
Randomized Smoothing SVRG for Large-scale Nonsmooth Convex Optimization
In this paper, we consider the problem of minimizing the average of a large number of nonsmooth convex functions. Such problems often arise in typical machine learning settings such as empirical risk minimization, but are computationally very challenging. We develop and analyze a new algorithm that achieves a robust linear convergence rate, and both its time complexity and gradient complexity are superior to those of state-of-the-art nonsmooth algorithms and subgradient-based schemes. Moreover, our algorithm works without any extra error-bound conditions on the objective function or the common strong-convexity condition. We show that our algorithm has wide applications in optimization and machine learning problems, and we demonstrate experimentally that it performs well on a large-scale ranking problem.
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Singapore (0.04)
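To illustrate the two ingredients named in the title above, here is a sketch combining a randomized-smoothing gradient estimate with an SVRG-style variance-reduced loop. The Gaussian smoothing distribution, sample counts, and step sizes are illustrative assumptions; the paper's actual algorithm and its epoch and smoothing schedules differ in detail.

```python
import numpy as np

def smoothed_grad(subgrad, x, mu=0.1, m=10, rng=None):
    """Randomized-smoothing gradient estimate for the surrogate
    f_mu(x) = E[f(x + mu*u)], u ~ N(0, I): average subgradients of f
    at m Gaussian-perturbed points."""
    if rng is None:
        rng = np.random.default_rng(0)
    return np.mean([subgrad(x + mu * rng.standard_normal(x.shape))
                    for _ in range(m)], axis=0)

def svrg_smoothed(subgrads, x0, mu=0.1, eta=0.01, outer=10, inner=50,
                  seed=0):
    """SVRG-style loop on the smoothed objective: a full smoothed
    gradient at a snapshot plus variance-reduced stochastic corrections.
    subgrads is a list of per-component subgradient callables."""
    rng = np.random.default_rng(seed)
    n = len(subgrads)
    x = np.asarray(x0, dtype=float)
    for _ in range(outer):
        snap = x.copy()
        full = np.mean([smoothed_grad(g, snap, mu, rng=rng)
                        for g in subgrads], axis=0)  # snapshot gradient
        for _ in range(inner):
            i = rng.integers(n)
            v = (smoothed_grad(subgrads[i], x, mu, rng=rng)
                 - smoothed_grad(subgrads[i], snap, mu, rng=rng) + full)
            x -= eta * v                             # variance-reduced step
    return x
```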
Integrated Inference and Learning of Neural Factors in Structural Support Vector Machines
Houthooft, Rein, De Turck, Filip
Tackling pattern recognition problems in areas such as computer vision, bioinformatics, speech, or text recognition is often done best by taking into account task-specific statistical relations between output variables. In structured prediction, this internal structure is used to predict multiple outputs simultaneously, leading to more accurate and coherent predictions. Structural support vector machines (SSVMs) are nonprobabilistic models that optimize a joint input-output function through margin-based learning. Because SSVMs generally disregard the interplay between unary and interaction factors during the training phase, the final parameters are suboptimal. Moreover, their factors are often restricted to linear combinations of input features, limiting their generalization power. To improve prediction accuracy, this paper proposes (i) joint inference and learning, by integrating back-propagation and loss-augmented inference into SSVM subgradient descent, and (ii) extending SSVM factors to neural networks, which form highly nonlinear functions of the input features. Image segmentation benchmark results demonstrate improvements over conventional SSVM training methods in terms of accuracy, highlighting the feasibility of end-to-end SSVM training with neural factors.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Arizona > Maricopa County > Phoenix (0.04)
- Europe > Belgium > Flanders (0.04)
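To make the SSVM subgradient step in the Houthooft and De Turck abstract concrete, below is a minimal multiclass special case: loss-augmented inference picks the worst violator, and its feature difference from the ground truth forms the subgradient. With neural factors, the linear scores would be replaced by network outputs and the same loss-augmented difference backpropagated through them; this helper is a hypothetical sketch, not the authors' implementation.

```python
import numpy as np

def ssvm_subgradient_step(W, x, y_true, delta, eta=0.1, lam=1e-3):
    """One SSVM subgradient step on a multiclass instance (the simplest
    special case of structured prediction).

    Scores are linear, score(y) = W[y] @ x; delta[y_true, y] is the
    task loss. With neural factors, score(y) would come from a network
    and the same difference would be backpropagated through it.
    """
    scores = W @ x                              # unary score per label
    y_hat = np.argmax(scores + delta[y_true])   # loss-augmented inference
    g = lam * W                                 # regularizer gradient
    if y_hat != y_true:                         # structured hinge is active
        g[y_hat] += x                           # push down the worst violator
        g[y_true] -= x                          # push up the ground truth
    return W - eta * g
```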