 margin maximization




On Margin Maximization in Linear and ReLU Networks

Neural Information Processing Systems

The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li (2019) showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed.
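For reference, the max margin problem in parameter space mentioned above is commonly written as below. The notation is assumed here and does not come from the abstract: $\theta$ denotes the parameters, $\Phi(\theta; x)$ the network output, $(x_i, y_i)$ with $y_i \in \{-1, 1\}$ the $n$ training examples, and $\lambda_i$ the multipliers. The KKT conditions are stated for the differentiable case; for ReLU networks the gradient is usually replaced by a Clarke subgradient.

```latex
% Max margin problem in parameter space (notation assumed for illustration)
\min_{\theta} \; \tfrac{1}{2}\,\|\theta\|_2^2
\quad \text{s.t.} \quad y_i\,\Phi(\theta; x_i) \ge 1 \quad \text{for all } i = 1, \dots, n.

% A feasible \theta is a KKT point if there exist \lambda_1, \dots, \lambda_n \ge 0 with
\theta = \sum_{i=1}^{n} \lambda_i\, y_i\, \nabla_{\theta} \Phi(\theta; x_i),
\qquad
\lambda_i \bigl( y_i\,\Phi(\theta; x_i) - 1 \bigr) = 0 \quad \text{for all } i.
```

Because the constraints are non-convex in $\theta$ for networks with more than one layer, satisfying these conditions does not by itself imply local or global optimality, which is the gap the abstract addresses.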



f1298750ed09618717f9c10ea8d1d3b0-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for the detailed and insightful feedback. The reviewers noted that the paper "target[s] a timely "missing ... a bound on the target accuracy of the final classifier (in analogy to Theorem 3.2 which studies a Clarification on why this is not provided or difficult to provide ... would be useful." Instead, we focus on removing the spurious features. "[theory not surprising because loss] would favor good features due to their correlation with the model", "unclear We respectfully and strongly disagree. In Figure 10, the two losses achieve equivalent empirical performance.


Do Mice Grok? Glimpses of Hidden Progress During Overtraining in Sensory Cortex

arXiv.org Artificial Intelligence

Does learning of task-relevant representations stop when behavior stops changing? Motivated by recent theoretical advances in machine learning and the intuitive observation that human experts continue to learn from practice even after mastery, we hypothesize that task-specific representation learning can continue, even when behavior plateaus. In a novel reanalysis of recently published neural data, we find evidence for such learning in posterior piriform cortex of mice following continued training on a task, long after behavior saturates at near-ceiling performance ("overtraining"). This learning is marked by an increase in decoding accuracy from piriform neural populations and improved performance on held-out generalization tests. We demonstrate that class representations in cortex continue to separate during overtraining, so that examples that were incorrectly classified at the beginning of overtraining can abruptly be correctly classified later on, despite no changes in behavior during that time. We hypothesize this hidden yet rich learning takes the form of approximate margin maximization; we validate this and other predictions in the neural data, as well as build and interpret a simple synthetic model that recapitulates these phenomena. We conclude by showing how this model of late-time feature learning implies an explanation for the empirical puzzle of overtraining reversal in animal learning, where task-specific representations are more robust to particular task changes because the learned features can be reused.


Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks

arXiv.org Machine Learning

We study the implicit bias of the general family of steepest descent algorithms, which includes gradient descent, sign descent and coordinate descent, in deep homogeneous neural networks. We prove that an algorithm-dependent geometric margin starts increasing once the networks reach perfect training accuracy and characterize the late-stage bias of the algorithms. In particular, we define a generalized notion of stationarity for optimization problems and show that the algorithms progressively reduce a (generalized) Bregman divergence, which quantifies proximity to such stationary points of a margin-maximization problem. We then experimentally zoom into the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of Adam.
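As a rough illustration of how gradient descent, sign descent and coordinate descent fit into one family, the sketch below computes the normalized steepest descent direction of a gradient vector under three norms. The function name `steepest_descent_direction` and the string labels for the norms are hypothetical, and this is a textbook-style sketch, not the paper's implementation.

```python
def sign(v):
    """Return -1.0, 0.0 or 1.0 depending on the sign of v."""
    return float((v > 0) - (v < 0))


def steepest_descent_direction(grad, norm):
    """Return a direction d with unit norm (in the chosen norm) minimizing <grad, d>.

    norm == "l2"   -> normalized gradient descent
    norm == "linf" -> sign descent
    norm == "l1"   -> coordinate descent (only the largest-magnitude coordinate moves)
    """
    if norm == "l2":
        size = sum(g * g for g in grad) ** 0.5
        if size == 0.0:
            return [0.0] * len(grad)
        return [-g / size for g in grad]
    if norm == "linf":
        return [-sign(g) for g in grad]
    if norm == "l1":
        i = max(range(len(grad)), key=lambda k: abs(grad[k]))
        d = [0.0] * len(grad)
        d[i] = -sign(grad[i])
        return d
    raise ValueError(f"unknown norm: {norm}")


if __name__ == "__main__":
    g = [3.0, -4.0]  # hypothetical gradient
    for name in ("l2", "linf", "l1"):
        print(name, steepest_descent_direction(g, name))
```

Each choice of norm gives a different update geometry, which is consistent with the abstract's point that the margin being increased depends on the algorithm.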


9461cce28ebe3e76fb4b931c35a169b0-Reviews.html

Neural Information Processing Systems

In this paper, the authors provide an algorithm for directly minimizing the 0-1 loss and maximizing the margin. Most existing machine learning techniques rely on minimizing a convex upper bound on the 0-1 loss in classification problems. In contrast, the authors propose a simple greedy algorithm that directly minimizes the 0-1 loss via a combination of weak learners. This is followed by a few steps of direct margin maximization. The proposed algorithm is then evaluated on a few small, low-dimensional datasets.


Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling

arXiv.org Artificial Intelligence

In this work, we investigate the margin-maximization bias exhibited by gradient-based algorithms in classifying linearly separable data. We present an in-depth analysis of the specific properties of the velocity field associated with (normalized) gradients, focusing on their role in margin maximization. Inspired by this analysis, we propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an exponential rate. This stands in stark contrast to all existing algorithms, which maximize the margin at a slow polynomial rate. Specifically, we identify mild conditions on data distribution under which existing algorithms such as gradient descent (GD) and normalized gradient descent (NGD) provably fail in maximizing the margin efficiently. To validate our theoretical findings, we present both synthetic and real-world experiments. Notably, PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.
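To make the quantities in this abstract concrete, the sketch below computes the normalized margin of a linear classifier and runs plain normalized gradient descent (NGD) on the exponential loss. It is a baseline illustration only and does not implement PRGD, whose rescaling schedule is not described in the abstract. The toy dataset, the function names, and the step size are all assumptions made for the example.

```python
import math

# Hypothetical, linearly separable toy data: points X with labels y in {-1, +1}.
X = [(1.0, 3.0), (3.0, 0.5), (-1.0, -1.0), (-2.0, -3.0)]
y = [1.0, 1.0, -1.0, -1.0]


def normalized_margin(w, X, y):
    """Return min_i y_i <w, x_i> / ||w||_2 for a nonzero weight vector w."""
    norm = math.sqrt(sum(c * c for c in w))
    return min(yi * sum(wc * xc for wc, xc in zip(w, xi)) for xi, yi in zip(X, y)) / norm


def exp_loss_gradient(w, X, y):
    """Gradient of sum_i exp(-y_i <w, x_i>) with respect to w."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        coeff = -yi * math.exp(-yi * sum(wc * xc for wc, xc in zip(w, xi)))
        for k, xc in enumerate(xi):
            grad[k] += coeff * xc
    return grad


def normalized_gradient_descent(X, y, steps=500, lr=0.1):
    """Run NGD from w = 0: each step moves a fixed distance lr along -grad / ||grad||."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = exp_loss_gradient(w, X, y)
        g_norm = math.sqrt(sum(c * c for c in g))
        if g_norm == 0.0:  # guard against numerical underflow
            break
        w = [wc - lr * gc / g_norm for wc, gc in zip(w, g)]
    return w


if __name__ == "__main__":
    w = normalized_gradient_descent(X, y)
    print("normalized margin:", normalized_margin(w, X, y))
```

Tracking `normalized_margin` across iterations is one simple way to compare how quickly different algorithms approach the maximum margin on separable data.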