 Gradient Descent


TRBoost: A Generic Gradient Boosting Machine based on Trust-region Method

arXiv.org Artificial Intelligence

Gradient Boosting Machines (GBMs) have demonstrated remarkable success in solving diverse problems by utilizing Taylor expansions in functional space. However, achieving a balance between performance and generality has posed a challenge for GBMs. In particular, gradient descent-based GBMs employ the first-order Taylor expansion to ensure applicability to all loss functions, while Newton's method-based GBMs use positive Hessian information to achieve superior performance at the expense of generality. To address this issue, this study proposes a new generic Gradient Boosting Machine called Trust-region Boosting (TRBoost). In each iteration, TRBoost uses a constrained quadratic model to approximate the objective and applies the Trust-region algorithm to solve it and obtain a new learner. Unlike Newton's method-based GBMs, TRBoost does not require the Hessian to be positive definite, thereby allowing it to be applied to arbitrary loss functions while still maintaining competitive performance similar to second-order algorithms. The convergence analysis and numerical experiments conducted in this study confirm that TRBoost is as general as first-order GBMs and yields competitive results compared to second-order GBMs. Overall, TRBoost is a promising approach that balances performance and generality, making it a valuable addition to the toolkit of machine learning practitioners.
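
The constrained quadratic model mentioned in the abstract can be illustrated in one dimension. The sketch below is not TRBoost itself: it assumes a single scalar leaf value w, a summed gradient g, a summed Hessian h that may be zero or negative, and a fixed trust-region radius. The function name and the scalar setting are hypothetical, and the radius update rule of the actual method is not given in the abstract.

```python
def trust_region_step(g, h, radius):
    """Minimize g*w + 0.5*h*w**2 subject to |w| <= radius (scalar case)."""
    if h > 0:
        # Convex model: take the Newton step if it stays inside the region.
        w = -g / h
        if abs(w) <= radius:
            return w
    # Otherwise the minimum lies on the boundary, on the side that
    # decreases the linear term.
    if g > 0:
        return -radius
    if g < 0:
        return radius
    # g == 0: a concave model is minimized at either boundary point,
    # a flat model (h == 0) is minimized anywhere, so stay at 0.
    return radius if h < 0 else 0.0


print(trust_region_step(2.0, 4.0, 1.0))   # Newton step fits inside the region
print(trust_region_step(1.0, -1.0, 0.5))  # negative curvature, step to boundary
```

The point of the example is that the step stays well defined when h is not positive, a case in which a plain Newton step -g/h would move in the wrong direction or be undefined.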


ADI: Adversarial Dominating Inputs in Vertical Federated Learning Systems

arXiv.org Artificial Intelligence

Vertical federated learning (VFL) has recently become prominent as a way to process data distributed across many individual sources without centralizing it. Multiple participants collaboratively train models based on their local data in a privacy-aware manner. To date, VFL has become a de facto solution to securely learn a model among organizations, allowing knowledge to be shared without compromising the privacy of any individual. Despite the rapid development of VFL systems, we find that certain inputs of a participant, named adversarial dominating inputs (ADIs), can steer the joint inference in the direction the adversary chooses and force other (victim) participants to make negligible contributions, so that they lose the rewards usually offered in proportion to the importance of their contributions in federated learning scenarios. We conduct a systematic study on ADIs by first proving their existence in typical VFL systems. We then propose gradient-based methods to synthesize ADIs of various formats and exploit common VFL systems. We further launch greybox fuzz testing, guided by the saliency score of "victim" participants, to perturb adversary-controlled inputs and systematically explore the VFL attack surface in a privacy-preserving manner. We conduct an in-depth study on the influence of critical parameters and settings in synthesizing ADIs. Our study reveals new VFL attack opportunities, promoting the identification of unknown threats before breaches and the building of more secure VFL systems.


The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima

arXiv.org Artificial Intelligence

The broad practical impact of deep learning has heightened interest in many of its surprising characteristics: simple gradient methods applied to deep neural networks seem to efficiently optimize nonconvex criteria, reliably giving a near-perfect fit to training data, but exhibiting good predictive accuracy nonetheless [see Bartlett et al., 2021]. Optimization methodology is widely believed to affect statistical performance by imposing some kind of implicit regularization, and there has been considerable effort devoted to understanding the behavior of optimization methods and the nature of solutions that they find. For instance, Barrett and Dherin [2020] and Smith et al. [2021] show that discrete-time gradient descent and stochastic gradient descent can be viewed as gradient flow methods applied to penalized losses that encourage smoothness, and Soudry et al. [2018] and Azulay et al. [2021] identify the implicit regularization imposed by gradient flow in specific examples, including linear networks. We consider Sharpness-Aware Minimization (SAM), a recently introduced [Foret et al., 2021] gradient optimization method that has exhibited substantial improvements in prediction performance for deep networks applied to image classification [Foret et al., 2021] and NLP [Bahri et al., 2022] problems. Also affiliated with University of California, Berkeley.
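
For readers unfamiliar with SAM, the following is a minimal sketch of one SAM update as commonly stated: the gradient is evaluated at a point displaced by a distance rho along the normalized gradient, and that gradient is then applied at the original point. The diagonal quadratic loss, the function names, and the parameter values are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def grad(w, lam):
    """Gradient of the illustrative loss 0.5 * sum(lam_i * w_i**2)."""
    return [l * x for l, x in zip(lam, w)]


def sam_step(w, lam, eta, rho):
    """One SAM update: ascend by rho along the normalized gradient,
    then descend from w using the gradient at the perturbed point."""
    g = grad(w, lam)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # stationary point: perturbation direction undefined
    w_adv = [x + rho * gi / norm for x, gi in zip(w, g)]
    g_adv = grad(w_adv, lam)
    return [x - eta * gi for x, gi in zip(w, g_adv)]


w = [1.0, 0.0]
for _ in range(5):
    w = sam_step(w, lam=[2.0, 1.0], eta=0.1, rho=0.5)
print(w)
```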


Online Learning with Adversarial Delays

Neural Information Processing Systems

We study the performance of standard online learning algorithms when the feedback is delayed by an adversary. We show that online-gradient-descent [1] and follow-the-perturbed-leader [2] achieve regret $O(\sqrt{D})$ in the delayed setting, where D is the sum of delays of each round's feedback. This bound collapses to an optimal $O(\sqrt{T})$ bound in the usual setting of no delays (where D = T). Our main contribution is to show that standard algorithms for online learning already have simple regret bounds in the most general setting of delayed feedback, making adjustments to the analysis and not to the algorithms themselves. Our results help affirm and clarify the success of recent algorithms in optimization and machine learning that operate in a delayed feedback model.
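
The delayed setting can be sketched as follows: online gradient descent runs unchanged and simply applies each gradient when it arrives. The sketch assumes a one-dimensional decision in [lo, hi], squared losses 0.5*(x - z)**2, a constant step size eta, and the convention that a delay of 1 means feedback arrives before the next round (so no delays gives D = T, as in the abstract). All names and parameter values are hypothetical.

```python
def delayed_ogd(targets, delays, eta, lo=-1.0, hi=1.0):
    """Projected online gradient descent with delayed feedback.

    The gradient from round t, taken at the point played in round t,
    becomes available at the end of round t + delays[t] - 1.
    Returns the list of points played.
    """
    x = 0.0
    pending = {}  # arrival round -> gradients that arrive then
    plays = []
    for t, (z, d) in enumerate(zip(targets, delays)):
        plays.append(x)
        # Gradient of 0.5*(x - z)**2 at the played point.
        pending.setdefault(t + d - 1, []).append(x - z)
        # Apply whatever feedback arrives at the end of this round.
        for g in pending.pop(t, []):
            x = min(hi, max(lo, x - eta * g))
    return plays


print(delayed_ogd([1.0, 1.0, 1.0], [1, 1, 1], eta=0.5))  # no delays
print(delayed_ogd([1.0, 1.0, 1.0], [3, 1, 1], eta=0.5))  # first round delayed
```

Feedback scheduled to arrive after the final round is simply never applied in this sketch.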


Simulated Annealing in Early Layers Leads to Better Generalization

arXiv.org Artificial Intelligence

Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance.


Implicit regularization of dropout

arXiv.org Artificial Intelligence

It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training. In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments. Additionally, we numerically study two implications of the implicit regularization, which intuitively rationalize why dropout helps generalization. Firstly, we find that the input weights of hidden neurons tend to condense on isolated orientations when trained with dropout. Condensation is a feature in the non-linear learning process, which makes the network less complex. Secondly, we experimentally find that training with dropout leads the neural network to a flatter minimum than standard gradient descent training does, and the implicit regularization is the key to finding flat solutions. Although our theory mainly focuses on dropout used in the last hidden layer, our experiments apply to general dropout in training neural networks. This work points out a distinct characteristic of dropout compared with stochastic gradient descent and serves as an important basis for fully understanding dropout.


Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

arXiv.org Artificial Intelligence

Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/\eta$. The second, dubbed edge of stability, is that the sharpness hovers at $2/\eta$ for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.
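
The classical threshold $2/\eta$ referred to above can be checked on a one-dimensional quadratic, where the sharpness is constant. The sketch below only illustrates that stability condition, not the self-stabilization mechanism, which depends on the cubic term; the loss 0.5*S*x**2, the function name, and the parameter values are assumptions.

```python
def gd_on_quadratic(sharpness, eta, x0=1.0, steps=50):
    """Run gradient descent on L(x) = 0.5 * sharpness * x**2.

    Each step multiplies x by (1 - eta * sharpness), so the iterates
    shrink exactly when sharpness < 2 / eta.
    """
    x = x0
    for _ in range(steps):
        x -= eta * sharpness * x
    return x


eta = 0.1  # instability cutoff is 2 / eta = 20
print(abs(gd_on_quadratic(19.0, eta)))  # below the cutoff: shrinks
print(abs(gd_on_quadratic(21.0, eta)))  # above the cutoff: grows
```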


Machine Learning in Three Steps: How to Efficiently Learn It

#artificialintelligence

I have observed two extreme approaches when it comes to aspiring data scientists attempting to learn machine learning algorithms. The first approach involves learning all the intricacies of the algorithms and implementing them from scratch to gain true mastery. The second approach, on the other hand, assumes that the computer will "learn" on its own, rendering the need for the individual to learn the algorithms unnecessary. This leads some to only rely on tools such as the package lazypredict. It is realistic to take an approach between the two extremes when learning machine learning algorithms. However, the question remains, where to start? In this article, I will categorize machine learning algorithms into three categories and provide my humble opinion on what to begin with and what can be skipped. Starting out in machine learning can be overwhelming due to the multitude of available algorithms. Linear regression, support vector machines (SVM), gradient descent, gradient boosting, decision trees, LASSO, ridge, grid search, and many more are some of the algorithms that come to mind when posed with the question.


Optimization with Artificial Neural Network Systems: A Mapping Principle and a Comparison to Gradient Based Methods

Neural Information Processing Systems

A comparison is made to optimization using gradient-search methods. The performance measure is the settling time from an initial state to a target state. A simple analytical example illustrates a situation where dynamical systems representing artificial neural network methods would settle faster than those representing gradient-search. Settling time was investigated for a more complicated optimization problem using computer simulations. The problem was a simplified version of a problem in medical imaging: determining loci of cerebral activity from electromagnetic measurements at the scalp.


Optimization by Mean Field Annealing

Neural Information Processing Systems

Nearly optimal solutions to many combinatorial problems can be found using stochastic simulated annealing. This paper extends the concept of simulated annealing from its original formulation as a Markov process to a new formulation based on mean field theory. Mean field annealing essentially replaces the discrete degrees of freedom in simulated annealing with their average values as computed by the mean field approximation. The net result is that equilibrium at a given temperature is achieved 1-2 orders of magnitude faster than with simulated annealing. A general framework for the mean field annealing algorithm is derived, and its relationship to Hopfield networks is shown.
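
The idea of replacing discrete variables with their averages can be sketched for binary spins. The example assumes an Ising-style energy -0.5 * sum_ij W_ij s_i s_j with spins s_i in {-1, +1}, for which the standard mean field equation is v_i = tanh(sum_j W_ij v_j / T). The energy, the geometric cooling schedule, and all names are assumptions for illustration, not the formulation derived in the paper.

```python
import math


def mean_field_anneal(weights, v, t_start=2.0, t_min=0.1, cooling=0.8, sweeps=3):
    """Anneal mean spin values v_i in [-1, 1] under symmetric couplings.

    At each temperature, every v_i is repeatedly set to the tanh of its
    local field divided by the temperature; the temperature is then lowered.
    """
    v = list(v)
    temp = t_start
    while temp >= t_min:
        for _ in range(sweeps):
            for i in range(len(v)):
                field = sum(w * vj for w, vj in zip(weights[i], v))
                v[i] = math.tanh(field / temp)
        temp *= cooling
    return v


# Two spins with a positive coupling: the averages should align.
couplings = [[0.0, 1.0], [1.0, 0.0]]
print(mean_field_anneal(couplings, [0.1, 0.05]))
```

At high temperature the averages relax toward zero, and as the temperature drops they commit to a near-discrete configuration, here with both spins taking the same sign.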