
Collaborating Authors

 Malviya, Pranshu


Torque-Aware Momentum

arXiv.org Artificial Intelligence

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and the previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

Despite the wide range of optimization methods available in the literature, stochastic gradient descent (SGD), typically augmented with momentum (Kingma & Ba, 2015; Nesterov, 1983; Qian, 1999), remains the go-to approach for practitioners. Momentum accelerates convergence, particularly in the presence of high curvature (Cutkosky & Mehta, 2020b), small but consistent gradients, or noisy gradients. It also helps the optimizer navigate the loss landscape and escape local minima or saddle points by maintaining consistent update directions (Jin et al., 2018). While SGD with momentum (SGDM) has shown remarkable success in various scenarios, particularly in computer vision (Sutskever et al., 2013), it remains vulnerable to large, misaligned gradients. In this work, we propose that minimizing the influence of misaligned gradients during momentum updates can preserve valuable information and improve the exploration capabilities of momentum-based methods. To enable more consistent exploration of the loss landscape, particularly in ...

Figure 1: Comparing momentum updates obtained using SGDM and TAM for a given SGD trajectory.
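The damping idea is concrete enough to sketch. Below is a minimal PyTorch illustration, assuming the damping factor is a simple function of the cosine between the incoming gradient and the current momentum; the function name and the specific formula (1 + cos)/2 are assumptions for illustration, not necessarily the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def tam_update(momentum, grad, beta=0.9):
    """One torque-aware momentum step (illustrative sketch, not the paper's exact rule)."""
    # Cosine of the angle between the new gradient and the running momentum.
    cos = F.cosine_similarity(momentum.flatten(), grad.flatten(), dim=0)
    # Damping in [0, 1]: opposing (misaligned) gradients contribute less,
    # which suppresses the oscillations classical momentum suffers from.
    damping = (1.0 + cos) / 2.0
    return beta * momentum + damping * grad

momentum = torch.zeros(10)
for _ in range(5):
    grad = torch.randn(10)  # stand-in for a stochastic gradient
    momentum = tam_update(momentum, grad)
```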


Predicting the Impact of Model Expansion through the Minima Manifold: A Loss Landscape Perspective

arXiv.org Artificial Intelligence

The optimal model for a given task is often challenging to determine, as it requires training multiple models from scratch, which becomes prohibitive as dataset and model sizes grow. A more efficient alternative is to reuse smaller pre-trained models by expanding them; however, this is not widely adopted because how expansion affects training dynamics remains poorly understood. While prior works have introduced statistics to measure these effects, those statistics remain flawed. To rectify this, we offer a new approach for understanding and quantifying the impact of expansion through the lens of the loss landscape, which has been shown to contain a manifold of linearly connected minima. Building on this new perspective, we propose a metric to study the impact of expansion by estimating the size of this manifold. Experimental results show a clear relationship between gains in performance and manifold size, enabling the comparison of candidate models and presenting a first step towards expanding models more reliably based on geometric properties of the loss landscape.
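Since the metric builds on a manifold of linearly connected minima, a natural building block is the standard linear-interpolation loss barrier between two trained models. The PyTorch sketch below is that generic probe, not the paper's manifold-size estimator; the function name, step count, and averaging scheme are placeholders.

```python
import torch

@torch.no_grad()
def loss_barrier(model_a, model_b, loss_fn, loader, steps=11):
    """Max loss increase along the linear path between two minima (generic probe)."""
    ref = [p.detach().clone() for p in model_a.parameters()]
    other = [p.detach().clone() for p in model_b.parameters()]
    path_losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        for p, a, b in zip(model_a.parameters(), ref, other):
            p.copy_((1 - alpha) * a + alpha * b)  # interpolate the weights
        losses = [loss_fn(model_a(x), y).item() for x, y in loader]
        path_losses.append(sum(losses) / len(losses))
    for p, a in zip(model_a.parameters(), ref):  # restore model_a
        p.copy_(a)
    # A barrier near zero suggests the two minima are linearly connected.
    return max(path_losses) - max(path_losses[0], path_losses[-1])
```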


Lookbehind Optimizer: k steps back, 1 step forward

arXiv.org Artificial Intelligence

The Lookahead optimizer improves the training stability of deep neural networks by having a set of fast weights that "look ahead" to guide the descent direction. Here, we combine this idea with sharpness-aware minimization (SAM) to stabilize its multi-step variant and improve the loss-sharpness trade-off. We propose Lookbehind, which computes $k$ gradient ascent steps ("looking behind") at each iteration and combines the gradients to bias the descent step toward flatter minima. We apply Lookbehind on top of two popular sharpness-aware training methods -- SAM and adaptive SAM (ASAM) -- and show that our approach leads to a myriad of benefits across a variety of tasks and training regimes. Particularly, we show increased generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
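A rough PyTorch sketch of the mechanics described above: take k SAM-style ascent steps, average the gradients collected along the way, and apply one descent step from the starting weights. The gradient averaging and the absence of slow/fast-weight bookkeeping are simplifying assumptions, not the paper's exact algorithm.

```python
import torch

def lookbehind_step(model, loss_fn, batch, opt, k=2, rho=0.05):
    """One simplified Lookbehind-style step: k ascent steps, one descent step."""
    x, y = batch
    start = [p.detach().clone() for p in model.parameters()]
    accum = [torch.zeros_like(p) for p in model.parameters()]
    for _ in range(k):
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.norm(torch.stack([g.norm() for g in grads]))
        with torch.no_grad():
            for p, g, a in zip(model.parameters(), grads, accum):
                a.add_(g)                          # collect this ascent-point gradient
                p.add_(rho * g / (norm + 1e-12))   # ascend toward higher loss (SAM-style)
    with torch.no_grad():
        for p, s, a in zip(model.parameters(), start, accum):
            p.copy_(s)       # return to the starting weights
            p.grad = a / k   # descend with the averaged gradient
    opt.step()
    opt.zero_grad()
```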


Promoting Exploration in Memory-Augmented Adam using Critical Momenta

arXiv.org Artificial Intelligence

Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
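As a rough illustration of the buffer idea, the toy class below keeps a small set of past momentum vectors and aggregates them into an extra term that could be added to Adam's usual update, nudging the iterate past narrow basins. Which momenta count as "critical", the eviction policy, and the mean aggregation are all assumptions here, not the paper's algorithm.

```python
import torch

class CriticalMomentaBuffer:
    """Toy buffer of past momentum vectors (illustrative only)."""

    def __init__(self, capacity=5):
        self.capacity = capacity
        self.buffer = []

    def add(self, momentum):
        self.buffer.append(momentum.detach().clone())
        if len(self.buffer) > self.capacity:
            # Assumed policy: keep the largest-norm momenta as "critical".
            self.buffer.sort(key=lambda m: m.norm().item(), reverse=True)
            self.buffer = self.buffer[: self.capacity]

    def aggregate(self):
        # Extra term added to the parameter update; the retained momenta push
        # the optimizer to overshoot basins that are not wide enough.
        return torch.stack(self.buffer).mean(dim=0)
```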


TAG: Task-based Accumulated Gradients for Lifelong learning

arXiv.org Artificial Intelligence

When an agent encounters a continual stream of new tasks in the lifelong learning setting, it leverages the knowledge gained from earlier tasks to help learn the new tasks better. In such a scenario, identifying an efficient knowledge representation becomes a challenging problem. Most research works propose to either store a subset of examples from past tasks in a replay buffer, dedicate a separate set of parameters to each task, or penalize excessive parameter updates through a regularization term. While existing methods employ the general task-agnostic stochastic gradient descent update rule, we propose a task-aware optimizer that adapts the learning rate based on the relatedness among tasks. We utilize the directions taken by the parameters during the updates by additively accumulating the gradients specific to each task. These task-based accumulated gradients act as a knowledge base that is maintained and updated throughout the stream. We empirically show that our proposed adaptive learning rate not only prevents catastrophic forgetting but also exhibits knowledge transfer. We also show that our method outperforms several state-of-the-art methods in lifelong learning on complex datasets. Moreover, our method can be combined with existing methods to achieve substantial improvements in performance.

Lifelong learning (LLL), also known as continual learning, is a setting where an agent continuously learns from data belonging to different tasks (Parisi et al., 2019). Here, the goal is to maximize performance on all the tasks arriving in a stream without replaying the entire datasets from past tasks (Riemer et al., 2018). Approaches proposed in this setting investigate the stability-plasticity dilemma in different ways, where stability refers to preventing the forgetting of past knowledge and plasticity refers to accumulating new knowledge by learning new tasks (Mermillod et al., 2013; Delange et al., 2021). Unlike human beings, who can efficiently assess the correctness and applicability of past knowledge (Chen & Liu, 2018), neural networks and other machine learning models often face various issues in this setting. Whenever data from a new task arrives, these models tend to forget previously obtained knowledge due to dependency on the input data distribution, limited capacity, diversity among tasks, and so on.
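The accumulated-gradient knowledge base lends itself to a short sketch. Below, per-task gradients are accumulated additively, and the learning rate is scaled by a cosine-based relatedness score; the scaling rule (1 + cos)/2 and taking the minimum over past tasks are assumptions for illustration, not TAG's exact update.

```python
import torch
import torch.nn.functional as F

class TaskGradientStore:
    """Per-task accumulated gradients acting as a knowledge base (sketch)."""

    def __init__(self):
        self.acc = {}  # task_id -> additively accumulated gradient

    def accumulate(self, task_id, grad):
        if task_id not in self.acc:
            self.acc[task_id] = torch.zeros_like(grad)
        self.acc[task_id] += grad.detach()

    def lr_scale(self, task_id, grad):
        # Scale in [0, 1]: shrink the step when the current gradient opposes a
        # previous task's accumulated direction, protecting old knowledge.
        scales = [
            ((1.0 + F.cosine_similarity(grad.flatten(), g.flatten(), dim=0)) / 2.0).item()
            for tid, g in self.acc.items() if tid != task_id
        ]
        return min(scales) if scales else 1.0
```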


Fast constraint satisfaction problem and learning-based algorithm for solving Minesweeper

arXiv.org Artificial Intelligence

Minesweeper is a popular spatial decision-making game played with incomplete information. As an exemplary NP-complete problem, it is a major area of research employing various artificial intelligence paradigms. The present work models the game as a Constraint Satisfaction Problem (CSP) and a Markov Decision Process (MDP). We propose a new method, dependents from the independent set using deterministic solution search (DSScsp), for faster enumeration of all solutions of the CSP formulation of Minesweeper, and improve the results by introducing heuristics. Using the MDP formulation, we apply machine learning methods on top of these heuristics: we train a classification model on sparse data with results from the CSP formulation. We also propose a new reward scheme for a modified deep Q-learning agent, yielding better accuracy and more versatile learning in Minesweeper. We analyze the overall results for different kinds of Minesweeper games and record their accuracies. These experiments show that the proposed MDP-based classification model and deep Q-learning are overall the most accurate methods for games with the given mine densities.
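For intuition about the CSP formulation, the brute-force enumerator below lists all mine assignments consistent with the revealed clues on a tiny board: each clue constrains the sum of binary mine variables over a revealed cell's unknown neighbours. This is the naive baseline that a faster enumeration method like DSScsp is designed to beat, not the paper's algorithm; the clue encoding is a placeholder.

```python
from itertools import product

def minesweeper_solutions(cells, clues):
    """Enumerate mine assignments satisfying all clues (tiny boards only).

    cells: ids of unknown cells (each gets a 0/1 mine variable).
    clues: list of (neighbour_cells, count) pairs, one per revealed number.
    """
    solutions = []
    for bits in product([0, 1], repeat=len(cells)):
        assign = dict(zip(cells, bits))
        if all(sum(assign[c] for c in nbrs) == n for nbrs, n in clues):
            solutions.append(assign)
    return solutions

# Two revealed "1" cells sharing the unknown cell "b".
clues = [(("a", "b"), 1), (("b", "c"), 1)]
print(minesweeper_solutions(["a", "b", "c"], clues))
```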


A Causal Linear Model to Quantify Edge Unfairness for Unfair Edge Prioritization and Discrimination Removal

arXiv.org Artificial Intelligence

A dataset can be generated by an unfair mechanism in numerous settings. For instance, a judicial system is unfair if it rejects the bail plea of an accused based on race. To mitigate the unfairness in the procedure generating the dataset, we need to identify the sources of unfairness, quantify the unfairness in these sources, quantify how these sources affect the overall unfairness, and prioritize the sources before addressing the real-world issues underlying them. Prior work (Zhang et al., 2017) identifies and removes discrimination after data is generated, but does not suggest a methodology to mitigate unfairness in the data generation phase. Following (Chiappa et al., 2018), we use the notion of an unfair edge as the source of discrimination and quantify unfairness along an unfair edge. We also quantify the overall unfairness of a particular decision towards a subset of sensitive attributes in terms of edge unfairness, and measure the sensitivity of the former when the latter is varied. Using the formulation of cumulative unfairness in terms of edge unfairness, we alter the discrimination removal methodology of (Zhang et al., 2017) so that it is no longer posed as an optimization problem, eliminating constraints that grow exponentially in the number of sensitive attributes and the values they take. Finally, we discuss a priority algorithm that policymakers can use to address the real-world issues underlying the edges that cause unfairness. The experimental section validates the linear model assumption made to quantify edge unfairness.
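The linear-model assumption, that cumulative unfairness is approximately linear in the edge-unfairness values, can be checked with ordinary least squares. The sketch below uses synthetic placeholder measurements; the variable names and the data-generation step are illustrations only, not the paper's data or estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Placeholder data: edge-unfairness values for 4 unfair edges across 50
# simulated settings, and the cumulative unfairness observed in each.
edge_u = rng.uniform(0.0, 1.0, size=(50, 4))
cumulative_u = edge_u @ np.array([0.5, 0.2, 0.2, 0.1]) + rng.normal(0.0, 0.01, 50)

fit = LinearRegression().fit(edge_u, cumulative_u)
print(fit.coef_)                        # sensitivity of cumulative unfairness to each edge
print(fit.score(edge_u, cumulative_u))  # R^2 near 1 supports the linearity assumption
```

Under this reading, the fitted coefficients give a natural prioritization: edges whose unfairness has the largest effect on cumulative unfairness are addressed first.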


Contextual Care Protocol using Neural Networks and Decision Trees

arXiv.org Machine Learning

A contextual care protocol is used by a medical practitioner for patient healthcare, given the context or situation the patient is in. This paper proposes a method to build an automated, self-adapting protocol that can help make relevant, early decisions for effective healthcare delivery. The hybrid model leverages neural networks and decision trees: the neural network estimates the likelihood of each disease, and each decision tree represents the care protocol for one disease. These trees are subject to change when diagnosticians find aberrations. Such corrections, or prediction errors, are clustered into similar groups for scalability and expert review, and the corrections suggested by the experts are incorporated back into the model.
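The described hybrid is straightforward to prototype: a neural network scores diseases, and a separate decision tree per disease supplies the protocol step. The scikit-learn sketch below uses random placeholder data and placeholder label spaces; it shows the routing pattern only, not the paper's actual model or features.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # placeholder patient features
disease = rng.integers(0, 3, size=200)  # placeholder disease labels
step = rng.integers(0, 4, size=200)     # placeholder protocol steps

# Neural network estimates the chances of each disease.
scorer = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, disease)
# One decision tree per disease encodes that disease's care protocol.
protocols = {
    d: DecisionTreeClassifier(max_depth=3).fit(X[disease == d], step[disease == d])
    for d in np.unique(disease)
}

x = X[:1]
d = scorer.predict(x)[0]              # most likely disease
print(d, protocols[d].predict(x)[0])  # protocol step from that disease's tree
```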