AITopics

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.94)

Neural Information Processing Systems, Oct-10-2024, 00:21:23 GMT

Understanding the Role of Momentum in Stochastic Gradient Methods

The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.
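As a rough illustration of the QHM formulation mentioned above, the sketch below applies the commonly stated QHM update (a gradient moving average `buf` with momentum `beta`, mixed with the raw gradient through `nu`) to a one-dimensional quadratic. The function names, the test objective 0.5 * theta^2, and all parameter values are assumptions chosen for illustration and are not taken from the paper.

```python
import random

def qhm_step(theta, buf, grad, lr=0.1, beta=0.9, nu=0.7):
    """One QHM update for a scalar parameter (illustrative names).

    buf is an exponential moving average of past gradients.
    nu = 0 reduces the step to plain SGD; nu = 1 uses only the average.
    """
    buf = beta * buf + (1.0 - beta) * grad
    theta = theta - lr * ((1.0 - nu) * grad + nu * buf)
    return theta, buf

def run_qhm(steps=2000, lr=0.1, beta=0.9, nu=0.7, noise=0.0, seed=0):
    """Minimize 0.5 * theta**2 with optional Gaussian gradient noise."""
    rng = random.Random(seed)
    theta, buf = 5.0, 0.0
    for _ in range(steps):
        # Exact gradient of 0.5 * theta**2 is theta; noise is an assumption.
        grad = theta + noise * rng.gauss(0.0, 1.0)
        theta, buf = qhm_step(theta, buf, grad, lr, beta, nu)
    return theta

final_theta = run_qhm()
```

With `noise=0.0` the iterate contracts toward the minimizer at zero; setting `noise` above zero leaves the iterate fluctuating around zero, which is the kind of stationary behavior the abstract refers to.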

momentum, stationary distribution, stochastic gradient method, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Yin, George, Krishnamurthy, Vikram

Finite Sample and Large Deviations Analysis of Stochastic Gradient Algorithm with Correlated Noise

This paper focuses on finite sample analysis for stochastic gradient algorithms. The motivation stems from a wide variety of applications. In particular, recent advances in stochastic optimization in conjunction with machine learning have opened up new domains. A particular emphasis of the learning community requires us to take a careful look at finite sample analysis. It is well known that work on stochastic gradient algorithms or stochastic approximation algorithms has normally concentrated on the asymptotic properties of the recursive algorithms. However, the learning community has placed more effort on analyzing finite sample properties of the recursive algorithms; see, for example,... and references therein.

artificial intelligence, machine learning, stochastic gradient algorithm, (14 more...)

2410.08449

Country:

North America > United States > New York (0.04)
North America > United States > Connecticut (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.82)

Makur, Anuran, Singh, Japneet

Minimax Hypothesis Testing for the Bradley-Terry-Luce Model

The Bradley-Terry-Luce (BTL) model is one of the most widely used models for ranking a collection of items or agents based on pairwise comparisons among them. In this work, our objective is to formulate a hypothesis test that determines whether a given pairwise comparison dataset, with k comparisons per pair of agents, originates from an underlying BTL model. We formalize this testing problem in the minimax sense and define the critical threshold of the problem. We then establish upper bounds on the critical threshold for general induced observation graphs (satisfying mild assumptions) and develop lower bounds for complete induced graphs. In particular, our test statistic for the upper bounds is based on a new approximation we derive for the separation distance between general pairwise comparison models and the class of BTL models. To further assess the performance of our statistical test, we prove upper bounds on the type I and type II probabilities of error. Much of our analysis is conducted within the context of a fixed observation graph structure, where the graph possesses certain "nice" properties, such as expansion and bounded principal ratio. Finally, we conduct several experiments on synthetic and real-world datasets to validate some of our theoretical results. Moreover, we also propose an approach based on permutation testing to determine the threshold of our test in a data-driven manner in these experiments. In recent years, the availability of pairwise comparison data and its subsequent analysis has significantly increased across diverse domains. Pairwise comparison data consists of information gathered in the form of comparisons made among a given set of items or agents. Many real-world applications, including sports tournaments, consumer preference surveys, and political voting, generate data in the form of pairwise comparisons. 
Such datasets serve a range of purposes, such as ranking items [2]-[12], analyzing team performance over time [13], studying market or sports competitiveness [14], [15], and even fine-tuning large language models using reinforcement learning from human feedback [16], [17]. A popular modeling assumption when performing such learning and inference tasks with pairwise comparison data is that the data conforms to an underlying Bradley-Terry-Luce (BTL) model [2]-[6] as a generative model. In its standard form, the BTL model assigns each item i a positive score \alpha_i, and P(i is preferred over j) = \alpha_i / (\alpha_i + \alpha_j). The BTL model is known to be a natural consequence of the assumption of independence of irrelevant alternatives (IIA), which is widely used in economics and social choice theory [3].
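To make the data-generating assumption concrete, the sketch below computes the standard BTL preference probability, score_i / (score_i + score_j), and simulates k comparisons for every pair of items (a complete observation graph). The names `btl_prob` and `simulate_comparisons` and the example scores are hypothetical; this is not the paper's test statistic, only the model it tests against.

```python
import random

def btl_prob(alpha_i, alpha_j):
    """Standard BTL probability that item i is preferred over item j."""
    return alpha_i / (alpha_i + alpha_j)

def simulate_comparisons(alpha, k, seed=0):
    """Simulate k independent comparisons for every pair i < j.

    Returns a dict mapping (i, j) to the number of times i beat j.
    """
    rng = random.Random(seed)
    wins = {}
    n = len(alpha)
    for i in range(n):
        for j in range(i + 1, n):
            p = btl_prob(alpha[i], alpha[j])
            wins[(i, j)] = sum(rng.random() < p for _ in range(k))
    return wins

scores = [3.0, 1.0, 1.0]   # assumed positive scores, for illustration
wins = simulate_comparisons(scores, k=20)
```

Each count in `wins` is a binomial draw with k trials, so a dataset that departs from the BTL model would show win frequencies that no single score vector can explain.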

artificial intelligence, btl model, machine learning, (16 more...)

2410.0836

Country:

Europe > Austria > Vienna (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
(11 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Cricket (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.61)

Li, Sirui, Kulkarni, Janardhan, Menache, Ishai, Wu, Cathy, Li, Beibin

Towards Foundation Models for Mixed Integer Linear Programming

Mixed Integer Linear Programming (MILP) is essential for modeling complex decision-making problems but faces challenges in computational tractability and requires expert formulation. Current deep learning approaches for MILP focus on specific problem classes and do not generalize to unseen classes. To address this shortcoming, we take a foundation model training approach, where we train a single deep learning model on a diverse set of MILP problems to generalize across problem classes. As existing datasets for MILP lack diversity and volume, we introduce MILP-Evolve, a novel LLM-based evolutionary framework that is capable of generating a large set of diverse MILP classes with an unlimited amount of instances. We study our methodology on three key learning tasks that capture diverse aspects of MILP: (1) integrality gap prediction, (2) learning to branch, and (3) a new task of aligning MILP instances with natural language descriptions. Our empirical results show that models trained on the data generated by MILP-Evolve achieve significant improvements on unseen problems, including MIPLIB benchmarks. Our work highlights the potential of moving towards a foundation model approach for MILP that can generalize to a broad range of MILP applications. We are committed to fully open-sourcing our work to advance further research.

artificial intelligence, machine learning, milp, (18 more...)

2410.08288

Country:

Oceania > Australia > Victoria (0.04)
Oceania > Australia > South Australia (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(6 more...)

Genre: Research Report > New Finding (0.47)

Industry:

Energy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

De Toni, Giovanni, Teso, Stefano, Lepri, Bruno, Passerini, Andrea

Time Can Invalidate Algorithmic Recourse

Algorithmic Recourse (AR) aims to provide users with actionable steps to overturn unfavourable decisions made by machine learning predictors. However, these actions often take time to implement (e.g., getting a degree can take years), and their effects may vary as the world evolves. Thus, it is natural to ask for recourse that remains valid in a dynamic environment. In this paper, we study the robustness of algorithmic recourse over time by casting the problem through the lens of causality. We demonstrate theoretically and empirically that (even robust) causal AR methods can fail over time except in the - unlikely - case that the world is stationary. Even more critically, unless the world is fully deterministic, counterfactual AR cannot be solved optimally. To account for this, we propose a simple yet effective algorithm for temporal AR that explicitly accounts for time. Our simulations on synthetic and realistic datasets show how considering time produces more resilient solutions to potential trends in the data distribution.

intervention, recourse, stochastic process, (12 more...)

2410.08007

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.31)

Feinberg, Brooke, Li, Aiwen

Heating Up Quasi-Monte Carlo Graph Random Features: A Diffusion Kernel Perspective

We build upon a recently introduced class of quasi-graph random features (q-GRFs), which have demonstrated the ability to yield lower variance estimators of the 2-regularized Laplacian kernel (Choromanski 2023). Our research investigates whether similar results can be achieved with alternative kernel functions, specifically the Diffusion (or Heat), Matérn, and Inverse Cosine kernels. We find that the Diffusion kernel performs most similarly to the 2-regularized Laplacian, and we further explore graph types that benefit from the previously established antithetic termination procedure. Specifically, we explore Erdős-Rényi and Barabási-Albert random graph models, Binary Trees, and Ladder graphs, with the goal of identifying combinations of specific kernel and graph type that benefit from antithetic termination. We assert that q-GRFs achieve lower variance estimators of the Diffusion (or Heat) kernel on Ladder graphs. However, the number of rungs on the Ladder graphs impacts the algorithm's performance; further theoretical results supporting our experimentation are forthcoming. This work builds upon some of the earliest Quasi-Monte Carlo methods for kernels defined on combinatorial objects, paving the way for kernel-based learning algorithms and future real-world applications in various domains.

artificial intelligence, kernel, machine learning, (14 more...)

2410.08389

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (0.35)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.35)

Neural Information Processing Systems, Oct-9-2024, 19:20:25 GMT

Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

We consider the task of sampling with respect to a log concave probability distribution. The potential of the target distribution is assumed to be composite, i.e., written as the sum of a smooth convex term, and a nonsmooth convex term possibly taking infinite values. The target distribution can be seen as a minimizer of the Kullback-Leibler divergence defined on the Wasserstein space (i.e., the space of probability measures). In the first part of this paper, we establish a strong duality result for this minimization problem. In the second part of this paper, we use the duality gap arising from the first part to study the complexity of the Proximal Stochastic Gradient Langevin Algorithm (PSGLA), which can be seen as a generalization of the Projected Langevin Algorithm.
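A minimal sketch of a proximal Langevin iteration may help fix ideas: each step takes a gradient step on the smooth term, adds Gaussian noise scaled by sqrt(2 * gamma), and then applies the proximal operator of the nonsmooth term. The example assumes a smooth term F(x) = x^2 / 2 and a nonsmooth term equal to the indicator of [0, inf), whose proximal operator is a projection, so the iteration reduces to a projected Langevin step. The exact gradient is used where the algorithm would use a stochastic estimate, and all names and parameter values are illustrative assumptions.

```python
import math
import random

def prox_nonneg(v, gamma):
    """Prox of the indicator of [0, inf): projection onto it (gamma unused)."""
    return max(v, 0.0)

def psgla_chain(grad_f, prox_g, x0=1.0, gamma=0.01, steps=5000, seed=0):
    """Run a scalar proximal Langevin chain and return all iterates.

    grad_f: gradient of the smooth term (exact here, by assumption).
    prox_g: proximal operator of the nonsmooth term, called as prox_g(v, gamma).
    """
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        noise = math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
        x = prox_g(x - gamma * grad_f(x) + noise, gamma)
        samples.append(x)
    return samples

# Smooth term x**2 / 2 has gradient x; the constraint keeps samples >= 0.
samples = psgla_chain(grad_f=lambda x: x, prox_g=prox_nonneg)
```

Swapping `prox_nonneg` for another proximal operator (for example, soft-thresholding for an absolute-value penalty) changes the nonsmooth term without touching the rest of the loop.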

langevin algorithm, primal dual interpretation, proximal stochastic gradient langevin algorithm, (5 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Neural Information Processing Systems, Oct-9-2024, 17:43:10 GMT

Exploiting Local Convergence of Quasi-Newton Methods Globally: Adaptive Sample Size Approach

In this paper, we study the application of quasi-Newton methods for solving empirical risk minimization (ERM) problems defined over a large dataset. Traditional deterministic and stochastic quasi-Newton methods can be executed to solve such problems; however, it is known that their global convergence rate may not be better than that of first-order methods, and their local superlinear convergence only appears towards the end of the learning process. In this paper, we use an adaptive sample size scheme that exploits the superlinear convergence of quasi-Newton methods globally and throughout the entire learning process. The main idea of the proposed adaptive sample size algorithms is to start with a small subset of data points and solve the corresponding ERM problem to within its statistical accuracy, then enlarge the sample size geometrically and use the optimal solution of the problem on the smaller set as the initial point for solving the subsequent ERM problem with more samples. We show that if the initial sample size is sufficiently large and we use quasi-Newton methods to solve each subproblem, the subproblems can be solved superlinearly fast (after at most three iterations), as we guarantee that the iterates always stay within a neighborhood in which quasi-Newton methods converge superlinearly.

adaptive sample size approach, exploiting local convergence, quasi-newton method globally, (2 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)

Neural Information Processing Systems, Oct-9-2024, 17:21:42 GMT

Random Quadratic Forms with Dependence: Applications to Restricted Isometry and Beyond

Several important families of computational and statistical results in machine learning and randomized algorithms rely on uniform bounds on quadratic forms of random vectors or matrices. Such results include the Johnson-Lindenstrauss (J-L) Lemma, the Restricted Isometry Property (RIP), randomized sketching algorithms, and approximate linear algebra. The existing results critically depend on statistical independence, e.g., independent entries for random vectors, independent rows for random matrices, etc., which prevents their use in dependent or adaptive modeling settings. In this paper, we show that such independence is in fact not needed for such results, which continue to hold under fairly general dependence structures. In particular, we present uniform bounds on random quadratic forms of stochastic processes which are conditionally independent and sub-Gaussian given another (latent) process.

random quadratic form, restricted isometry, stochastic process, (5 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.76)