AITopics

2210.11235

Country:

North America > United States > North Carolina (0.04)
North America > United States > Oregon > Lane County > Eugene (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (0.70)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
(3 more...)

Wood, Killian, Dall'Anese, Emiliano

Stochastic Saddle Point Problems with Decision-Dependent Distributions

arXiv.org Artificial IntelligenceNov-11-2022

This paper focuses on stochastic saddle point problems with decision-dependent distributions. These are problems whose objective is the expected value of a stochastic payoff function and whose data distribution drifts in response to decision variables--a phenomenon represented by a distributional map. A common approach to accommodating distributional shift is to retrain optimal decisions once a new distribution is revealed, or repeated retraining. We introduce the notion of equilibrium points, which are the fixed points of this repeated retraining procedure, and provide sufficient conditions for their existence and uniqueness. To find equilibrium points, we develop deterministic and stochastic primal-dual algorithms and demonstrate their convergence with constant step-size in the former and polynomial decay step-size schedule in the latter. By modeling errors emerging from a stochastic gradient estimator as sub-Weibull random variables, we provide error bounds in expectation and in high probability that hold for each iteration. Without additional knowledge of the distributional map, computing saddle points is intractable. Thus we propose a condition on the distributional map--which we call opposing mixture dominance--that ensures that the objective is strongly-convex-strongly-concave. Finally, we demonstrate that derivative-free algorithms with a single function evaluation are capable of approximating saddle points

artificial intelligence, equilibrium point, machine learning, (19 more...)

2201.02313

Country:

North America > United States > Colorado (0.04)
Europe (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report (0.82)

Industry:

Transportation > Ground > Road (0.94)
Transportation > Electric Vehicle (0.68)
Automobiles & Trucks (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Liang, Jingwei, Poon, Clarice

Screening for Sparse Online Learning

arXiv.org Artificial IntelligenceNov-11-2022

Sparsity promoting regularizers are widely used to impose low-complexity structure (e.g. l1-norm for sparsity) to the regression coefficients of supervised learning. In the realm of deterministic optimization, the sequence generated by iterative algorithms (such as proximal gradient descent) exhibit "finite activity identification", namely, they can identify the low-complexity structure in a finite number of iterations. However, most online algorithms (such as proximal stochastic gradient descent) do not have the property owing to the vanishing step-size and non-vanishing variance. In this paper, by combining with a screening rule, we show how to eliminate useless features of the iterates generated by online algorithms, and thereby enforce finite activity identification. One consequence is that when combined with any convergent online algorithm, sparsity properties imposed by the regularizer can be exploited for computational gains. Numerically, significant acceleration can be obtained.

artificial intelligence, machine learning, sparse online learning, (1 more...)

2101.06982

Genre: Research Report (0.40)

Industry: Education > Educational Setting > Online (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Chi, Jocelyn T., Needell, Deanna

Sketched Gaussian Model Linear Discriminant Analysis via the Randomized Kaczmarz Method

arXiv.org Machine LearningNov-10-2022

We harness a least squares formulation and mobilize the stochastic gradient descent framework. Therefore, we obtain a randomized classifier with performance that is very comparable to that of full data LDA while requiring access to only one row of the training data at a time. We present convergence guarantees for the sketched predictions on new data within a fixed number of iterations. These guarantees account for both the Gaussian modeling assumptions on the data and algorithmic randomness from the sketching procedure. Finally, we demonstrate performance with varying step-sizes and numbers of iterations. Our numerical experiments demonstrate that sketched LDA can offer a very viable alternative to full data LDA when the data may be too large for full data analysis.

algorithm 3, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

2211.05749

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Discriminant Analysis (0.42)

arXiv.org Artificial IntelligenceNov-9-2022

Finding Second-Order Stationary Points in Nonconvex-Strongly-Concave Minimax Optimization

Luo, Luo, Li, Yujun, Chen, Cheng

We study the smooth minimax optimization problem $\min_{\bf x}\max_{\bf y} f({\bf x},{\bf y})$, where $f$ is $\ell$-smooth, strongly-concave in ${\bf y}$ but possibly nonconvex in ${\bf x}$. Most of existing works focus on finding the first-order stationary points of the function $f({\bf x},{\bf y})$ or its primal function $P({\bf x})\triangleq \max_{\bf y} f({\bf x},{\bf y})$, but few of them focus on achieving second-order stationary points. In this paper, we propose a novel approach for minimax optimization, called Minimax Cubic Newton (MCN), which could find an $\big(\varepsilon,\kappa^{1.5}\sqrt{\rho\varepsilon}\,\big)$-second-order stationary point of $P({\bf x})$ with calling ${\mathcal O}\big(\kappa^{1.5}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of second-order oracles and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of first-order oracles, where $\kappa$ is the condition number and $\rho$ is the Lipschitz continuous constant for the Hessian of $f({\bf x},{\bf y})$. In addition, we propose an inexact variant of MCN for high-dimensional problems to avoid calling expensive second-order oracles. Instead, our method solves the cubic sub-problem inexactly via gradient descent and matrix Chebyshev expansion. This strategy still obtains the desired approximate second-order stationary point with high probability but only requires $\tilde{\mathcal O}\big(\kappa^{1.5}\ell\varepsilon^{-2}\big)$ Hessian-vector oracle calls and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ first-order oracle calls. To the best of our knowledge, this is the first work that considers the non-asymptotic convergence behavior of finding second-order stationary points for minimax problems without the convex-concave assumptions.

algorithm, artificial intelligence, machine learning, (17 more...)

2110.04814

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Singapore (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Artificial IntelligenceNov-9-2022

Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches

He, Haoze, Dube, Parijat

SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives like Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously there is still a synchronization barrier to make sure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delays in waiting for the slowest learners(stragglers) remain to be a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose the (de)centralized Non-blocking SGD (Non-blocking SGD) which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on finished mini-batches. The Non-blocking idea can be implemented using decentralized algorithms including Ring All-reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. Run-time analysis with random straggler delays and computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up the training and fastens the convergence. Compared with the state-of-the-art decentralized asynchronous algorithms like D-PSGD and MACHA, Non-blocking SGD takes up to 2x fewer time to reach the same training loss in a heterogeneous environment.

algorithm, artificial intelligence, machine learning, (15 more...)

2211.00889

Country: North America > United States > New York (0.04)

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Liao, Chonghua, Zheng, Yanan, Yang, Zhilin

Zero-Label Prompt Selection

Natural language prompts have been shown to facilitate cross-task generalization for large language models. However, with no or limited labeled examples, the cross-task performance is highly sensitive to the choice of prompts, while selecting a high-performing prompt is challenging given the scarcity of labels. To address the issue, we propose a Zero-Label Prompt Selection (ZPS) method that selects prompts without any labeled data or gradient update. Specifically, given the candidate human-written prompts for a task, ZPS labels a set of unlabeled data with a prompt ensemble and uses the pseudo-labels for prompt selection. Experiments show that ZPS improves over prior methods by a sizeable margin in zero-label performance. We also extend ZPS to a few-shot setting and show its advantages over strong baselines such as prompt tuning and model tuning. Recently, extensive studies have shown that large language models (LLMs) have promising performance for few-shot learning (Brown et al., 2020; Zhao et al., 2021; Schick & Schütze, 2021; Gao et al., 2021), and they even show strong generalization abilities to new tasks without any annotated data (Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021). Different from conventional fine-tuning methods that require expensive parameter updates for each downstream task, prompts are employed to provide in-context information or task instructions, which is helpful for guiding models to perform each task. Manually-written prompts are often used to specify the task and unify the format of inputs. However, the performance of different prompts during evaluation can vary from near state-of-the-art to random guess; e.g., using a non-optimal prompt can cause a performance drop of up to 60 points on the CB task (Zhao et al., 2021).

computational linguistic, large language model, machine learning, (16 more...)

2211.04668

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Guangxi Province > Nanning (0.04)
(4 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Harchaoui, Zaid, Oh, Sewoong, Pal, Soumik, Somani, Raghav, Tripathi, Raghavendra

Stochastic optimization on matrices and a graphon McKean-Vlasov limit

We consider stochastic gradient descents on the space of large symmetric matrices of suitable functions that are invariant under permuting the rows and columns using the same permutation. We establish deterministic limits of these random curves as the dimensions of the matrices go to infinity while the entries remain bounded. Under a "small noise" assumption the limit is shown to be the gradient flow of functions on graphons whose existence was established in [OPST21]. We also consider limits of stochastic gradient descents with added properly scaled reflected Brownian noise. The limiting curve of graphons is characterized by a family of stochastic differential equations with reflections and can be thought of as an extension of the classical McKean-Vlasov limit for interacting diffusions to the graphon setting. The proofs introduce a family of infinite-dimensional exchangeable arrays of reflected diffusions and a novel notion of propagation of chaos for large matrices of diffusions converging to such arrays in a suitable sense.

artificial intelligence, equation, machine learning, (19 more...)

2210.00422

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > New York (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Kumar, Sachin, Paria, Biswajit, Tsvetkov, Yulia

Gradient-Based Constrained Sampling from Language Models

Large pretrained language models generate fluent text but are notoriously hard to controllably sample from. In this work, we study constrained sampling from such language models: generating text that satisfies user-defined constraints, while maintaining fluency and the model's performance in a downstream task. We propose MuCoLa -- a sampling procedure that combines the log-likelihood of the language model with arbitrary (differentiable) constraints in a single energy function, and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of the energy function. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.

artificial intelligence, machine learning, natural language, (19 more...)

2205.12558

Country:

North America > United States > Washington > King County > Seattle (0.28)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(21 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment > Sports > Soccer (1.00)
Health & Medicine (0.94)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Black Box Lie Group Preconditioners for SGD

Li, Xilin

A matrix free and a low rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessian-vector products or finite differences of parameters and gradients similar to the BFGS algorithm. Both preconditioners are fitted with an online updating manner minimizing a criterion that is free of line search and robust to stochastic gradient noise, and further constrained to be on certain connected Lie groups to preserve their corresponding symmetry or invariance, e.g., orientation of coordinates by the connected general linear group with positive determinants. The Lie group's equivariance property facilitates preconditioner fitting, and its invariance property saves any need of damping, which is common in second-order optimizers, but difficult to tune. The learning rate for parameter updating and step size for preconditioner fitting are naturally normalized, and their default values work well in most situations.

artificial intelligence, machine learning, preconditioner, (14 more...)

2211.04422

Country:

North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.65)

Industry: Transportation > Air (0.41)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)