AITopics

Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than $n$ (sample size) neurons when the activation is smooth. First, we prove that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local-min or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.

artificial intelligence, machine learning, neural network, (17 more...)

2210.12001

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)
(2 more...)

Genre: Research Report > New Finding (0.87)

Industry: Leisure & Entertainment > Sports (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

A sharp uniform-in-time error estimate for Stochastic Gradient Langevin Dynamics

Li, Lei, Wang, Yuliang

The Stochastic Gradient Langevin Dynamics (SGLD) [49], first proposed by Welling and Teh, has drawn great attention of researchers when dealing with optimization or sampling tasks[2, 33, 40]. As a samplingalgorithm, SGLD canbe viewed asa"randombatch"version of the Unadjusted Langevin Algorithm (ULA), which is the Euler-Maruyama discretization of the Langevin diffusion, a stochastic process converging to a target Gibbs' distribution under suitable settings. As an optimization algorithm, SGLD can be viewed as a variant of the classical Stochastic Gradient Descent (SGD) [44], by adding independent Gaussian noise in each iteration of SGD. At recent decades, SGD and its variants [44, 25, 11, 37] have received a great deal of attention when solving high-dimensional tasks, ranging from computer vision, natural language processing, to high dimensional sampling, statistical optimization, etc. Also much theoretical analysis for SGD has been done by former researchers, including loss landscape of SGD iteration [46, 47], its dynamical stability [50] and diffusion approximation [32, 21, 17]. The combination of the SGD algorithm and the Langevin diffusion, can improve the behavior of both methods: for SGD, by taking another independent diffusion term into consideration, though not converging to a fixed point, the algorithm may be able to admit better ergodic properties and obtain better performance near saddle points [26, 52]. Besides, the application of the methodology of random mini-batch to Langevin diffusion could result in some efficient methods that could reduce computational cost while preserving the dynamical and statistical properties. Examples include the SGLD algorithm we study in the paper and the random batch methods for interacting particle systems [22, 23].

artificial intelligence, inequality, machine learning, (15 more...)

2207.09304

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Zhang, Xianyang, Dawn, Trisha

Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis

One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential method (SE) that can be coupled with gradient descent (SeGD) and quasi-Newton's method (SeN) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.

algorithm, artificial intelligence, machine learning, (16 more...)

2210.12235

Country: North America > United States > Texas > Brazos County > College Station (0.04)

Genre:

Research Report (0.65)
Overview (0.46)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Bars, Batiste Le, Bellet, Aurélien, Tommasi, Marc, Lavoie, Erick, Kermarrec, Anne-Marie

Refined Convergence and Topology Learning for Decentralized SGD with Heterogeneous Data

One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the popular Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.

artificial intelligence, machine learning, topology, (14 more...)

2204.04452

Country: Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Davi, Caio, Braga-Neto, Ulisses

PSO-PINN: Physics-Informed Neural Networks Trained with Particle Swarm Optimization

Physics-informed neural networks (PINN) have recently emerged as a promising application of deep learning in a wide range of engineering and scientific problems based on partial differential equation (PDE) models. However, evidence shows that PINN training by gradient descent displays pathologies that often prevent convergence when solving PDEs with irregular solutions. In this paper, we propose the use of a particle swarm optimization (PSO) approach to train PINNs. The resulting PSO-PINN algorithm not only mitigates the undesired behaviors of PINNs trained with standard gradient descent but also presents an ensemble approach to PINN that affords the possibility of robust predictions with quantified uncertainty. We also propose PSO-BP-CD (PSO with Back-Propagation and Coefficient Decay), a hybrid PSO variant that combines swarm optimization with gradient descent, putting more weight on the latter as training progresses and the swarm zeros in on a good local optimum. Comprehensive experimental results show that PSO-PINN with the proposed PSO-BP-CD algorithm outperforms PINN ensembles trained with other PSO variants or with pure gradient descent.

artificial intelligence, evolutionary algorithm, machine learning, (17 more...)

2202.01943

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets

Yang, Chengrun, Bender, Gabriel, Liu, Hanxiao, Kindermans, Pieter-Jan, Udell, Madeleine, Lu, Yifeng, Le, Quoc, Huang, Da

The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.

artificial intelligence, deep learning, machine learning, (18 more...)

2204.07615

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.64)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Lanconelli, Alberto, Lauria, Christopher S. A.

A note on diffusion limits for stochastic gradient descent

In the machine learning literature stochastic gradient descent has recently been widely discussed for its purported implicit regularization properties. Much of the theory, that attempts to clarify the role of noise in stochastic gradient algorithms, has widely approximated stochastic gradient descent by a stochastic differential equation with Gaussian noise. We provide a novel rigorous theoretical justification for this practice that showcases how the Gaussianity of the noise arises naturally.

artificial intelligence, machine learning, stochastic gradient descent, (11 more...)

2210.11257

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.05)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry: Education (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures

Abbe, Emmanuel, Bengio, Samy, Cornacchia, Elisabetta, Kleinberg, Jon, Lotfi, Aryo, Raghu, Maithra, Zhang, Chiyuan

This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.

artificial intelligence, generalization error, machine learning, (20 more...)

2205.13647

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Li, Xiao, Milzarek, Andre

A Unified Convergence Theorem for Stochastic Optimization Methods

arXiv.org Artificial IntelligenceOct-19-2022

In this work, we provide a fundamental unified convergence theorem used for deriving expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires to verify several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.

artificial intelligence, convergence, machine learning, (17 more...)

2206.03907

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Cyffers, Edwige, Even, Mathieu, Bellet, Aurélien, Massoulié, Laurent

Muffliato: Peer-to-Peer Privacy Amplification for Decentralized Optimization and Averaging

arXiv.org Artificial IntelligenceOct-19-2022

Decentralized optimization is increasingly popular in machine learning for its scalability and efficiency. Intuitively, it should also provide better privacy guarantees, as nodes only observe the messages sent by their neighbors in the network graph. But formalizing and quantifying this gain is challenging: existing results are typically limited to Local Differential Privacy (LDP) guarantees that overlook the advantages of decentralization. In this work, we introduce pairwise network differential privacy, a relaxation of LDP that captures the fact that the privacy leakage from a node $u$ to a node $v$ may depend on their relative position in the graph. We then analyze the combination of local noise injection with (simple or randomized) gossip averaging protocols on fixed and random communication graphs. We also derive a differentially private decentralized optimization algorithm that alternates between local gradient descent steps and gossip averaging. Our results show that our algorithms amplify privacy guarantees as a function of the distance between nodes in the graph, matching the privacy-utility trade-off of the trusted curator, up to factors that explicitly depend on the graph topology. Finally, we illustrate our privacy gains with experiments on synthetic and real-world datasets.

artificial intelligence, graph, machine learning, (16 more...)

2206.05091

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Indiana > Monroe County > Bloomington (0.04)
Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)