Supplementary Materials for Paper "Bandit Samplers for Training Graph Neural Networks"
We show the convergence on validation data in terms of timing (seconds) in Figure 1 and Figure 2. Basically, our algorithms converge to much better results in nearly the same duration compared with the baselines. Note that we cannot complete the training of AS-GAT on Reddit because of memory issues. Note also that timing comparisons between the "graph sampling" and "layer sampling" paradigms are problematic; as a result, we do not compare timing with "graph sampling" approaches. The reason is that graph sampling approaches are designed for graph data in which all vertices have labels. To summarize, the "layer sampling" approaches are more flexible and general compared with "graph sampling" approaches.

Before we give the proof of Theorem 1, we first prove the following Lemma 1, which will be used later.
Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression
Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
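As a rough illustration of the setting studied here, the sketch below runs online SGD on a synthetic linear regression problem while maintaining an exponential moving average of the iterates. All constants (dimension, step size, decay rate) are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (not from the paper).
d, n_steps, lr, ema_decay = 10, 2000, 0.05, 0.99
w_true = rng.normal(size=d)

w = np.zeros(d)      # last SGD iterate
w_ema = np.zeros(d)  # exponential moving average of iterates

for _ in range(n_steps):
    x = rng.normal(size=d)               # fresh sample each step (online SGD)
    y = x @ w_true + 0.1 * rng.normal()  # noisy label
    grad = (x @ w - y) * x               # gradient of 0.5 * (x @ w - y)**2
    w -= lr * grad
    # EMA update: w_ema_t = decay * w_ema_{t-1} + (1 - decay) * w_t
    w_ema = ema_decay * w_ema + (1.0 - ema_decay) * w

err_last = np.linalg.norm(w - w_true)
err_ema = np.linalg.norm(w_ema - w_true)
```

The EMA averages over an effective window of roughly 1 / (1 - decay) recent iterates, which is what smooths out the variance of the last iterate while still letting the bias decay, in line with the abstract's claims.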
Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
Li, Xuheng, Deng, Yihe, Wu, Jingfeng, Zhou, Dongruo, Gu, Quanquan
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.
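A minimal sketch of the kind of procedure analyzed here, using standard Nesterov-style momentum on synthetic online linear regression. The paper's exact ASGD parametrization may differ, and all constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative constants (not from the paper).
d, n_steps, lr, momentum = 10, 3000, 0.005, 0.9
w_true = rng.normal(size=d)

w = np.zeros(d)
v = np.zeros(d)  # velocity (momentum buffer)

for _ in range(n_steps):
    x = rng.normal(size=d)
    y = x @ w_true + 0.1 * rng.normal()
    lookahead = w + momentum * v    # Nesterov look-ahead point
    grad = (x @ lookahead - y) * x  # stochastic gradient evaluated at look-ahead
    v = momentum * v - lr * grad
    w = w + v

err = np.linalg.norm(w - w_true)
```

The momentum buffer accumulates past gradients, which is what accelerates the bias decay but also amplifies gradient noise, consistent with the bias/variance trade-off the abstract describes.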
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- (2 more...)
Bandit Samplers for Training Graph Neural Networks
Liu, Ziqi, Wu, Zhengwei, Zhang, Zhiqiang, Zhou, Jun, Yang, Shuang, Song, Le, Qi, Yuan
Several sampling algorithms with variance reduction have been proposed for accelerating the training of Graph Convolutional Networks (GCNs). However, due to the intractable computation of the optimal sampling distribution, these sampling algorithms are suboptimal for GCNs and are not applicable to more general graph neural networks (GNNs) where the message aggregator contains learned weights rather than fixed weights, such as Graph Attention Networks (GAT). The fundamental reason is that the embeddings of the neighbors or the learned weights involved in the optimal sampling distribution change during training and are not known a priori, but are only partially observed when sampled, making the derivation of optimal variance-reduced samplers non-trivial. In this paper, we formulate the optimization of the sampling variance as an adversarial bandit problem, where the rewards are related to the node embeddings and learned weights, and can vary constantly. Thus a good sampler needs to acquire variance information about more neighbors (exploration) while at the same time optimizing the immediate sampling variance (exploitation). We theoretically show that our algorithm asymptotically approaches the optimal variance within a factor of 3. We show the efficiency and effectiveness of our approach on multiple datasets.
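The adversarial-bandit view above can be illustrated with an EXP3-style sampler. The sketch below is a generic EXP3 variant with uniform exploration mixing, not the paper's actual sampler; the fixed reward means are a stand-in for the changing, embedding-dependent rewards described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(2)
n_arms, n_rounds, eta, gamma = 5, 500, 0.05, 0.1

# Hypothetical fixed reward means; in the paper, per-neighbor rewards depend
# on embeddings and learned weights and drift during training.
true_means = np.array([0.1, 0.2, 0.3, 0.5, 0.9])

weights = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(n_rounds):
    # Mix in uniform exploration so every arm keeps some probability mass.
    probs = (1.0 - gamma) * weights / weights.sum() + gamma / n_arms
    arm = rng.choice(n_arms, p=probs)
    reward = float(np.clip(true_means[arm] + 0.1 * rng.normal(), 0.0, 1.0))
    # Importance-weighted update: only the sampled arm's weight changes, and
    # dividing by probs[arm] keeps the reward estimate unbiased.
    weights[arm] *= np.exp(eta * reward / probs[arm])
    pulls[arm] += 1
```

The uniform mixing term is what enforces exploration (every neighbor keeps being sampled occasionally), while the exponential weights concentrate sampling mass on the arms with the highest observed rewards (exploitation).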
Stochastic Optimization with Bandit Sampling
Salehi, Farnood, Celis, L. Elisa, Thiran, Patrick
Many stochastic optimization algorithms work by estimating the gradient of the cost function on the fly by sampling datapoints uniformly at random from a training set. However, the estimator might have a large variance, which inadvertently slows down the convergence rate of the algorithms. One way to reduce this variance is to sample the datapoints from a carefully selected non-uniform distribution. In this work, we propose a novel non-uniform sampling approach that uses the multi-armed bandit framework. Theoretically, we show that our algorithm asymptotically approximates the optimal variance within a factor of 3. Empirically, we show that using this datapoint-selection technique results in a significant reduction in the convergence time and variance of several stochastic optimization algorithms such as SGD, SVRG and SAGA. This approach for sampling datapoints is general, and can be used in conjunction with any algorithm that uses an unbiased gradient estimation -- we expect it to have broad applicability beyond the specific examples explored in this work.
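The unbiasedness requirement mentioned above hinges on importance weighting: if datapoint i is drawn with probability p_i, scaling its gradient by 1 / (n * p_i) keeps the estimator's expectation equal to the full-batch gradient. A minimal sketch with a fixed, norm-proportional distribution (a classical importance-sampling heuristic, not the bandit-learned distribution of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lr, n_steps = 100, 5, 0.01, 3000

X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star  # noiseless labels for simplicity

# Hypothetical fixed sampling distribution proportional to row norms.
p = np.linalg.norm(X, axis=1)
p /= p.sum()

w = np.zeros(d)
for _ in range(n_steps):
    i = rng.choice(n, p=p)
    grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i @ w - y_i)**2
    # Scaling by 1 / (n * p_i) makes the estimator unbiased:
    # E[grad_i / (n * p_i)] = (1 / n) * sum_j grad_j, the full-batch gradient.
    w -= lr * grad_i / (n * p[i])
```

Because any positive p keeps the estimator unbiased, the distribution is a free parameter that only affects variance, which is exactly the degree of freedom the bandit framework optimizes.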
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Switzerland (0.04)