Wang, Jianyu
Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning
Nguyen, John, Wang, Jianyu, Malik, Kshitiz, Sanjabi, Maziar, Rabbat, Michael
An oft-cited challenge of federated learning is the presence of heterogeneity. \emph{Data heterogeneity} refers to the fact that data from different clients may follow very different distributions. \emph{System heterogeneity} refers to client devices having different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from a random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before starting federated training. Using four standard federated learning benchmark datasets, we empirically study the impact of starting from a pre-trained model in federated learning. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of models that are more accurate (by up to 40\%) than is possible when starting from a random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We therefore recommend that future work proposing and evaluating federated optimization methods evaluate performance when starting from both random and pre-trained initializations. This study raises several questions for further work on understanding the role of heterogeneity in federated optimization. \footnote{Our code is available at: \url{https://github.com/facebookresearch/where_to_begin}}
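As a concrete illustration of the setup studied here, the following sketch runs a minimal FedAvg loop on a toy linear least-squares task, comparing a random initialization with a "pre-trained" one. Pre-training on proxy data is approximated by an initialization near a good solution; the data shapes and hyperparameters are hypothetical, and this is not the paper's experimental pipeline.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, steps=10):
    """A client's local training: a few gradient steps on its own data."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def fedavg(w_init, clients, rounds=10):
    """Broadcast, train locally, then average weighted by client data size."""
    w = w_init.copy()
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    for _ in range(rounds):
        local_models = [local_sgd(w, X, y) for X, y in clients]
        w = np.average(local_models, axis=0, weights=sizes / sizes.sum())
    return w

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 5))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w_pre = w_true + 0.3 * rng.normal(size=5)   # stand-in for a pre-trained model
w_rand = rng.normal(size=5)                 # fresh random weights

def err(w):
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in clients]))
```

Running `fedavg` from `w_pre` typically reaches a given error level in fewer rounds than from `w_rand`, mirroring the qualitative finding above.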
Federated Learning under Distributed Concept Drift
Jothimurugesan, Ellango, Hsieh, Kevin, Wang, Jianyu, Joshi, Gauri, Gibbons, Phillip B.
Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). To the best of our knowledge, this work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation that use a single global model are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.
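The clustering algorithms described above build on local drift detection at each client. As a minimal illustration (not the paper's exact detection test), a client can flag a drift when its recent error rate rises well above a reference window; the window size and threshold here are hypothetical.

```python
import numpy as np

def detect_drift(errors, window=30, threshold=0.15):
    """Flag a drift when the mean error over the most recent window exceeds
    the mean over the preceding reference window by more than `threshold`."""
    if len(errors) < 2 * window:
        return False                      # not enough history yet
    recent = np.mean(errors[-window:])
    reference = np.mean(errors[-2 * window:-window])
    return bool(recent - reference > threshold)

# Stable error stream: no drift is flagged.
stable = [0.10] * 60
# Stream whose error jumps (e.g., the client's concept changed): drift flagged.
drifted = [0.10] * 30 + [0.40] * 30
```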
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
Wang, Jianyu, Bao, Bing-Kun, Xu, Changsheng
Video question answering is a challenging task that requires agents to understand rich video content and perform spatio-temporal reasoning. However, existing graph-based methods fail to perform multi-step reasoning well, neglecting two properties of VideoQA: (1) even for the same video, different questions may require different numbers of video clips or objects to infer the answer through relational reasoning; (2) during reasoning, appearance and motion features exhibit a complicated interdependence, being both correlated and complementary to each other. Based on these observations, we propose a Dual-Visual Graph Reasoning Unit (DualVGR) which reasons over videos in an end-to-end fashion. The first contribution of our DualVGR is the design of an explainable Query Punishment Module, which can filter out irrelevant visual features through multiple cycles of reasoning. The second contribution is the proposed Video-based Multi-view Graph Attention Network, which captures the relations between appearance and motion features. Our DualVGR network achieves state-of-the-art performance on the benchmark MSVD-QA and SVQA datasets, and demonstrates competitive results on the benchmark MSRVTT-QA dataset. Our code is available at https://github.com/MMIR/DualVGR-VideoQA.
Local Adaptivity in Federated Learning: Convergence and Consistency
Wang, Jianyu, Xu, Zheng, Garrett, Zachary, Charles, Zachary, Liu, Luyang, Joshi, Gauri
The federated learning (FL) framework trains a machine learning model using decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular optimization algorithms of FL use vanilla (stochastic) gradient descent for both local updates at clients and global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have been studied for server updates. However, the effect of using adaptive optimization methods for local updates at clients is not yet understood. We show in both theory and practice that while local adaptive methods can accelerate convergence, they can cause a non-vanishing solution bias, where the final converged solution may be different from the stationary point of the global objective function. We propose correction techniques to overcome this inconsistency and complement the local adaptive methods for FL. Extensive experiments on realistic federated training tasks show that the proposed algorithms can achieve faster convergence and higher test accuracy than the baselines without local adaptivity.
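The non-vanishing solution bias of local adaptive methods can be seen on a toy problem. The sketch below uses a sign-based local step as a crude stand-in for adaptive per-coordinate normalization (it is not any of this paper's algorithms or correction techniques): with two heterogeneous quadratic clients, plain local SGD approaches the global minimizer, while the sign-based local method stalls at a point determined entirely by its initialization.

```python
import numpy as np

# Two clients with quadratic objectives f_i(w) = 0.5 * a_i * (w - m_i)^2.
# The global objective's minimizer is the a_i-weighted mean of the m_i.
A = np.array([1.0, 3.0])
M = np.array([0.0, 1.0])
w_star = float((A * M).sum() / A.sum())   # = 0.75

def run_round(w, local_step, lr=0.01, tau=5):
    """One communication round: each client runs tau local steps from the
    global model, then the server averages the resulting local models."""
    finals = []
    for a, m in zip(A, M):
        wi = w
        for _ in range(tau):
            grad = a * (wi - m)
            wi -= lr * local_step(grad)
        finals.append(wi)
    return float(np.mean(finals))

w_sgd = w_sign = 0.5
for _ in range(2000):
    w_sgd = run_round(w_sgd, lambda g: g)             # vanilla local SGD
    w_sign = run_round(w_sign, lambda g: np.sign(g))  # "adaptive-style" sign step

# w_sgd ends up near w_star = 0.75, while w_sign remains stuck at its
# initialization 0.5: a non-vanishing bias no number of rounds removes.
```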
Client Selection in Federated Learning: Convergence Analysis and Power-of-Choice Selection Strategies
Cho, Yae Jee, Wang, Jianyu, Joshi, Gauri
Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Several works have analyzed the convergence of federated learning while accounting for data heterogeneity, communication and computation limitations, and partial client participation. However, they assume unbiased client participation, where clients are selected at random or in proportion to their data sizes. In this paper, we present the first convergence analysis of federated optimization for biased client selection strategies, and quantify how the selection bias affects convergence speed. We reveal that biasing client selection towards clients with higher local loss achieves faster error convergence. Using this insight, we propose Power-of-Choice, a communication- and computation-efficient client selection framework that can flexibly span the trade-off between convergence speed and solution bias. Our experiments demonstrate that Power-of-Choice strategies converge up to $3\times$ faster and give $10\%$ higher test accuracy than the baseline random selection.
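A minimal sketch of the selection step, under the assumption that the server samples a candidate set of $d$ clients proportionally to data size and then keeps the $m$ with the largest current local loss (names and parameters here are illustrative; see the paper for the exact variants):

```python
import numpy as np

def power_of_choice(local_losses, data_sizes, d, m, rng):
    """Sample a candidate set of d clients (proportional to data size,
    without replacement), then keep the m with the largest local loss."""
    n = len(local_losses)
    probs = np.asarray(data_sizes, dtype=float)
    probs /= probs.sum()
    candidates = rng.choice(n, size=d, replace=False, p=probs)
    ranked = sorted(candidates, key=lambda i: local_losses[i], reverse=True)
    return ranked[:m]

rng = np.random.default_rng(0)
# With d equal to the number of clients, selection is fully loss-greedy.
selected = power_of_choice([0.1, 0.9, 0.5, 0.3], [1, 1, 1, 1],
                           d=4, m=2, rng=rng)
```

The candidate-set size `d` is the knob spanning the trade-off above: larger `d` biases selection more strongly toward high-loss clients (faster convergence, more solution bias), while small `d` approaches unbiased random selection.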
Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization
Wang, Jianyu, Liu, Qinghua, Liang, Hao, Joshi, Gauri, Poor, H. Vincent
In federated optimization, heterogeneity in the clients' local datasets and computation speeds results in large variations in the number of local updates performed by each client in each communication round. Naive weighted aggregation of such models causes objective inconsistency: the global model converges to a stationary point of a mismatched objective function which can be arbitrarily different from the true objective. This paper provides a general framework to analyze the convergence of federated heterogeneous optimization algorithms. It subsumes previously proposed methods such as FedAvg and FedProx and provides the first principled understanding of the solution bias and the convergence slowdown due to objective inconsistency. Using insights from this analysis, we propose FedNova, a normalized averaging method that eliminates objective inconsistency while preserving fast error convergence.
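The normalized-averaging idea can be sketched as follows: each client's cumulative update is divided by its own number of local steps $\tau_i$ before averaging, and the averaged direction is rescaled by the effective step count. This is a simplified single-round illustration, not the paper's full algorithm:

```python
import numpy as np

def fednova_aggregate(w_global, client_models, taus, weights, lr=1.0):
    """FedNova-style normalized averaging (simplified): normalize each
    client's update by its local step count tau_i, average the normalized
    directions, and rescale by the effective number of steps."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # Normalized per-client update directions d_i = (w_global - w_i) / tau_i
    dirs = [(w_global - wi) / tau for wi, tau in zip(client_models, taus)]
    tau_eff = float(np.dot(weights, taus))
    update = sum(p * d for p, d in zip(weights, dirs))
    return w_global - lr * tau_eff * update

w_g = np.array([1.0, 1.0])
client_models = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
# With equal tau_i this reduces to plain FedAvg (the mean of client models).
new_w = fednova_aggregate(w_g, client_models, taus=[3, 3], weights=[0.5, 0.5])
```

With unequal step counts, the normalization removes the implicit over-weighting of clients that happened to run more local steps, which is what causes the objective inconsistency described above.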
Band-limited Soft Actor Critic Model
Campo, Miguel, Chen, Zhengxing, Kung, Luke, Virochsiri, Kittipat, Wang, Jianyu
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments. A key element of SAC networks is entropy regularization, which prevents the SAC actor from optimizing against fine-grained, oftentimes transient, features of the state-action value function. This results in better sample efficiency during early training. We take this idea one step further by artificially band-limiting the spatial resolution of the target critic through the addition of a convolutional filter. We derive the closed-form solution in the linear case and show that band-limiting reduces the interdependency between the low- and high-frequency components of the state-action value approximation, allowing the critic to learn faster. In experiments, the band-limited SAC outperformed the classic twin-critic SAC in a number of Gym environments and displayed more stability in returns. We derive novel insights about SAC by adding a stochastic noise disturbance, a technique that is increasingly being used to learn robust policies that transfer well to their real-world counterparts.
Deep topic modeling by multilayer bootstrap network and lasso
Wang, Jianyu, Zhang, Xiao-Lei
Topic modeling is originally formulated as a hierarchical generative model: a document is generated from a mixture of topics, and a word in the document is generated by first choosing a topic from a document-specific distribution, and then choosing the word from the topic-specific distribution. The main difficulty of topic modeling is the optimization problem, which is NP-hard in the worst case due to the intractability of the posterior inference. Existing methods aim to find approximate solutions to this difficult optimization problem, which falls into the framework of matrix factorization. Matrix-factorization-based topic modeling maps documents into a low-dimensional semantic space by decomposing the documents into a weighted combination of a set of topic distributions: $D \approx CW$, where $D(:,d)$ represents the $d$-th document, a column vector over a set of words with a vocabulary size of $v$; $C(:,g)$ denotes the $g$-th topic, a probability mass function over the vocabulary; and $W(g,d)$ denotes the probability of the $g$-th topic in the $d$-th document.
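One standard way to compute such a factorization is non-negative matrix factorization via multiplicative updates (Lee & Seung). The sketch below illustrates the $D \approx CW$ decomposition on a hypothetical toy corpus; it is an illustration of the matrix-factorization framework, not the multilayer bootstrap network method of this paper.

```python
import numpy as np

def nmf(D, k, iters=500, eps=1e-9, seed=0):
    """Non-negative matrix factorization via Lee-Seung multiplicative updates:
    factor the word-by-document matrix D (v x n) into topics C (v x k) and
    per-document topic weights W (k x n)."""
    rng = np.random.default_rng(seed)
    v, n = D.shape
    C = rng.random((v, k)) + eps
    W = rng.random((k, n)) + eps
    for _ in range(iters):
        W *= (C.T @ D) / (C.T @ C @ W + eps)
        C *= (D @ W.T) / (C @ W @ W.T + eps)
    return C, W

# Toy corpus: 6 documents over a 4-word vocabulary generated from 2 topics.
C_true = np.array([[0.6, 0.0],
                   [0.4, 0.0],
                   [0.0, 0.7],
                   [0.0, 0.3]])
W_true = np.array([[10.0, 8.0, 9.0, 0.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 10.0, 9.0, 8.0]])
D = C_true @ W_true
C, W = nmf(D, k=2)
rel_err = float(np.linalg.norm(D - C @ W) / np.linalg.norm(D))
```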
SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
Wang, Jianyu, Tantia, Vinayak, Ballas, Nicolas, Rabbat, Michael
Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF is a particular instance of the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF.
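The SlowMo outer loop can be sketched on a toy quadratic: each worker runs a few base-optimizer steps (here, noisy SGD), the workers' iterates are averaged, and the averaged displacement is treated as a pseudo-gradient for a slow momentum step. The exact scalings and hyperparameters below are illustrative, not the paper's settings.

```python
import numpy as np

def slowmo(x0, n_workers=4, outer_steps=50, tau=5,
           base_lr=0.1, slow_lr=1.0, beta=0.7, seed=0):
    """SlowMo-style loop minimizing 0.5*||x||^2 with noisy gradients."""
    rng = np.random.default_rng(seed)
    x = x0.copy()                  # slow (synchronized) weights
    u = np.zeros_like(x)           # slow momentum buffer
    for _ in range(outer_steps):
        finals = []
        for _ in range(n_workers):            # each worker: tau base steps
            xi = x.copy()
            for _ in range(tau):
                grad = xi + 0.05 * rng.normal(size=xi.shape)
                xi -= base_lr * grad
            finals.append(xi)
        x_avg = np.mean(finals, axis=0)       # periodic exact averaging
        # Slow momentum on the averaged displacement (a pseudo-gradient).
        u = beta * u + (x - x_avg) / base_lr
        x = x - slow_lr * base_lr * u
    return x

x_final = slowmo(np.ones(3))
```

With `beta=0` and `slow_lr=1` the outer step reduces to plain periodic averaging (Local SGD); the momentum buffer is what distinguishes the SlowMo family, and BMUF corresponds to one particular choice of these outer-loop parameters.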
MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling
Wang, Jianyu, Sahu, Anit Kumar, Yang, Zhouyi, Joshi, Gauri, Kar, Soummya
The trade-off between convergence error and communication delay in decentralized stochastic gradient descent (SGD) is dictated by the sparsity of the inter-worker communication graph. In this paper, we propose MATCHA, a decentralized SGD method that uses matching decomposition sampling of the base graph to parallelize inter-worker information exchange and thereby significantly reduce communication delay. At the same time, under standard assumptions and for any general topology, MATCHA maintains the same convergence rate in terms of epochs as the state of the art, despite the significant reduction in communication delay. Experiments on a suite of datasets and deep neural networks validate the theoretical analysis and demonstrate the effectiveness of the proposed scheme in reducing communication delay.
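The core mechanism can be sketched as follows: the communication graph's edges are decomposed into matchings (sets of vertex-disjoint edges that can be executed in parallel), and each iteration activates each matching independently with some probability, so workers only exchange over the activated links. The greedy decomposition and uniform mixing weights below are simplified illustrations, not the paper's exact optimized construction.

```python
import numpy as np

def greedy_matchings(edges):
    """Greedily split an edge list into matchings (vertex-disjoint edges)."""
    matchings = []
    remaining = list(edges)
    while remaining:
        used, matching, rest = set(), [], []
        for u, v in remaining:
            if u in used or v in used:
                rest.append((u, v))
            else:
                matching.append((u, v))
                used.update((u, v))
        matchings.append(matching)
        remaining = rest
    return matchings

def sampled_mixing(n, matchings, p, rng, alpha=0.5):
    """One iteration's mixing matrix, built from independently activated matchings."""
    W = np.eye(n)
    for matching in matchings:
        if rng.random() < p:                  # activate this matching
            for u, v in matching:
                W[u, u] -= alpha; W[u, v] += alpha
                W[v, v] -= alpha; W[v, u] += alpha
    return W

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # a 4-node ring
matchings = greedy_matchings(edges)           # two matchings of two edges each
W = sampled_mixing(4, matchings, p=1.0, rng=np.random.default_rng(0))
```

A smaller activation probability `p` yields sparser, cheaper communication per iteration at the cost of slower information mixing; choosing per-matching activation probabilities to balance that trade-off is the heart of the method.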