
Asymptotic and Non-Asymptotic Convergence Analysis of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis

Jin, Ruinan, Wang, Xiaoyu, Wang, Baoxiang

arXiv.org Machine Learning

Adaptive optimizers have emerged as powerful tools in deep learning, dynamically adjusting the learning rate based on iterative gradients. These adaptive methods have achieved significant success in various deep learning tasks, outperforming stochastic gradient descent (SGD). However, although AdaGrad is a cornerstone adaptive optimizer, its theoretical analysis remains inadequate with respect to both asymptotic convergence and non-asymptotic convergence rates for non-convex optimization. This study aims to provide a comprehensive analysis and a complete picture of AdaGrad. We first introduce a novel stopping-time technique from probability theory to establish stability for the norm version of AdaGrad under milder conditions. We then derive two forms of asymptotic convergence: almost sure and in mean square. Furthermore, under mild assumptions, we establish a near-optimal non-asymptotic convergence rate measured by the average squared gradient in expectation, a guarantee that is rarely explored and stronger than existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
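
As a concrete reference for the algorithm analyzed above, here is a minimal sketch of the norm version of AdaGrad, in which a single scalar accumulator of squared stochastic-gradient norms rescales a common step size. The test problem, step size, noise level, and iteration count are illustrative assumptions, not the paper's setup.

import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, eps=1e-8, n_steps=1000, seed=0):
    """Norm version of AdaGrad: one scalar accumulator of squared stochastic-gradient
    norms rescales a common step size (a sketch, not the paper's exact setting)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = 0.0  # accumulated squared stochastic-gradient norms
    for _ in range(n_steps):
        g = grad_fn(x, rng)
        v += float(g @ g)
        x -= eta / np.sqrt(v + eps) * g
    return x

# Illustrative non-convex objective f(x) = sum_i x_i^2 / (1 + x_i^2) with noisy gradients.
def noisy_grad(x, rng):
    return 2.0 * x / (1.0 + x**2) ** 2 + 0.1 * rng.standard_normal(x.shape)

print(np.linalg.norm(adagrad_norm(noisy_grad, x0=np.ones(10))))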


Distributed Gradient Descent for Functional Learning

Yu, Zhan, Fan, Jun, Zhou, Ding-Xuan

arXiv.org Artificial Intelligence

In recent years, distributed learning schemes of various types have received increasing attention for their strong advantages in handling large-scale data. To address the big-data challenges that have recently emerged in functional data analysis, we propose a novel distributed gradient descent functional learning (DGDFL) algorithm that processes functional data across numerous local machines (processors) in the framework of reproducing kernel Hilbert spaces. Based on integral-operator approaches, we provide the first theoretical understanding of the DGDFL algorithm from several aspects. As a first step toward understanding DGDFL, a data-based gradient descent functional learning (GDFL) algorithm associated with a single-machine model is proposed and comprehensively studied. Under mild conditions, confidence-based optimal learning rates of DGDFL are obtained without the saturation restriction on the regularity index suffered by previous works in functional regression. We further provide a semi-supervised DGDFL approach that weakens the restriction on the maximal number of local machines needed to ensure optimal rates. To the best of our knowledge, DGDFL provides the first distributed iterative training approach to functional learning and enriches the toolkit of functional data analysis.
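
The divide-and-conquer pattern behind DGDFL can be illustrated with a schematic kernel gradient descent in which each machine fits its local sample in dual (RKHS) form and a server averages the local estimators. The Gaussian kernel, synthetic curves, and step sizes below are assumptions for illustration only and do not reproduce the paper's integral-operator-based estimator or analysis.

import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    # A: (n, p), B: (m, p) -- functional covariates observed on a common grid
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def local_gd(X, y, step=0.1, n_iter=200, gamma=0.1):
    """Kernel gradient descent on one machine's local least-squares risk, in dual
    form: the local estimate is f(x) = sum_i alpha_i K(x, X_i)."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(y))
    for _ in range(n_iter):
        alpha -= step / len(y) * (K @ alpha - y)
    return alpha

def distributed_predict(machines, X_new, gamma=0.1):
    """Divide-and-conquer synthesis step: average the local estimators' predictions."""
    return np.mean([rbf_kernel(X_new, Xm, gamma) @ am for Xm, am in machines], axis=0)

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)
X = rng.standard_normal((200, 50)).cumsum(axis=1) / 7.0     # rough random curves on the grid
y = np.trapz(X * np.sin(np.pi * grid), grid, axis=1) + 0.05 * rng.standard_normal(200)
machines = [(Xm, local_gd(Xm, ym)) for Xm, ym in zip(np.split(X, 4), np.split(y, 4))]
print(distributed_predict(machines, X[:5]))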


Event-triggered privacy preserving consensus control with edge-based additive noise

Liang, Limei, Ding, Ruiqi, Liu, Shuai

arXiv.org Artificial Intelligence

In this article, we investigate the distributed privacy-preserving weighted consensus control problem for linear continuous-time multi-agent systems under an event-triggered communication mode. A novel event-triggered privacy-preserving consensus scheme is proposed, which can be divided into three phases. First, for each agent, an event-triggered mechanism is designed to determine whether the current state is transmitted to the corresponding neighbor agents, which avoids frequent real-time communication. Then, to protect the privacy of initial states from disclosure, edge-based, mutually independent standard white noise is added to each communication channel. Further, to attenuate the effect of noise on consensus control, we propose a stochastic-approximation-type protocol for each agent. Using tools from stochastic analysis and graph theory, the asymptotic properties and convergence accuracy of the consensus error are analyzed. Finally, a numerical simulation is given to illustrate the effectiveness of the proposed scheme.
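
A toy discrete-time simulation of the three ingredients described above (event-triggered transmission, edge-based additive noise, and a decaying stochastic-approximation gain) might look as follows; the graph, thresholds, and gains are illustrative assumptions rather than the article's continuous-time design.

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0, 1, 0, 1],   # adjacency matrix of an undirected ring on 4 agents
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
x = rng.uniform(-5, 5, size=4)       # private initial states
broadcast = x.copy()                  # last value each agent actually transmitted
sigma, delta0 = 0.5, 0.5              # channel noise level and triggering threshold

for k in range(2000):
    a_k = 1.0 / (k + 1)               # decaying stochastic-approximation gain
    # Event-triggered communication: transmit only when the state has drifted enough.
    trigger = np.abs(x - broadcast) > delta0 / np.sqrt(k + 1)
    broadcast = np.where(trigger, x, broadcast)
    # Each communication channel (edge) carries its own independent noise.
    noise = sigma * rng.standard_normal(A.shape)
    update = np.zeros_like(x)
    for i in range(4):
        for j in range(4):
            if A[i, j]:
                update[i] += (broadcast[j] + noise[i, j]) - x[i]
    x = x + a_k * update

print(x)   # the states should end up close to a common value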


The Geometry of Adversarial Training in Binary Classification

Bungert, Leon, Trillos, Nicolás García, Murray, Ryan

arXiv.org Machine Learning

We establish an equivalence between a family of adversarial training problems for non-parametric binary classification and a family of regularized risk minimization problems in which the regularizer is a nonlocal perimeter functional. The resulting regularized risk minimization problems admit exact convex relaxations of the type $L^1+$ (nonlocal) $\operatorname{TV}$, a form frequently studied in image analysis and graph-based learning. This reformulation reveals a rich geometric structure, which in turn allows us to establish a series of properties of optimal solutions of the original problem, including the existence of minimal and maximal solutions (interpreted in a suitable sense) and the existence of regular solutions (also interpreted in a suitable sense). In addition, we highlight how the connection between adversarial training and perimeter-minimization problems provides a novel, directly interpretable, statistical motivation for a family of regularized risk minimization problems involving perimeter/total variation. The majority of our theoretical results are independent of the distance used to define adversarial attacks.
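
For orientation, the structure described in this abstract can be written schematically as follows, where $A$ is the decision set of the classifier and $\varepsilon$ the adversarial budget; the precise definitions, measures, and constants are those of the paper, and the display below is only a sketch of the stated equivalence:
\[
\min_{A}\ \mathbb{E}_{(x,y)\sim\mu}\Big[\sup_{\|x'-x\|\le\varepsilon}\mathbf{1}_{y\neq\mathbf{1}_A(x')}\Big]
\quad\Longleftrightarrow\quad
\min_{A}\ R(A;\mu)+\varepsilon\,\operatorname{Per}_\varepsilon(A;\mu),
\]
with $R(A;\mu)$ the unperturbed classification risk and $\operatorname{Per}_\varepsilon$ a nonlocal perimeter functional.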


Collective Argumentation: The Case of Aggregating Support-Relations of Bipolar Argumentation Frameworks

Chen, Weiwei

arXiv.org Artificial Intelligence

In many real-life situations that involve exchanges of arguments, individuals may differ in their assessment of which supports between arguments are in fact justified, i.e., they put forward different support-relations. When confronted with such situations, we may wish to aggregate individuals' views on support-relations into a collective view that is acceptable to the group. In this paper, we work with bipolar argumentation frameworks and assume that individuals share a set of arguments and a set of attacks between arguments, but possibly hold different support-relations. Using the methodology of social choice theory, we analyze which semantic properties of bipolar argumentation frameworks can be preserved by aggregation rules when aggregating support-relations.
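
As a small illustration of the kind of aggregation rule studied in this setting, the sketch below applies a quota (majority) rule to individual support-relations; the rule, the arguments, and the profiles are hypothetical examples, not results from the paper.

def aggregate_supports(profiles, quota=None):
    """Quota-based aggregation of support-relations: a support pair (a, b) enters the
    collective relation iff at least `quota` individuals report it. With the default
    quota this is the strict-majority rule familiar from judgment aggregation."""
    if quota is None:
        quota = len(profiles) // 2 + 1
    counts = {}
    for support in profiles:            # each profile is a set of (supporter, supported) pairs
        for pair in support:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, c in counts.items() if c >= quota}

# Three agents over arguments {a, b, c} sharing the same attack relation but
# disagreeing on which supports hold.
profiles = [{("a", "b"), ("b", "c")},
            {("a", "b")},
            {("a", "b"), ("a", "c")}]
print(aggregate_supports(profiles))     # -> {('a', 'b')}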


Generalized Policy Elimination: an efficient algorithm for Nonparametric Contextual Bandits

Bibaut, Aurélien F., Chambaz, Antoine, van der Laan, Mark J.

arXiv.org Machine Learning

We propose the Generalized Policy Elimination (GPE) algorithm, an oracle-efficient contextual bandit (CB) algorithm inspired by the Policy Elimination algorithm of \cite{dudik2011}. We prove the first regret-optimality guarantee for an oracle-efficient CB algorithm competing against a nonparametric class with infinite VC dimension. Specifically, we show that GPE is regret-optimal (up to logarithmic factors) for policy classes with integrable entropy. For classes with larger entropy, we show that the core techniques used to analyze GPE can be used to design an $\varepsilon$-greedy algorithm whose regret bound matches that of the best algorithms to date. We illustrate the applicability of our algorithms and theorems with examples of large nonparametric policy classes for which the relevant optimization oracles can be implemented efficiently.
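
To make the $\varepsilon$-greedy variant concrete, here is a minimal sketch of an $\varepsilon$-greedy contextual bandit with a per-arm ridge-regression oracle; the linear oracle, exploration schedule, and synthetic environment are stand-ins assumed for illustration, not the paper's policy-class optimization oracle or its tuned schedule.

import numpy as np

class EpsGreedyCB:
    """Epsilon-greedy contextual bandit with a per-arm ridge-regression oracle."""
    def __init__(self, n_arms, d, lam=1.0, eps0=1.0):
        self.A = [lam * np.eye(d) for _ in range(n_arms)]   # X^T X + lam I per arm
        self.b = [np.zeros(d) for _ in range(n_arms)]        # X^T r per arm
        self.eps0, self.t = eps0, 0

    def act(self, x, rng):
        self.t += 1
        eps = min(1.0, self.eps0 * self.t ** (-1.0 / 3.0))   # decaying exploration rate
        if rng.random() < eps:
            return int(rng.integers(len(self.A)))
        scores = [x @ np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
        return int(np.argmax(scores))

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x

rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 5))       # 3 arms, 5-dimensional contexts
agent = EpsGreedyCB(n_arms=3, d=5)
total = 0.0
for _ in range(2000):
    x = rng.standard_normal(5)
    a = agent.act(x, rng)
    r = theta[a] @ x + 0.1 * rng.standard_normal()
    agent.update(x, a, r)
    total += r
print(total)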


On the Optimality of Gaussian Kernel Based Nonparametric Tests against Smooth Alternatives

Li, Tong, Yuan, Ming

arXiv.org Machine Learning

Nonparametric tests based on kernel embeddings of distributions have seen a great deal of practical success in recent years. However, the statistical properties of these tests are largely unknown beyond consistency against a fixed alternative. To fill this void, we study the asymptotic properties of goodness-of-fit, homogeneity, and independence tests using Gaussian kernels, arguably the most popular and successful such tests. Our results provide theoretical justification for this common practice by showing that tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax optimal against smooth alternatives in all three settings. In addition, our analysis pinpoints the importance of choosing a diverging scaling parameter when using Gaussian kernels and suggests a data-driven choice of the scaling parameter that yields tests optimal, up to an iterated logarithmic factor, over a wide range of smooth alternatives. Numerical experiments are presented to further demonstrate the practical merits of the methodology.
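
The homogeneity (two-sample) test referred to above is typically implemented through the Gaussian-kernel maximum mean discrepancy (MMD); the sketch below computes a biased MMD estimate and a permutation p-value. The specific scaling-parameter exponent is a placeholder assumption, not the paper's prescribed data-driven choice.

import numpy as np

def gaussian_gram(X, Y, nu):
    """Gaussian kernel with scaling parameter nu: k(x, y) = exp(-nu * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-nu * sq)

def mmd2(X, Y, nu):
    """Biased (V-statistic) estimate of the squared MMD between the two samples."""
    return (gaussian_gram(X, X, nu).mean()
            - 2.0 * gaussian_gram(X, Y, nu).mean()
            + gaussian_gram(Y, Y, nu).mean())

def permutation_pvalue(X, Y, nu, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, nu)
    Z, n = np.vstack([X, Y]), len(X)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        count += mmd2(Z[perm[:n]], Z[perm[n:]], nu) >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
Y = rng.standard_normal((100, 2)) + 0.5
nu = 100 ** (2.0 / 6.0)   # placeholder diverging-with-n scaling, not the paper's rate
print(permutation_pvalue(X, Y, nu))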


Non-parametric Sparse Additive Auto-regressive Network Models

Zhou, Hao Henry, Raskutti, Garvesh

arXiv.org Machine Learning

Consider a multivariate time series $(X_t)_{t=0}^{T}$ with $X_t \in \mathbb{R}^d$, which may represent spike-train responses of multiple neurons in a brain, crime event data across multiple regions, and many other phenomena. An important challenge associated with these time series models is estimating the influence network between the $d$ variables, especially when the number of variables $d$ is large, i.e., in the high-dimensional setting. Prior work has focused on parametric vector auto-regressive models; however, parametric approaches are somewhat restrictive in practice. In this paper, we use the non-parametric sparse additive model (SpAM) framework to address this challenge. Using a combination of $\beta$- and $\phi$-mixing properties of Markov chains and empirical-process techniques for reproducing kernel Hilbert spaces (RKHSs), we provide upper bounds on the mean-squared error in terms of the sparsity $s$, the logarithm of the dimension $\log d$, the number of time points $T$, and the smoothness of the RKHSs. Our rates are sharp up to logarithmic factors in many cases. We also provide numerical experiments that support our theoretical results and display the potential advantages of using our non-parametric SpAM framework on a Chicago crime dataset.
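
A minimal sketch of the sparse additive idea, assuming a polynomial basis in place of a general RKHS and a group-lasso penalty solved by proximal gradient descent, is given below; the ground-truth process, basis, and tuning constants are illustrative assumptions, not the estimator analyzed in the paper.

import numpy as np

def basis(u, k=4):
    """Polynomial basis expansion of a scalar lagged covariate (illustrative stand-in
    for a general RKHS feature map)."""
    return np.stack([u ** j for j in range(1, k + 1)], axis=-1)

def fit_sparse_additive(X_lag, y, lam=0.1, step=0.01, n_iter=500, k=4):
    """Proximal gradient descent for a group-lasso penalized additive model:
    y_t ~ sum_j f_j(X_lag[t, j]), with f_j in a k-dimensional basis, one group per lag."""
    T, d = X_lag.shape
    Phi = basis(X_lag, k).reshape(T, d * k)            # (T, d*k) design, groups of size k
    W = np.zeros(d * k)
    for _ in range(n_iter):
        W -= step * Phi.T @ (Phi @ W - y) / T          # gradient step on the squared loss
        for j in range(d):                             # group soft-thresholding step
            g = W[j * k:(j + 1) * k]
            norm = np.linalg.norm(g)
            W[j * k:(j + 1) * k] = max(0.0, 1 - step * lam / (norm + 1e-12)) * g
    return W.reshape(d, k)

rng = np.random.default_rng(0)
T, d = 400, 10
X = np.zeros((T + 1, d))
for t in range(T):                                     # sparse nonlinear AR(1) ground truth
    X[t + 1] = 0.3 * np.tanh(X[t][[1, 0, 0, 3, 2, 5, 4, 7, 6, 8]]) + 0.2 * rng.standard_normal(d)
coef = fit_sparse_additive(X[:-1], X[1:, 0])           # model the first coordinate
print(np.linalg.norm(coef, axis=1))                    # lags with little influence are shrunk toward zero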


Transfer Learning with Label Noise

Yu, Xiyu, Liu, Tongliang, Gong, Mingming, Zhang, Kun, Tao, Dacheng

arXiv.org Machine Learning

Transfer learning aims to improve learning in a target domain with limited training data by borrowing knowledge from a related but different source domain with sufficient labeled data. To reduce the distribution shift between source and target domains, recent methods have focused on learning invariant representations that have similar distributions across domains. However, existing methods assume that the labels in the source domain are uncontaminated, while in reality we often only have access to a source domain with noisy labels. In this paper, we first analyze the effects of label noise in various transfer learning scenarios in which the data distribution is assumed to change in different ways. We find that although label noise has no effect on invariant representation learning in the covariate-shift scenario, it has adverse effects on the learning process in the more general target/conditional-shift scenarios. To address this problem, we propose a new transfer learning method that learns invariant representations in the presence of label noise while simultaneously estimating the label distributions in the target domain. Experimental results on both synthetic and real-world data verify the effectiveness of the proposed method.
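
The abstract mentions estimating target-domain label distributions; one standard way to do this under label shift is black-box shift estimation (Lipton et al., 2018), sketched below. It is named here explicitly because it is not necessarily the estimator proposed in this paper; the toy data are assumptions for illustration.

import numpy as np

def estimate_target_label_dist(preds_source, labels_source, preds_target, n_classes):
    """Black-box shift estimation: solve C @ w = mu_target for the importance weights
    w(y) = p_t(y) / p_s(y), where C is the joint confusion matrix of a fixed classifier
    on the source domain and mu_target is its mean prediction on the (unlabeled)
    target domain; then p_t(y) = w(y) * p_s(y)."""
    C = np.zeros((n_classes, n_classes))
    for p, y in zip(preds_source, labels_source):
        C[p, y] += 1.0 / len(labels_source)
    mu = np.bincount(preds_target, minlength=n_classes) / len(preds_target)
    pi_source = np.bincount(labels_source, minlength=n_classes) / len(labels_source)
    w = np.linalg.solve(C, mu)
    pi_target = np.clip(w * pi_source, 0.0, None)
    return pi_target / pi_target.sum()

# Toy check: with a perfectly accurate classifier the target proportions are recovered.
rng = np.random.default_rng(0)
y_s = rng.integers(0, 3, size=1000)
y_t = rng.choice(3, size=1000, p=[0.6, 0.3, 0.1])
print(estimate_target_label_dist(y_s, y_s, y_t, n_classes=3))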