Mathematical & Statistical Methods
Efficient Learning in Non-Stationary Linear Markov Decision Processes
Touati, Ahmed, Vincent, Pascal
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e, both the reward and transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI an optimistic model-free algorithm based on weighted least squares value iteration which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upped bounded by $\widetilde{\mathcal{O}}(d^{7/6}H^2 \Delta^{1/3} K^{2/3})$ where $d$ is the dimension of the feature space, $H$ is the planning horizon, $K$ is the number of episodes and $\Delta$ is a suitable measure of non-stationarity of the MDP. This is the first regret bound for non-stationary reinforcement learning with linear function approximation.
Convergence and Stability of Graph Convolutional Networks on Large Random Graphs
Keriven, Nicolas, Bietti, Alberto, Vaiter, Samuel
We study properties of Graph Convolutional Networks (GCNs) by analyzing their behavior on standard models of random graphs, where nodes are represented by random latent variables and edges are drawn according to a similarity kernel. This allows us to overcome the difficulties of dealing with discrete notions such as isomorphisms on very large graphs, by considering instead more natural geometric aspects. We first study the convergence of GCNs to their continuous counterpart as the number of nodes grows. Our results are fully non-asymptotic and are valid for relatively sparse graphs with an average degree that grows logarithmically with the number of nodes. We then analyze the stability of GCNs to small deformations of the random graph model. In contrast to previous studies of stability in discrete settings, our continuous setup allows us to provide more intuitive deformation-based metrics for understanding stability, which have proven useful for explaining the success of convolutional representations on Euclidean domains.
Kernel Smoothing, Mean Shift, and Their Learning Theory with Directional Data
Directional data consist of observations distributed on a (hyper)sphere, and appear in many applied fields, such as astronomy, ecology, and environmental science. This paper studies both statistical and computational problems of kernel smoothing for directional data. We generalize the classical mean shift algorithm to directional data, which allows us to identify local modes of the directional kernel density estimator (KDE). The statistical convergence rates of the directional KDE and its derivatives are derived, and the problem of mode estimation is examined. We also prove the ascending property of our directional mean shift algorithm and investigate a general problem of gradient ascent on the unit hypersphere. To demonstrate the applicability of our proposed algorithm, we evaluate it as a mode clustering method on both simulated and real-world datasets.
Master Linear Algebra for Data Science & Machine Learning DL
Then, this course is for you. The Common mistake by a data scientist is Applying the tools without the intuition of how it works and behaves. Having the solid foundation of mathematics will help you to understand how each algorithm work, its limitations and its underlying assumptions. With this, you will have an edge over your peers and makes you more confident in all the applications of Machine Learning, Data Science, and Deep Learning. As a common saying: It always pays to know the machinery under the hood, rather than being a guy who is just behind the wheel with no knowledge about the car.
On Random Subset Generalization Error Bounds and the Stochastic Gradient Langevin Dynamics Algorithm
Rodrรญguez-Gรกlvez, Borja, Bassi, Germรกn, Thobaben, Ragnar, Skoglund, Mikael
In this work, we unify several expected generalization error bounds based on random subsets using the framework developed by Hellstr\"om and Durisi [1]. First, we recover the bounds based on the individual sample mutual information from Bu et al. [2] and on a random subset of the dataset from Negrea et al. [3]. Then, we introduce their new, analogous bounds in the randomized subsample setting from Steinke and Zakynthinou [4], and we identify some limitations of the framework. Finally, we extend the bounds from Haghifam et al. [5] for Langevin dynamics to stochastic gradient Langevin dynamics and we refine them for loss functions with potentially large gradient norms.
Characterizing Deep Gaussian Processes via Nonlinear Recurrence Systems
Recent advances in Deep Gaussian Processes (DGPs) show the potential to have more expressive representation than that of traditional Gaussian Processes (GPs). However, there exists a pathology of deep Gaussian processes that their learning capacities reduce significantly when the number of layers increases. In this paper, we present a new analysis in DGPs by studying its corresponding nonlinear dynamic systems to explain the issue. Existing work reports the pathology for the squared exponential kernel function. We extend our investigation to four types of common stationary kernel functions. The recurrence relations between layers are analytically derived, providing a tighter bound and the rate of convergence of the dynamic systems. We demonstrate our finding with a number of experimental results.
A Kernel Two-Sample Test for Functional Data
Wynne, George, Duncan, Andrew B.
Nonparametric two-sample tests for equality of distributions are widely studied in statistics, driven by applications in goodness-of-fit tests, anomaly and change-point detection and clustering. Classical examples of such tests include the Kolmogorov-Smirnov test [41, 69, 62] and Wald-Wolfowitz runs test [84] with subsequent multivariate extensions [25]. Due to advances in the ability to collect large amounts of real time or spatially distributed data there is a need to develop statistical methods appropriate for functional data, where each data sample is a discretised function. Such data has been studied for decades in the Functional Data Analysis (FDA) literature [32, 35] particularly in the context of analysing populations of time series, or in statistical shape analysis [45]. More recently, due to this modern abundance of functional data, increased study has been made in the machine learning literature for algorithms suited to such data [7, 15, 37, 12, 88].
Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling
Zou, Difan, Xu, Pan, Gu, Quanquan
We establish a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave. At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain. Under certain conditions on the target distribution, we prove that $\tilde O(d^4\epsilon^{-2})$ stochastic gradient evaluations suffice to guarantee $\epsilon$-sampling error in terms of the total variation distance, where $d$ is the problem dimension, which improves existing results on the convergence rate of SGLD (Raginsky et al., 2017; Xu et al., 2018). We further show that provided an additional Hessian Lipschitz condition on the log-density function, SGLD is guaranteed to achieve $\epsilon$-sampling error within $\tilde O(d^{15/4}\epsilon^{-3/2})$ stochastic gradient evaluations. Our proof technique provides a new way to study the convergence of Langevin based algorithms, and sheds some light on the design of fast stochastic gradient based sampling algorithms.
IBM Machine Learning
Offered by IBM. Machine Learning is one of the most in-demand skills for jobs related to modern AI applications, a field in which hiring has grown 74% annually for the last four years (LinkedIn). This Professional Certificate from IBM is intended for anyone interested in developing skills and experience to pursue a career in Machine Learning and leverage the main types of Machine Learning: Unsupervised Learning, Supervised Learning, Deep Learning, and Reinforcement Learning. It also complements your learning with special topics, including Time Series Analysis and Survival Analysis. This program consists of 6 courses providing you with solid theoretical understanding and considerable practice of the main algorithms, uses, and best practices related to Machine Learning . You will follow along and code your own projects using some of the most relevant open source frameworks and libraries. Although it is recommended that you have some background in Python programming, statistics, and linear algebra, this intermediate series is suitable for anyone who has some computer skills, interest in leveraging data, and a passion for self-learning. We start small, provide a solid theoretical background and code-along labs and demos, and build up to more complex topics. In addition to earning a Professional Certificate from Coursera, you will also receive a digital Badge from IBM recognizing your proficiency in Machine Learning.
Joint Inference of Multiple Graphs from Matrix Polynomials
Navarro, Madeline, Wang, Yuhao, Marques, Antonio G., Uhler, Caroline, Segarra, Santiago
Inferring graph structure from observations on the nodes is an important and popular network science task. Departing from the more common inference of a single graph and motivated by social and biological networks, we study the problem of jointly inferring multiple graphs from the observation of signals at their nodes (graph signals), which are assumed to be stationary in the sought graphs. From a mathematical point of view, graph stationarity implies that the mapping between the covariance of the signals and the sparse matrix representing the underlying graph is given by a matrix polynomial. A prominent example is that of Markov random fields, where the inverse of the covariance yields the sparse matrix of interest. From a modeling perspective, stationary graph signals can be used to model linear network processes evolving on a set of (not necessarily known) networks. Leveraging that matrix polynomials commute, a convex optimization method along with sufficient conditions that guarantee the recovery of the true graphs are provided when perfect covariance information is available. Particularly important from an empirical viewpoint, we provide high-probability bounds on the recovery error as a function of the number of signals observed and other key problem parameters. Numerical experiments using synthetic and real-world data demonstrate the effectiveness of the proposed method with perfect covariance information as well as its robustness in the noisy regime.