Mathematical & Statistical Methods


Delayed rejection Hamiltonian Monte Carlo for sampling multiscale distributions

arXiv.org Machine Learning

The efficiency of Hamiltonian Monte Carlo (HMC) can suffer when sampling a distribution with a wide range of length scales, because the small step sizes needed for stability in high-curvature regions are inefficient elsewhere. To address this we present a delayed rejection variant: if an initial HMC trajectory is rejected, we make one or more subsequent proposals each using a step size geometrically smaller than the last. We extend the standard delayed rejection framework by allowing the probability of a retry to depend on the probability of accepting the previous proposal. We test the scheme in several sampling tasks, including multiscale model distributions such as Neal's funnel, and statistical applications. Delayed rejection enables up to five-fold performance gains over optimally-tuned HMC, as measured by effective sample size per gradient evaluation. Even for simpler distributions, delayed rejection provides increased robustness to step size misspecification. Along the way, we provide an accessible but rigorous review of detailed balance for HMC. Keywords: delayed rejection, Hamiltonian Monte Carlo, detailed balance, multiscale.
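
The step-size problem described above can be illustrated with a toy experiment. The sketch below is not the paper's delayed rejection scheme: it only runs a leapfrog integrator on a one-dimensional Gaussian potential and finds the first step size in a geometric schedule whose energy error stays below a tolerance. The function names, the start point, the trajectory length, the reduction factor of 2 and the tolerance are all assumptions made for illustration, and the deterministic tolerance check stands in for a real Metropolis accept/reject step.

```python
def leapfrog_energy_error(sigma, eps, traj_len=1.0):
    """Return H(end) - H(start) for leapfrog on U(q) = q^2 / (2 sigma^2)."""
    n_steps = max(1, round(traj_len / eps))   # keep trajectory length fixed
    q, p = sigma, 1.0                         # arbitrary illustrative start point

    def grad(x):
        return x / sigma ** 2                 # dU/dq

    h_start = 0.5 * p * p + 0.5 * (q / sigma) ** 2
    p -= 0.5 * eps * grad(q)                  # initial half kick
    for step in range(n_steps):
        q += eps * p                          # full drift
        if step < n_steps - 1:
            p -= eps * grad(q)                # full kick between drifts
    p -= 0.5 * eps * grad(q)                  # final half kick
    h_end = 0.5 * p * p + 0.5 * (q / sigma) ** 2
    return h_end - h_start


def first_stable_retry(sigma, eps0, factor=2.0, max_tries=6, tol=1.0):
    """Index k of the first step size eps0 / factor**k with |energy error| < tol."""
    for k in range(max_tries):
        eps = eps0 / factor ** k              # geometrically smaller step size
        if abs(leapfrog_energy_error(sigma, eps)) < tol:
            return k
    return None                               # no stable step size within max_tries


if __name__ == "__main__":
    for sigma in (1.0, 0.1):
        print(sigma, first_stable_retry(sigma, eps0=0.5))
```

In this toy setting a step size that is adequate for the wide scale (sigma = 1.0) is unstable for the narrow one (sigma = 0.1), which is the situation where retrying with a smaller step after a rejection is expected to help.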


Emergence and algorithmic information dynamics of systems and observers

arXiv.org Artificial Intelligence

Previous work has shown that perturbation analysis in software space can produce candidate computable generative models and uncover possible causal properties from the finite description of an object or system, quantifying the algorithmic contribution of each of its elements relative to the whole. One of the challenges in defining emergence is that one observer's prior knowledge may cause a phenomenon to appear emergent to that observer while appearing reducible to another. By formalising the act of observing as mutual perturbations between dynamical systems, we demonstrate that the emergence of algorithmic information does depend on the observer's formal knowledge, while being robust to other subjective factors, particularly: the choice of the programming language and the measurement method; errors or distortions during the information acquisition; and the informational cost of processing. This is called observer-dependent emergence (ODE). In addition, we demonstrate that the unbounded and fast increase of emergent algorithmic information implies asymptotically observer-independent emergence (AOIE). Unlike ODE, AOIE is a type of emergence for which emergent phenomena will remain emergent for every formal theory that any observer might devise. We demonstrate the existence of an evolutionary model that displays the diachronic variant of AOIE and a network model that displays the holistic variant of AOIE. Our results show that, restricted to the context of finite discrete deterministic dynamical systems, computable systems, and irreducible information content measures, AOIE is the strongest form of emergence that formal theories can attain.


Extracting stochastic dynamical systems with α-stable Lévy noise from data

arXiv.org Machine Learning

From this point of view, dynamical modeling requires a deep understanding of the process to be analyzed. The essence of model abstraction is an approximation to the observed reality, which is usually represented by a system composed of ordinary or partial differential equations, deterministic or stochastic differential equations, and control equations. Although mathematical models are accurate for many processes, it is particularly difficult to develop such models for some of the most challenging systems, including climate dynamics, brain dynamics, biological systems and financial markets. Fortunately, more and more data have been observed or measured in recent years with the development of scientific tools and simulation capabilities. Therefore, a large number of data-driven methods have been proposed to discover the governing laws of systems from data. For instance, several researchers designed the Sparse Identification of Nonlinear Dynamics approach to extract deterministic ordinary [5] or partial [15, 29, 31] differential equations from available path data.


5 Best Online Biostatistics Programs and Courses

#artificialintelligence

Are you looking for the best online biostatistics programs and courses? If so, your search ends here. In this article, I am going to share the 5 best online biostatistics programs and courses with you. So, give this article a few minutes and find the best online biostatistics program for you. The goal of biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.


Sinkhorn Distributionally Robust Optimization

arXiv.org Machine Learning

Decision-making problems under uncertainty have broad applications in operations research, machine learning, engineering, and economics. When the data involves uncertainty due to measurement error, insufficient sample size, contamination, and anomalies, or model misspecification, distributionally robust optimization (DRO) is a promising approach to data-driven optimization, by seeking a minimax robust optimal decision that minimizes the expected loss under the most adverse distribution within a given set of relevant distributions, called the ambiguity set. It provides a principled framework to produce a solution with more promising out-of-sample performance than the traditional sample average approximation (SAA) method for stochastic programming [86]. We refer to [81] for a recent survey on DRO. At the core of DRO is the choice of the ambiguity set. Ideally, a good ambiguity set should take account of the properties of practical applications while maintaining the computational tractability of the resulting DRO formulation; and it should be rich enough to contain all distributions relevant to the decision-making but, at the same time, should not include unnecessary distributions that lead to overly conservative decisions. Various DRO formulations have been proposed in the literature. Among them, the ambiguity set based on the Wasserstein distance has recently received much attention [104, 67, 17, 46]. The Wasserstein distance incorporates the geometry of the sample space, and is thereby suitable for comparing distributions with non-overlapping supports and hedging against data perturbations [46].
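
The remark about non-overlapping supports can be made concrete with a small sketch. It assumes two one-dimensional empirical distributions with the same number of equally weighted samples, in which case the 1-Wasserstein distance reduces to the mean absolute difference between sorted samples. The function name `wasserstein_1d` is a hypothetical helper, not something taken from the paper.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal transport plan matches sorted samples.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    base = [0.0, 1.0]
    # The supports never overlap, yet the distance still grows with the shift.
    print(wasserstein_1d(base, [10.0, 11.0]))
    print(wasserstein_1d(base, [100.0, 101.0]))
```

Because the distance keeps tracking how far the mass has moved even when the supports are disjoint, it remains informative in cases where divergences that require overlapping supports would not be.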


Adaptive Sampling Quasi-Newton Methods for Zeroth-Order Stochastic Optimization

arXiv.org Artificial Intelligence

Several methods have been proposed to solve such derivative-free stochastic optimization problems, and we refer the reader to [3, 38] for surveys of these methods. A popular class of these methods estimates the gradients using function values and employs standard gradient-based optimization methods using these estimators. Quasi-Newton methods are recognized as among the most powerful methods for solving deterministic optimization problems. These methods build quadratic models of the objective function using only gradient information. Recently, researchers have been adapting these methods for stochastic settings when the gradient information is available. The empirical results in [15] indicate that a careful implementation of these methods can be efficient compared with the popular stochastic gradient methods. We adapt these methods to make them suitable for situations where the gradients are estimated using function values. We propose finite-difference derivative-free stochastic quasi-Newton methods for solving (1) by exploiting common random number (CRN) evaluations of f.
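
A minimal sketch of why common random numbers matter for finite differences is given below. It is an illustration under stated assumptions, not the paper's method: the objective `noisy_objective` (a quadratic plus additive Gaussian noise), the seeds and the step `h` are hypothetical. With CRN the same seed is used for both function evaluations, so the noise cancels in the difference; with independent seeds the noise is divided by `h` and swamps the estimate.

```python
import random


def noisy_objective(x, seed):
    """Hypothetical stochastic objective: sum of squares plus seeded noise."""
    rng = random.Random(seed)
    return sum(xi * xi for xi in x) + rng.gauss(0.0, 1.0)


def fd_gradient(x, seed_base, seed_shift, h=1e-4):
    """Forward-difference gradient estimate from function values only.

    seed_base is used at x and seed_shift at the perturbed points;
    passing the same seed for both gives the CRN estimator.
    """
    f0 = noisy_objective(x, seed_base)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((noisy_objective(shifted, seed_shift) - f0) / h)
    return grad


def max_error(estimate, truth):
    return max(abs(e - t) for e, t in zip(estimate, truth))


if __name__ == "__main__":
    x = [1.0, -2.0]
    true_grad = [2.0 * xi for xi in x]
    print("CRN error:        ", max_error(fd_gradient(x, 7, 7), true_grad))
    print("independent error:", max_error(fd_gradient(x, 7, 8), true_grad))
```

The additive-noise objective is the simplest case, where CRN removes the noise exactly; for more general noise the cancellation is only partial.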


Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics

arXiv.org Artificial Intelligence

In this paper, we characterize the noise of stochastic gradients and analyze the noise-induced dynamics during the training of deep neural networks by gradient-based optimizers. Specifically, we first show that the stochastic gradient noise possesses finite variance, and therefore the classical Central Limit Theorem (CLT) applies; this indicates that the gradient noise is asymptotically Gaussian. Such an asymptotic result validates the widely accepted assumption of Gaussian noise. We clarify that the recently observed phenomenon of heavy tails within gradient noise may not be an intrinsic property, but the consequence of insufficient mini-batch size; the gradient noise, which is a sum of a limited number of i.i.d. random variables, has not reached the asymptotic regime of the CLT, and thus deviates from Gaussian. We quantitatively measure the goodness of the Gaussian approximation of the noise, which supports our conclusion. Secondly, we analyze the noise-induced dynamics of stochastic gradient descent using the Langevin equation, granting the momentum hyperparameter in the optimizer a physical interpretation. We then proceed to demonstrate the existence of the steady-state distribution of stochastic gradient descent and approximate the distribution at a small learning rate.
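
The mini-batch argument can be sketched with synthetic data. The example below does not use real gradients: it assumes per-sample noise drawn from a Laplace distribution (finite variance, heavier tails than a Gaussian) and measures the excess kurtosis of the mini-batch mean for different batch sizes. The function names, batch sizes and sample counts are illustrative choices.

```python
import random


def excess_kurtosis(values):
    """Sample excess kurtosis; approximately 0 for Gaussian data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def minibatch_noise(batch_size, n_batches, rng):
    """Mini-batch means of synthetic per-sample noise (Laplace distributed)."""
    means = []
    for _ in range(n_batches):
        total = 0.0
        for _ in range(batch_size):
            # The difference of two i.i.d. exponentials is Laplace distributed.
            total += rng.expovariate(1.0) - rng.expovariate(1.0)
        means.append(total / batch_size)
    return means


if __name__ == "__main__":
    rng = random.Random(0)
    for batch_size in (1, 4, 16, 64):
        noise = minibatch_noise(batch_size, 5000, rng)
        print(batch_size, round(excess_kurtosis(noise), 3))
```

Under these assumptions the excess kurtosis shrinks toward zero as the batch size grows, which is the CLT behaviour the abstract appeals to; small batches retain visibly heavier tails.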


How Machine Learning Leverages Linear Algebra to Solve Data Problems - KDnuggets

#artificialintelligence

Computers only understand numbers, and these numbers need to be represented and processed in a way that enables machines to solve problems by learning from data instead of following predefined instructions, as in conventional programming. All types of programming use mathematics at some level, and machine learning is programming with data to learn the function that best describes that data. The problem (or process) of finding the best parameters of a function using data is called model training in ML. Therefore, in a nutshell, machine learning is programming to optimize for the best possible solution, and we need math to understand how that problem is solved. The first step towards learning math for ML is linear algebra. Linear algebra is the mathematical foundation that solves the problem of representing data as well as computations in machine learning models.
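
As a small illustration of "finding the best parameters of a function using data", the sketch below fits a straight line by least squares. It is one possible example rather than anything prescribed by the article: the function name `fit_line` and the toy data are assumptions. The parameters come from the 2x2 normal equations (X^T X) w = X^T y, solved here in closed form.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept via the 2x2 normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # determinant of X^T X
    if det == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1
    print(fit_line(xs, ys))
```

Here the data are represented as vectors and the "training" step is a linear solve, which is the sense in which linear algebra underlies both the representation and the computation.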


Low-rank statistical finite elements for scalable model-data synthesis

arXiv.org Machine Learning

Statistical learning additions to physically derived mathematical models are gaining traction in the literature. A recent approach has been to augment the underlying physics of the governing equations with data-driven Bayesian statistical methodology. Coined statFEM, the method acknowledges a priori model misspecification by embedding stochastic forcing within the governing equations. Upon receipt of additional data, the posterior distribution of the discretised finite element solution is updated using classical Bayesian filtering techniques. The resultant posterior jointly quantifies uncertainty associated with the ubiquitous problem of model misspecification and the data intended to represent the true process of interest. Despite this appeal, computational scalability is a challenge to statFEM's application to high-dimensional problems typically experienced in physical and industrial contexts. This article overcomes this hurdle by embedding a low-rank approximation of the underlying dense covariance matrix, obtained from the leading order modes of the full-rank alternative. Demonstrated on a series of reaction-diffusion problems of increasing dimension, using experimental and simulated data, the method reconstructs the sparsely observed data-generating processes with minimal loss of information, in both the posterior mean and the variance, paving the way for further integration of physical and probabilistic approaches to complex systems.


Higher Order Kernel Mean Embeddings to Capture Filtrations of Stochastic Processes

arXiv.org Machine Learning

Stochastic processes are random variables with values in some space of paths. However, reducing a stochastic process to a path-valued random variable ignores its filtration, i.e. the flow of information carried by the process through time. By conditioning the process on its filtration, we introduce a family of higher order kernel mean embeddings (KMEs) that generalizes the notion of KME and captures additional information related to the filtration. We derive empirical estimators for the associated higher order maximum mean discrepancies (MMDs) and prove consistency. We then construct a filtration-sensitive kernel two-sample test able to pick up information that gets missed by the standard MMD test. In addition, leveraging our higher order MMDs, we construct a family of universal kernels on stochastic processes that allows one to solve real-world calibration and optimal stopping problems in quantitative finance (such as the pricing of American options) via classical kernel-based regression methods. Finally, adapting existing tests for conditional independence to the case of stochastic processes, we design a causal-discovery algorithm to recover the causal graph of structural dependencies among interacting bodies solely from observations of their multidimensional trajectories.
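
For orientation, the sketch below computes the standard (first-order) MMD that the abstract uses as its baseline; it is not the higher order estimator introduced in the paper. It assumes each path is discretised into a fixed-length list of floats and uses a Gaussian kernel with a hypothetical bandwidth of 1.0; the biased (V-statistic) form of the estimator is used for simplicity.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two discretised paths of equal length."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples of paths."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    paths = [[0.0, 0.1], [0.2, 0.0]]
    shifted = [[3.0, 3.1], [3.2, 3.0]]
    print(mmd_squared(paths, paths))     # identical samples
    print(mmd_squared(paths, shifted))   # clearly different samples
```

This estimator treats each path as a plain vector, so it compares the laws of the paths only; information carried by the filtration is exactly what it cannot see.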