Mukherjee, Sayan
Approximately Optimal Search on a Higher-dimensional Sliding Puzzle
Merleau, Nono SC, O'Malley, Miguel, Roldán, Érika, Mukherjee, Sayan
Higher-dimensional sliding puzzles are constructed on the vertices of a $d$-dimensional hypercube, where $2^d-l$ vertices are distinctly coloured. Rings with the same colours are initially set randomly on the vertices of the hypercube. The goal of the puzzle is to move each of the $2^d-l$ rings to pre-defined target vertices on the cube. In this setting, the $k$-rule constraint generalises edge collision for the movement of colours between vertices, allowing a ring to move only when a hypercube face of dimension $k$ containing it is completely free of other rings. Starting from an initial configuration, what is the minimum number of moves needed to make the ring colours match the vertex colours? An algorithm that provides such a number is called God's algorithm. When such an algorithm exists, it does not have polynomial time complexity, at least in the case of the 15-puzzle, which corresponds to $k=1$ in the cubical puzzle. This paper presents a comprehensive computational study of different scenarios of the higher-dimensional puzzle. We benchmark three computational techniques: an exact algorithm (A* search) and two approximately optimal search techniques, an evolutionary algorithm (EA) and reinforcement learning (RL). The experiments show that all three methods can successfully solve the puzzle of dimension three for different face dimensions and across various difficulty levels. When the dimension increases, the A* search fails, while the RL and EA methods can still provide a generally acceptable solution, i.e., a distribution of the number of moves with a median value of less than $30$. Overall, the EA method consistently requires less computational time, while in most cases failing to minimise the number of moves for puzzle dimensions $d=4$ and $d=5$.
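As an illustration of the exact approach, the following is a minimal A* sketch for the $k=1$ rule (a ring may slide along an edge to an adjacent empty vertex), using the sum of Hamming distances to the target vertices as an admissible heuristic. The state encoding, function names, and the tiny $d=3$ example are our own assumptions, not the paper's implementation.

```python
# Minimal A* sketch for the d-dimensional hypercube puzzle under the k = 1 rule.
# A state is a tuple of length 2^d whose i-th entry is the ring label at vertex i,
# or None if the vertex is empty.
import heapq
import itertools

def neighbours(state, d):
    """Yield states reachable by sliding one ring along one edge (k = 1 moves)."""
    for v, ring in enumerate(state):
        if ring is None:
            continue
        for bit in range(d):
            w = v ^ (1 << bit)           # adjacent vertex across coordinate `bit`
            if state[w] is None:         # the 1-face (edge) must be otherwise empty
                nxt = list(state)
                nxt[v], nxt[w] = None, ring
                yield tuple(nxt)

def heuristic(state, target_of):
    """Admissible lower bound: sum of Hamming distances to each ring's target vertex."""
    return sum(bin(v ^ target_of[ring]).count("1")
               for v, ring in enumerate(state) if ring is not None)

def solve_astar(start, target_of, d):
    """Return the minimum number of k = 1 moves needed, or None if unreachable."""
    tie = itertools.count()              # tie-breaker so states are never compared directly
    frontier = [(heuristic(start, target_of), 0, next(tie), start)]
    best = {start: 0}
    while frontier:
        _, g, _, state = heapq.heappop(frontier)
        if heuristic(state, target_of) == 0:
            return g
        if g > best.get(state, float("inf")):
            continue                     # stale queue entry
        for nxt in neighbours(state, d):
            if g + 1 < best.get(nxt, float("inf")):
                best[nxt] = g + 1
                f = g + 1 + heuristic(nxt, target_of)
                heapq.heappush(frontier, (f, g + 1, next(tie), nxt))
    return None

# Example: a d = 3 cube with two rings (labels 0 and 1) and six empty vertices.
start = (0, 1, None, None, None, None, None, None)
target_of = {0: 7, 1: 6}                 # ring 0 -> vertex 111, ring 1 -> vertex 110
print(solve_astar(start, target_of, d=3))
```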
Pragmatist Intelligence: Where the Principle of Usefulness Can Take ANNs
Bikić, Antonio, Mukherjee, Sayan
Artificial neural networks (ANNs) perform extraordinarily well on numerous tasks, including classification and prediction, e.g., speech processing and image classification. These capabilities are based on a computational model that is free to select all necessary internal model parameters, as long as it eventually delivers the functionality it is supposed to exhibit. Here, we review the connection between the model parameter selection in machine learning (ML) algorithms running on ANNs and the epistemological theory of neopragmatism, focusing on the theory's utility and anti-representationalist aspects. To understand the consequences of the model parameter selection of an ANN, we suggest using neopragmatist theories whose implications are well studied. Incidentally, neopragmatism's notion of optimization is also based on utility considerations. This means that applying this approach elegantly reveals the inherent connections between optimization in ML, which uses a numerical method during the learning phase, and optimization in the ethical theory of consequentialism, where it occurs as a maxim of action. We suggest that these connections originate from the way relevance is calculated in ML systems. This could ultimately reveal a tendency towards specific actions in ML systems.
Scalable Bayesian inference for the generalized linear mixed model
Berchuck, Samuel I., Medeiros, Felipe A., Mukherjee, Sayan, Agazzi, Andrea
The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference that combines the scalability of modern AI algorithms with the guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., an intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and we apply the method to a large electronic health records database.
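To make the stochastic-gradient MCMC idea concrete, here is a minimal stochastic gradient Langevin dynamics (SGLD) sketch for a toy Bayesian logistic regression; it illustrates the combination of minibatch gradients with injected Gaussian noise that underlies scalable Bayesian samplers, but it is not the paper's GLMM algorithm, and all dimensions, step sizes, and names are illustrative assumptions.

```python
# Toy SGLD: noisy minibatch gradients of the log posterior plus injected Gaussian
# noise yield approximate posterior samples with per-iteration cost O(batch size).
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def grad_log_post(beta, Xb, yb, n, prior_var=10.0):
    """Minibatch estimate of the gradient of the log posterior."""
    probs = 1 / (1 + np.exp(-Xb @ beta))
    grad_lik = (n / len(yb)) * Xb.T @ (yb - probs)   # rescaled likelihood gradient
    grad_prior = -beta / prior_var                   # Gaussian prior N(0, prior_var I)
    return grad_lik + grad_prior

beta = np.zeros(p)
samples = []
step, batch = 1e-4, 256
for t in range(5_000):
    idx = rng.choice(n, size=batch, replace=False)
    g = grad_log_post(beta, X[idx], y[idx], n)
    beta = beta + 0.5 * step * g + np.sqrt(step) * rng.normal(size=p)
    if t > 2_000:                                    # discard burn-in
        samples.append(beta.copy())

samples = np.array(samples)
print("posterior mean:", samples.mean(axis=0))
print("posterior sd:  ", samples.std(axis=0))        # uncertainty quantification
```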
Asymptotics of Bayesian Uncertainty Estimation in Random Features Regression
Baek, Youngsoo, Berchuck, Samuel I., Mukherjee, Sayan
In this paper, we compare and contrast the behavior of the posterior predictive distribution with the risk of the maximum a posteriori (MAP) estimator for the random features regression model in the overparameterized regime. We focus on the variance of the posterior predictive distribution (the Bayesian model average) and compare its asymptotics to those of the risk of the MAP estimator. In the regime where the model dimensions grow faster than any constant multiple of the number of samples, asymptotic agreement between these two quantities is governed by the phase transition in the signal-to-noise ratio. They also asymptotically agree when the number of samples grows faster than any constant multiple of the model dimensions. Numerical simulations illustrate finer distributional properties of the two quantities in finite dimensions. We conjecture that they have Gaussian fluctuations and exhibit properties similar to those found by previous authors in a Gaussian sequence model, which is of independent theoretical interest.
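For concreteness, the following small numerical sketch computes the two objects being compared, in our own notation: for a random features model with a Gaussian prior on the coefficients and Gaussian noise, the MAP estimator is a ridge regression and the posterior predictive variance has a closed form. The feature map, dimensions, and hyperparameters below are illustrative, not the paper's experimental setup.

```python
# Random features regression: MAP (ridge) prediction vs. posterior predictive variance
# under a conjugate Gaussian model y = phi(x)^T theta + noise, theta ~ N(0, tau2 I).
import numpy as np

rng = np.random.default_rng(1)
n, d, N = 200, 30, 600             # samples, input dimension, number of random features
sigma2, tau2 = 0.25, 1.0           # noise variance, prior variance

W = rng.normal(size=(N, d)) / np.sqrt(d)                 # fixed random feature weights
phi = lambda X: np.maximum(W @ X.T, 0).T / np.sqrt(N)    # ReLU random features

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + np.sqrt(sigma2) * rng.normal(size=n)
Phi = phi(X)

# Posterior over theta is Gaussian with covariance sigma2 * A^{-1}.
A = Phi.T @ Phi + (sigma2 / tau2) * np.eye(N)
theta_map = np.linalg.solve(A, Phi.T @ y)                # MAP / ridge estimator

x_test = rng.normal(size=(1, d))
phi_test = phi(x_test)
map_prediction = (phi_test @ theta_map).item()
predictive_var = sigma2 * (phi_test @ np.linalg.solve(A, phi_test.T)).item() + sigma2

print("MAP prediction:", map_prediction)
print("posterior predictive variance:", predictive_var)
```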
Concentration inequalities and optimal number of layers for stochastic deep neural networks
Caprio, Michele, Mukherjee, Sayan
We state concentration inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC) and to give a probabilistic upper bound for the classification error of the EC. We also derive the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with a ReLU activation function.
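The object being bounded can be illustrated with a small Monte Carlo sketch of an expected classifier for a stochastic feedforward ReLU network; the noise model (Gaussian perturbations injected at each hidden layer), the architecture, and all names are our assumptions, not the paper's construction.

```python
# Expected classifier (EC) for a toy stochastic ReLU network, approximated by
# averaging class probabilities over repeated stochastic forward passes.
import numpy as np

rng = np.random.default_rng(2)

def stochastic_forward(x, weights, noise_sd=0.1):
    """One stochastic forward pass: ReLU layers with Gaussian noise injected per layer."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h + noise_sd * rng.normal(size=W.shape[0]), 0)
    logits = weights[-1] @ h
    logits -= logits.max()                       # numerical stability for the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

def expected_classifier(x, weights, n_draws=500):
    """Monte Carlo approximation of the expected classifier."""
    probs = np.mean([stochastic_forward(x, weights) for _ in range(n_draws)], axis=0)
    return int(probs.argmax()), probs

# Toy two-layer network mapping 4 inputs to 3 classes.
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
label, probs = expected_classifier(rng.normal(size=4), weights)
print(label, probs)
```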
Global Optimality of Elman-type RNN in the Mean-Field Regime
Agazzi, Andrea, Lu, Jianfeng, Mukherjee, Sayan
We analyze Elman-type Recurrent Neural Networks (RNNs) and their training in the mean-field regime. Specifically, we show convergence of the gradient descent training dynamics of the RNN to the corresponding mean-field formulation in the large-width limit. We also show that the fixed points of the limiting infinite-width dynamics are globally optimal, under some assumptions on the initialization of the weights. Our results establish optimality for feature learning with wide RNNs in the mean-field regime.
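For reference, a minimal Elman-type RNN forward pass under a mean-field-style $1/N$ readout scaling looks as follows; the architecture, scaling constants, and toy input are illustrative assumptions, and the sketch omits the gradient descent training analyzed in the paper.

```python
# Elman recursion h_t = tanh(W h_{t-1} + U x_t) with a mean-field readout (1/N) a . h_T.
import numpy as np

rng = np.random.default_rng(5)
N, d, T = 512, 3, 10                       # hidden width, input dimension, sequence length

U = rng.normal(size=(N, d))                # input-to-hidden weights
W = rng.normal(size=(N, N)) / np.sqrt(N)   # hidden-to-hidden weights
a = rng.normal(size=N)                     # readout weights

def forward(x_seq):
    """Run the Elman recursion over a sequence and return the scalar readout."""
    h = np.zeros(N)
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t)
    return (a @ h) / N                     # 1/N scaling of the mean-field regime

x_seq = rng.normal(size=(T, d))
print("network output:", forward(x_seq))
```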
At the Intersection of Deep Sequential Model Framework and State-space Model Framework: Study on Option Pricing
Ding, Ziyang, Mukherjee, Sayan
Inference and forecasting problems for nonlinear dynamical systems arise in a variety of contexts. Reservoir computing and deep sequential models, on the one hand, have demonstrated efficient, robust, and superior performance in modeling simple and chaotic dynamical systems. However, their innately deterministic nature limits their robustness to noisy systems, and their inability to offer uncertainty measurements is a further shortcoming of these frameworks. On the other hand, the traditional state-space model framework is robust to noise. It also carries a measure of uncertainty, forming a natural complement to the reservoir computing and deep sequential model framework. We propose the unscented reservoir smoother (URS), a model that unifies deep sequential and state-space models to achieve the strengths of both frameworks. Evaluated in the option pricing setting on noisy datasets, the URS attains highly competitive forecasting accuracy, especially at longer horizons, together with uncertainty measurement. Further extensions and implications of the URS are also discussed, towards a full integration of both frameworks.
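The core building block can be sketched as an unscented transform that pushes a Gaussian latent-state estimate through a nonlinear, reservoir-style transition; the reservoir update, sigma-point parameters, and dimensions below are illustrative assumptions, not the URS model itself.

```python
# Unscented transform: propagate N(mean, cov) through a nonlinear map via sigma points,
# here with an echo-state-style transition standing in for the reservoir dynamics.
import numpy as np

def unscented_transform(mean, cov, f, alpha=1.0, beta=2.0, kappa=0.0):
    """Return the mean and covariance of f(X) for X ~ N(mean, cov), via sigma points."""
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    sqrt_cov = np.linalg.cholesky((n + lam) * cov)
    sigma_pts = np.vstack([mean, mean + sqrt_cov.T, mean - sqrt_cov.T])   # 2n + 1 points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    Y = np.array([f(s) for s in sigma_pts])
    mean_out = wm @ Y
    diff = Y - mean_out
    cov_out = (wc[:, None] * diff).T @ diff
    return mean_out, cov_out

# Reservoir-style (echo state) transition acting on the latent state.
rng = np.random.default_rng(3)
dim = 4
W_res = 0.9 * rng.normal(size=(dim, dim)) / np.sqrt(dim)
transition = lambda h: np.tanh(W_res @ h)

m, P = np.zeros(dim), 0.5 * np.eye(dim)
m_pred, P_pred = unscented_transform(m, P, transition)
print("predicted mean:", m_pred)
print("predicted covariance diagonal:", np.diag(P_pred))
```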
Stanza: A Nonlinear State Space Model for Probabilistic Inference in Non-Stationary Time Series
Yanchenko, Anna K., Mukherjee, Sayan
Time series with long-term structure arise in a variety of contexts, and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting. Traditionally, state space models have been successful in providing uncertainty estimates of trajectories in the latent space. More recently, deep learning, attention-based approaches have achieved state-of-the-art performance for sequence modeling, though they often require large amounts of data and parameters to do so. We propose Stanza, a nonlinear, non-stationary state space model as an intermediate approach to fill the gap between traditional models and modern deep learning approaches for complex time series. Stanza strikes a balance between competitive forecasting accuracy and probabilistic, interpretable inference for highly structured time series. In particular, Stanza achieves forecasting accuracy competitive with deep LSTMs on real-world datasets, especially for multi-step-ahead forecasting.
A Case for Quantifying Statistical Robustness of Specialized Probabilistic AI Accelerators
Zhang, Xiangyu, Mukherjee, Sayan, Lebeck, Alvin R.
Statistical machine learning often uses probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Many accelerators have been proposed that use specialized hardware to address sampling inefficiency, the critical performance bottleneck of probabilistic algorithms. These accelerators usually improve hardware efficiency by using approximation techniques, such as reducing the bit representation, truncating small values to zero, or simplifying the Random Number Generator (RNG). Understanding the influence of these approximations on result quality is crucial to meeting the quality requirements of real applications. Although a common approach is to compare the end-point result quality using community-standard benchmarks and metrics, we argue that a probabilistic architecture should provide some measure (or guarantee) of statistical robustness. This work takes a first step towards quantifying the statistical robustness of specialized hardware MCMC accelerators by proposing three pillars of statistical robustness: sampling quality, convergence diagnostics, and goodness of fit. Each pillar has at least one quantitative metric that can be computed without knowledge of the ground-truth data. We apply this method to analyze the statistical robustness of an MCMC accelerator proposed in previous work, with some modifications, as a case study. The method also applies to other probabilistic accelerators and can be used in design space exploration.
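As an example of a metric that requires no ground truth, the convergence-diagnostic pillar can be illustrated with the Gelman-Rubin R-hat statistic computed across several chains; the toy chains below stand in for samples produced by a hardware accelerator and are purely illustrative.

```python
# Gelman-Rubin R-hat: compares between-chain and within-chain variance; values near 1
# indicate the chains have mixed, with no reference to the true target distribution.
import numpy as np

def gelman_rubin(chains):
    """R-hat for an array of shape (n_chains, n_samples)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(4)
# Four toy chains targeting N(0, 1); in practice these would be the samples produced
# by the hardware MCMC sampler under a given approximation setting.
chains = rng.normal(size=(4, 2_000))
print("R-hat:", gelman_rubin(chains))
```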
Scalable Modeling of Spatiotemporal Data using the Variational Autoencoder: an Application in Glaucoma
Berchuck, Samuel I., Medeiros, Felipe A., Mukherjee, Sayan
As big spatial data becomes increasingly prevalent, classical spatiotemporal (ST) methods often do not scale well. While methods have been developed to account for high-dimensional spatial objects, the setting where there are exceedingly large samples of spatial observations has received less attention. The variational autoencoder (VAE), an unsupervised generative model based on deep learning and approximate Bayesian inference, fills this void using a latent variable specification that is inferred jointly across the large number of samples. In this manuscript, we compare the performance of the VAE with a more classical ST method when analyzing longitudinal visual fields from a large cohort of patients in a prospective glaucoma study. Through simulation and a case study, we demonstrate that the VAE is a scalable method for analyzing ST data, when the goal is to obtain accurate predictions. R code to implement the VAE can be found on GitHub: https://github.com/berchuck/vaeST.

1. Introduction. As high-speed computing and medical imaging become increasingly inexpensive, massive amounts of data are generated that have to be analyzed and are often spatial in nature (Bearden and Thompson, 2017; Smith and Nichols, 2018). In the case of medical imaging, the number of patients that can be imaged has skyrocketed in recent years, allowing for studies that include images from many thousands of patients (Van Essen et al., 2013; Miller et al., 2016). The current spatial statistics literature focuses heavily on scalability in terms of the number of spatial locations (Banerjee, 2017), but largely ignores the setting where a joint model is needed for spatiotemporal (ST) data that are generated from a large cohort. Historically, learning an appropriate generating process in this setting was untenable, typically leading to simplifying assumptions, such as point-wise (PW) modeling of locations across time (Fitzke et al., 1996). In particular, generative models using deep learning have shown great promise in modeling complex distributions $p(x)$ for $x = x_{1:M}$ in some potentially high-dimensional space $\mathcal{X}$. Sampling from $\mathcal{X}$ is often intractable, so instead generative modeling learns a distribution $q(x)$ that can be sampled from and is close to $p(x)$ (Doersch, 2016). As such, generative modeling can be viewed as an approximate method for performing inference in high-dimensional contexts, when there is an overwhelming availability of observations $x$. Generative modeling, and in particular the variational autoencoder (VAE), is well suited for modeling large cohorts of ST data, because it can characterize variability in a spatial data source through joint modeling (Kingma and Welling, 2013).
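For readers unfamiliar with the model class, a minimal VAE sketch in PyTorch illustrates the latent-variable specification and the reparameterized ELBO objective described above; the architecture, dimensions (e.g., 54 inputs standing in for visual-field locations), and training step are illustrative assumptions, and the paper's actual implementation is the R code at https://github.com/berchuck/vaeST, not this code.

```python
# Minimal VAE: encoder maps x to q(z|x) = N(mu, diag(exp(logvar))), decoder maps a
# reparameterized sample z back to x; training minimizes the negative ELBO.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=54, z_dim=8, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization
        return self.dec(z), mu, logvar

def neg_elbo(x_hat, x, mu, logvar):
    recon = ((x_hat - x) ** 2).sum()                               # Gaussian likelihood term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()      # KL(q(z|x) || N(0, I))
    return recon + kl

# One gradient step on a toy batch standing in for vectorized visual fields.
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 54)
x_hat, mu, logvar = model(x)
loss = neg_elbo(x_hat, x, mu, logvar)
loss.backward()
opt.step()
print(float(loss))
```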