 Accuracy


How Many Machines Can We Use in Parallel Computing for Kernel Ridge Regression?

arXiv.org Machine Learning

This paper attempts to solve a basic problem in distributed statistical inference: how many machines can we use in parallel computing? In kernel ridge regression, we address this question in two important settings: nonparametric estimation and hypothesis testing. Specifically, we find a range for the number of machines under which optimal estimation/testing is achievable. The employed empirical processes method provides a unified framework that allows us to handle various regression problems (such as thin-plate splines and nonparametric additive regression) under different settings (such as univariate, multivariate and diverging-dimensional designs). It is worth noting that the upper bounds on the number of machines are proven to be unimprovable (up to a logarithmic factor) in two important cases: smoothing spline regression and Gaussian RKHS regression. Our theoretical findings are backed by thorough numerical studies.
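The role of the number of machines can be illustrated with a minimal sketch. It assumes the common divide-and-conquer scheme in which each machine fits kernel ridge regression on its own shard and the local fits are averaged; the abstract does not spell out the aggregation rule. The Gaussian kernel, the bandwidth, the round-robin sharding, and all function names are illustrative assumptions, not the paper's code.

```python
import math

def gaussian_kernel(x, z, bandwidth=0.2):
    # Assumed kernel; the paper covers several RKHS settings.
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_krr(xs, ys, lam):
    """Fit KRR on one shard by solving (K + n * lam * I) alpha = y."""
    n = len(xs)
    gram = [
        [gaussian_kernel(xi, xj) + (n * lam if i == j else 0.0)
         for j, xj in enumerate(xs)]
        for i, xi in enumerate(xs)
    ]
    alpha = solve(gram, list(ys))
    return lambda x: sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, xs))

def fit_distributed_krr(xs, ys, num_machines, lam):
    """Split the data round-robin across machines and average the local fits."""
    local_fits = [
        fit_krr(xs[k::num_machines], ys[k::num_machines], lam)
        for k in range(num_machines)
    ]
    return lambda x: sum(f(x) for f in local_fits) / len(local_fits)
```

With `num_machines=1` this reduces to ordinary kernel ridge regression. Raising `num_machines` shrinks each local linear system, and the paper's question is how far that can go before the averaged estimator stops being statistically optimal.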


Fairness GAN

arXiv.org Machine Learning

In this paper, we introduce the Fairness GAN, an approach for generating a dataset that is plausibly similar to a given multimedia dataset, but is more fair with respect to protected attributes in allocative decision making. We propose a novel auxiliary classifier GAN that strives for demographic parity or equality of opportunity and show empirical results on several datasets, including the CelebFaces Attributes (CelebA) dataset, the Quick, Draw! dataset, and a dataset of soccer player images and the offenses they were called for. The proposed formulation is well-suited to absorbing unlabeled data; we leverage this to augment the soccer dataset with the much larger CelebA dataset. The methodology tends to improve demographic parity and equality of opportunity while generating plausible images.


Structure Learning from Time Series with False Discovery Control

arXiv.org Machine Learning

We consider the Granger causal structure learning problem from time series data. Granger causal algorithms predict a 'Granger causal effect' between two variables by testing if prediction error of one decreases significantly in the absence of the other variable among the predictor covariates. Almost all existing Granger causal algorithms condition on a large number of variables (all but two variables) to test for effects between a pair of variables. We propose a new structure learning algorithm called MMPC-p, inspired by the well-known MMHC algorithm for non-time series data. We show that under some assumptions, the algorithm provides false discovery rate control. The algorithm is sound and complete when given access to perfect directed information testing oracles. We also outline a novel tester for the linear Gaussian case. We show through extensive experiments that the MMPC-p algorithm scales to larger problems and has improved statistical power compared to the existing state of the art for large sparse graphs. We also apply our algorithm to a global development dataset and validate our findings with subject matter experts.


New Insights into Bootstrapping for Bandits

arXiv.org Machine Learning

We investigate the use of bootstrapping in the bandit setting. We first show that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient and establish a near-linear lower bound on the regret incurred by it under the bandit model with Bernoulli rewards. We show that NPB with an appropriate amount of forced exploration can result in sub-linear albeit sub-optimal regret. As an alternative to NPB, we propose a weighted bootstrapping (WB) procedure. For Bernoulli rewards, WB with multiplicative exponential weights is mathematically equivalent to Thompson sampling (TS) and results in near-optimal regret bounds. Similarly, in the bandit setting with Gaussian rewards, we show that WB with additive Gaussian weights achieves near-optimal regret. Beyond these special cases, we show that WB leads to better empirical performance than TS for several reward distributions bounded on $[0,1]$. For the contextual bandit setting, we give practical guidelines that make bootstrapping simple and efficient to implement and result in good empirical performance on real-world datasets.
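The stated equivalence between weighted bootstrapping and Thompson sampling for Bernoulli rewards can be sketched as follows. The sketch assumes i.i.d. Exp(1) multiplicative weights and one pseudo-success plus one pseudo-failure appended to each arm's history, standing in for a uniform prior. The pseudo-rewards and all function names are assumptions made for illustration, since the abstract does not give these details.

```python
import random

def weighted_bootstrap_mean(rewards, rng):
    """Draw one weighted-bootstrap estimate of an arm's mean reward.

    Assumption: one pseudo-success and one pseudo-failure are appended to the
    history. Each reward gets an independent Exp(1) weight. With s successes
    and f failures the weighted mean is Gamma(s + 1) / (Gamma(s + 1) +
    Gamma(f + 1)), which follows a Beta(s + 1, f + 1) distribution, i.e. the
    Thompson sampling posterior under a uniform prior.
    """
    history = list(rewards) + [1, 0]
    weights = [rng.expovariate(1.0) for _ in history]
    return sum(w * r for w, r in zip(weights, history)) / sum(weights)

def choose_arm(histories, rng):
    """Pull the arm whose bootstrap estimate is largest in this round."""
    scores = [weighted_bootstrap_mean(h, rng) for h in histories]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because each call redraws the weights, the randomness of the estimate plays the role that posterior sampling plays in Thompson sampling, and no explicit posterior needs to be maintained.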


Towards Robust Evaluations of Continual Learning

arXiv.org Machine Learning

Continual learning experiments used in current deep learning papers do not faithfully assess fundamental challenges of learning continually, masking weak points of the suggested approaches instead. We study gaps in such existing evaluations, proposing essential experimental evaluations that are more representative of continual learning's challenges, and suggest a re-prioritization of research efforts in the field. We show that current approaches fail with our new evaluations and, to analyse these failures, we propose a variational loss which unifies many existing solutions to continual learning under a Bayesian framing, as either 'prior-focused' or 'likelihood-focused'. We show that while prior-focused approaches such as EWC and VCL perform well on existing evaluations, they perform dramatically worse when compared to likelihood-focused approaches on other simple tasks.


Stable specification search in structural equation model with latent variables

arXiv.org Machine Learning

In our previous study, we introduced stable specification search for cross-sectional data (S3C). It is an exploratory causal method that combines the concept of stability selection with multi-objective optimization to search for stable and parsimonious causal structures across the entire range of model complexities. In this study, we extended S3C to S3C-Latent, to model causal relations between latent variables. We evaluated S3C-Latent on simulated data and compared the results to those of PC-MIMBuild, an extension of the PC algorithm, the state-of-the-art causal discovery method. The comparison showed that S3C-Latent achieved better performance. We also applied S3C-Latent to real-world data of children with attention deficit/hyperactivity disorder and data about measuring mental abilities among pupils. The results are consistent with those of previous studies.


Boolean Decision Rules via Column Generation

arXiv.org Artificial Intelligence

This paper considers the learning of Boolean rules in either disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) or conjunctive normal form (CNF, AND-of-ORs) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. Column generation (CG) is used to efficiently search over an exponential number of candidate clauses (conjunctions or disjunctions) without the need for heuristic rule mining. This approach also bounds the gap between the selected rule set and the best possible rule set on the training data. To handle large datasets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 7 out of 15 datasets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate.
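To make the accuracy versus simplicity trade-off concrete, here is a minimal sketch of how a DNF rule set classifies a sample and how a penalized cost for a candidate rule set might be scored. The clause representation (a dict of required feature values), the complexity measure (one unit per clause plus one per literal), and the names `predict_dnf` and `penalized_cost` are illustrative assumptions. The paper's integer program and column generation procedure are not reproduced here.

```python
def predict_dnf(rule_set, sample):
    """Return 1 if any clause (an AND of feature == value tests) holds, else 0."""
    return int(any(
        all(sample[feature] == value for feature, value in clause.items())
        for clause in rule_set
    ))

def penalized_cost(rule_set, samples, labels, complexity_weight):
    """Training errors plus a weighted complexity penalty (assumed form)."""
    errors = sum(predict_dnf(rule_set, s) != y for s, y in zip(samples, labels))
    # Assumed complexity measure: one unit per clause plus one per literal.
    complexity = sum(1 + len(clause) for clause in rule_set)
    return errors + complexity_weight * complexity

# Hypothetical toy data: the label is 1 when (a AND b) OR c.
samples = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 0},
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 0},
]
labels = [1, 0, 1, 0]
rules = [{"a": 1, "b": 1}, {"c": 1}]
cost = penalized_cost(rules, samples, labels, complexity_weight=0.1)
```

A search procedure such as column generation would compare many candidate clauses under a cost of this kind. The sketch only scores one fixed rule set.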


Concentric ESN: Assessing the Effect of Modularity in Cycle Reservoirs

arXiv.org Artificial Intelligence

The paper introduces the concentric Echo State Network, an approach to designing reservoir topologies that tries to bridge the gap between deterministically constructed simple cycle models and deep reservoir computing approaches. We show how to modularize the reservoir into simple unidirectional and concentric cycles with pairwise bidirectional jump connections between adjacent loops. We provide a preliminary experimental assessment showing how concentric reservoirs yield superior predictive accuracy and memory capacity with respect to single cycle reservoirs and deep reservoir models.


Do Better ImageNet Models Transfer Better?

arXiv.org Machine Learning

Transfer learning has become a cornerstone of computer vision with the advent of ImageNet features, yet little work has been done to evaluate the performance of ImageNet architectures across different datasets. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 13 classification models on 12 image classification tasks in three settings: as fixed feature extractors, fine-tuned, and trained from random initialization. We find that, when networks are used as fixed feature extractors, ImageNet accuracy is only weakly predictive of accuracy on other tasks ($r^2=0.24$). In this setting, ResNets consistently outperform networks that achieve higher accuracy on ImageNet. When networks are fine-tuned, we observe a substantially stronger correlation ($r^2 = 0.86$). We achieve state-of-the-art performance on eight image classification tasks simply by fine-tuning state-of-the-art ImageNet architectures, outperforming previous results based on specialized methods for transfer learning. Finally, we observe that, on three small fine-grained image classification datasets, networks trained from random initialization perform similarly to ImageNet-pretrained networks. Together, our results show that ImageNet architectures generalize well across datasets, with small improvements in ImageNet accuracy producing improvements across other tasks, but ImageNet features are less general than previously suggested.


Super learning in the SAS system

arXiv.org Machine Learning

Background and objective: Stacking is an ensemble machine learning method that averages predictions from multiple other algorithms, such as generalized linear models and regression trees. A recent iteration of stacking, called super learning, has been developed as a general approach to black box supervised learning and has seen frequent usage, in part due to the availability of an R package. I develop super learning in the SAS software system using a new macro, and demonstrate its performance relative to the R package. Methods: I closely follow previous work using the R SuperLearner package and assess the performance of super learning in a number of domains. I compare the R package with the new SAS macro in a small set of simulations assessing curve fitting in a prediction model, a set of 14 publicly available datasets to assess cross-validated expected loss, and data from a randomized trial of job seekers' training to assess the utility of super learning in causal inference using inverse probability weighting. Results: Across the simulated data and the publicly available data, the macro performed similarly to the R package, even with a different set of potential algorithms available natively in R and SAS. The example with inverse probability weighting demonstrated the ability of the SAS macro to include algorithms developed in R. Conclusions: The super learner macro performs as well as the R package at a number of tasks. Further, by extending the macro to include the use of R packages, the macro can leverage both the robust, enterprise-oriented procedures in SAS and the nimble, cutting-edge packages in R. In the spirit of ensemble learning, this macro extends the potential library of algorithms beyond a single software system and provides a simple avenue into machine learning in SAS.