We thank all reviewers for their valuable comments and suggestions. Table B highlights our differences; methods in the bottom-left corner are better. We will enlarge the figures and add more explanation. In Tables 2 and 3, HIGGS contains 10.5 million training examples. In Table A, we repeat our experiments on 5,000 test examples for each dataset, and we additionally added Bosch (1.2 million examples, 968 features). Our method is effective on both datasets.
Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
We present and study a distributed optimization algorithm based on a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often outperform stochastic gradient descent methods in optimizing regularized loss minimization problems, yet they have received little study in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm for a star network, together with an analysis of the tradeoff between computation and communication. We verify our analysis with experiments on real datasets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performance.
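To make the coordinate-ascent ingredient concrete, here is a minimal single-machine sketch of stochastic dual coordinate ascent for the hinge-loss SVM, using the standard closed-form coordinate update; the distributed variant described above would run such local updates on each worker of the star network and periodically communicate the primal iterate. All function and variable names are illustrative, not from the paper.

```python
import numpy as np

def sdca_svm(X, y, lam=0.1, epochs=50, seed=0):
    """Single-machine SDCA sketch for the hinge-loss SVM.

    Maintains dual variables alpha_i in [0, 1] and the primal iterate
    w = (1/(lam*n)) * sum_i alpha_i * y_i * x_i, updating one random
    coordinate at a time with its closed-form maximizer.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            xi, yi = X[i], y[i]
            slack = 1.0 - yi * (xi @ w)              # hinge slack for example i
            # closed-form coordinate step, projected onto [0, 1]
            delta = np.clip(alpha[i] + lam * n * slack / (xi @ xi), 0.0, 1.0) - alpha[i]
            alpha[i] += delta
            w += delta * yi * xi / (lam * n)         # keep w consistent with alpha
    return w
```

On a linearly separable toy problem this converges in a few epochs; the per-coordinate cost is one inner product, which is what makes trading extra local computation for fewer communication rounds attractive.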
Data Selection: A Surprisingly Effective and General Principle for Building Small Interpretable Models
We present convincing empirical evidence for an effective and general strategy for building accurate small models. Such models are attractive for interpretability and also find use in resource-constrained environments. The strategy is to learn the training distribution instead of using data drawn from the test distribution. The distribution-learning algorithm is not a contribution of this work; our contribution is a rigorous empirical demonstration of the broad usefulness of this simple strategy on a diverse set of tasks. We apply it to the tasks of (1) building cluster explanation trees, (2) prototype-based classification, and (3) classification using Random Forests, and show that it improves the accuracy of weak traditional baselines to the point that they are surprisingly competitive with specialized modern techniques. The strategy is also versatile with respect to the notion of model size. In the first two tasks, model size is given by the number of leaves in the tree and the number of prototypes, respectively. In the final task, involving Random Forests, the strategy is shown to be effective even when model size is determined by more than one factor: the number of trees and their maximum depth. Positive results are presented on multiple datasets and shown to be statistically significant. These lead us to conclude that the strategy is both effective, i.e., leads to significant improvements, and general, i.e., applicable to different tasks and model families, and therefore merits further attention in domains that require small accurate models.
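The core principle, a small model trained on a learned sampling distribution rather than a uniform sample, can be sketched as follows. The paper relies on a dedicated distribution-learning algorithm from prior work; in this hypothetical sketch, random search over softmax sampling weights stands in for it, and the small model is a nearest-mean classifier with one prototype per class. All names and the search procedure are assumptions for illustration.

```python
import numpy as np

def nearest_mean_predict(protos, X):
    """Tiny prototype model: classify by the nearest class mean."""
    P = np.stack([p for p, _ in protos])
    labels = np.array([c for _, c in protos])
    dists = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
    return labels[dists.argmin(axis=1)]

def learn_training_distribution(X, y, X_val, y_val, n_sub=20, trials=30, seed=0):
    """Search over candidate sampling distributions; keep the one whose
    small model scores best on held-out data (stand-in for the paper's
    distribution-learning step)."""
    rng = np.random.default_rng(seed)
    best_protos, best_acc = None, -1.0
    for _ in range(trials):
        logits = rng.normal(size=len(X))              # candidate distribution
        p = np.exp(logits) / np.exp(logits).sum()
        idx = rng.choice(len(X), size=n_sub, replace=False, p=p)
        Xs, ys = X[idx], y[idx]
        protos = [(Xs[ys == c].mean(axis=0), c) for c in np.unique(ys)]
        acc = (nearest_mean_predict(protos, X_val) == y_val).mean()
        if acc > best_acc:
            best_protos, best_acc = protos, acc
    return best_protos, best_acc
```

The point of the sketch is the interface, not the search method: the small model family stays fixed, and only the distribution the training sample is drawn from is optimized.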
Theoretically Better and Numerically Faster Distributed Optimization with Smoothness-Aware Quantization Techniques
Wang, Bokun, Safaryan, Mher, Richtárik, Peter
To address the high communication costs of distributed machine learning, a large body of work has been devoted in recent years to designing various compression strategies, such as sparsification and quantization, and optimization algorithms capable of using them. Recently, Safaryan et al. (2021) pioneered a dramatically different compression design approach: they first use the local training data to form local smoothness matrices and then propose to design a compressor capable of exploiting the smoothness information contained therein. While this novel approach leads to substantial savings in communication, it is limited to sparsification as it crucially depends on the linearity of the compression operator. In this work, we generalize their smoothness-aware compression strategy to arbitrary unbiased compression operators, which also include sparsification. Specializing our results to stochastic quantization, we guarantee significant savings in communication complexity compared to standard quantization. In particular, we prove that block quantization with $n$ blocks theoretically outperforms single block quantization, leading to a reduction in communication complexity by an $\mathcal{O}(n)$ factor, where $n$ is the number of nodes in the distributed system. Finally, we provide extensive numerical evidence with convex optimization problems that our smoothness-aware quantization strategies outperform existing quantization schemes as well as the aforementioned smoothness-aware sparsification strategies with respect to three evaluation metrics: the number of iterations, the total amount of bits communicated, and wall-clock time.
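The quantization primitive underlying the discussion can be sketched as follows: an unbiased stochastic quantizer with a fixed number of levels, and a block variant that quantizes each of several blocks with its own norm. This is a generic QSGD-style sketch under stated assumptions, not the paper's smoothness-aware scheme (which additionally shapes the quantizer using local smoothness matrices); names and defaults are illustrative.

```python
import numpy as np

def quantize(x, s=4, rng=None):
    """Unbiased stochastic quantization with s levels: each coordinate of
    |x|/||x|| is randomly rounded to a neighbouring level, so E[q(x)] = x."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x.copy()
    level = np.abs(x) / norm * s
    low = np.floor(level)
    bump = (rng.random(x.shape) < (level - low)).astype(float)  # randomized rounding
    return norm * np.sign(x) * (low + bump) / s

def block_quantize(x, n_blocks, s=4, rng=None):
    """Block variant: quantize each block with its own norm. Smaller per-block
    norms shrink the quantization variance, which is the mechanism behind the
    O(n) communication-complexity saving the paper proves for n blocks."""
    rng = rng or np.random.default_rng()
    return np.concatenate([quantize(b, s, rng) for b in np.array_split(x, n_blocks)])
```

Unbiasedness is the property the compressed-gradient analysis rests on: averaging many independent quantizations of a vector recovers the vector.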
Active Anomaly Detection via Ensembles
Das, Shubhomoy, Islam, Md Rakibul, Jayakodi, Nitthilan Kannappan, Doppa, Janardhan Rao
In critical applications of anomaly detection including computer security and fraud prevention, the anomaly detector must be configurable by the analyst to minimize the effort on false positives. One important way to configure the anomaly detector is by providing true labels for a few instances. We study the problem of label-efficient active learning to automatically tune anomaly detection ensembles and make four main contributions. First, we present an important insight into how anomaly detector ensembles are naturally suited for active learning. This insight allows us to relate the greedy querying strategy to uncertainty sampling, with implications for label-efficiency. Second, we present a novel formalism called compact description to describe the discovered anomalies and show that it can also be employed to improve the diversity of the instances presented to the analyst without loss in the anomaly discovery rate. Third, we present a novel data drift detection algorithm that not only detects the drift robustly, but also allows us to take corrective actions to adapt the detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and algorithms in both batch and streaming settings. Our results show that in addition to discovering significantly more anomalies than state-of-the-art unsupervised baselines, our active learning algorithms under the streaming-data setup are competitive with the batch setup.
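The greedy querying loop described above can be sketched in a few lines: score every instance with a weighted ensemble, show the analyst the top-ranked unlabeled instance, and re-weight the detectors based on the label. The perceptron-style weight update below is a simple stand-in for the paper's optimization-based re-weighting; the score matrix, learning rate, and update rule are all assumptions for illustration.

```python
import numpy as np

def active_anomaly_loop(S, labels, budget=10, lr=0.5):
    """Greedy active-learning sketch for an anomaly ensemble.

    S[i, j] is detector j's anomaly score for instance i; w is the ensemble
    weight vector. Each round queries the unlabeled instance with the highest
    weighted score and nudges w toward detectors consistent with the label.
    """
    n, m = S.shape
    w = np.ones(m) / m                   # start from uniform ensemble weights
    queried = []
    for _ in range(budget):
        scores = S @ w
        scores[queried] = -np.inf        # never re-query an instance
        i = int(np.argmax(scores))       # greedy query of the top-ranked instance
        queried.append(i)
        sign = 1.0 if labels[i] == 1 else -1.0   # analyst feedback (1 = anomaly)
        w = np.clip(w + lr * sign * S[i], 0.0, None)
        total = w.sum()
        w = w / total if total > 0 else np.ones(m) / m
    return w, queried
```

The sketch makes the paper's insight visible: detectors whose scores agree with confirmed anomalies gain weight, so later queries concentrate on the detectors the analyst has implicitly endorsed.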
SVRG meets SAGA: k-SVRG --- A Tale of Limited Memory
Raj, Anant, Stich, Sebastian U.
In recent years, many variance-reduced algorithms for empirical risk minimization have been introduced. In contrast to vanilla SGD, these methods converge linearly on strongly convex problems. To obtain the variance reduction, current methods either require frequent passes over the full data to recompute gradients, without making any progress during this time (as in SVRG), or they require memory of the same size as the input problem (as in SAGA). In this work, we propose k-SVRG, an algorithm that interpolates between these two extremes: it makes the best use of the available memory and, in turn, avoids full passes over the data during which no progress is made. We prove linear convergence of k-SVRG on strongly convex problems and convergence to stationary points on non-convex problems. Numerical experiments show the effectiveness of our method.
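For context, here is a minimal sketch of one of the two extremes that k-SVRG interpolates between: vanilla SVRG on a least-squares objective. The full-gradient recomputation at each snapshot is exactly the "pass over the data without progress" the abstract refers to (SAGA, the other extreme, instead stores one gradient per example). Function names and hyperparameters are illustrative.

```python
import numpy as np

def svrg_least_squares(A, b, lr=0.05, n_outer=30, seed=0):
    """Vanilla SVRG sketch for f(x) = (1/2n) * ||Ax - b||^2.

    Each outer iteration recomputes the full gradient at a snapshot point,
    then takes n cheap variance-reduced stochastic steps.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_outer):
        snapshot = x.copy()
        full_grad = A.T @ (A @ snapshot - b) / n        # full pass, no progress made
        for _ in range(n):                              # inner loop of cheap steps
            i = rng.integers(n)
            gi = A[i] * (A[i] @ x - b[i])               # stochastic gradient at x
            gi_snap = A[i] * (A[i] @ snapshot - b[i])   # same gradient at the snapshot
            x -= lr * (gi - gi_snap + full_grad)        # variance-reduced update
    return x
```

The control variate `gi - gi_snap + full_grad` is unbiased for the full gradient at `x`, and its variance vanishes as `x` approaches the snapshot, which is what yields linear convergence on strongly convex problems.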