Performance Analysis
Evaluating Post-Training Compression in GANs using Locality-Sensitive Hashing
Mordido, Gonçalo, Yang, Haojin, Meinel, Christoph
The analysis of the compression effects in generative adversarial networks (GANs) after training, i.e. without any fine-tuning, remains an unstudied, albeit important, topic with the increasing trend of their computation and memory requirements. While existing works discuss the difficulty of compressing GANs during training, requiring novel methods designed with the instability of GANs training in mind, we show that existing compression methods (namely clipping and quantization) may be directly applied to compress GANs post-training, without any additional changes. High compression levels may distort the generated set, likely leading to an increase of outliers that may negatively affect the overall assessment of existing k-nearest neighbor (KNN) based metrics. We propose two new precision and recall metrics based on locality-sensitive hashing (LSH), which, on top of increasing the outlier robustness, decrease the complexity of assessing an evaluation sample against $n$ reference samples from $O(n)$ to $O(\log(n))$, if using LSH and KNN, and to $O(1)$, if only applying LSH. We show that low-bit compression of several pre-trained GANs on multiple datasets induces a trade-off between precision and recall, retaining sample quality while sacrificing sample diversity.
Recognizing LTLf/PLTLf Goals in Fully Observable Non-Deterministic Domain Models
Pereira, Ramon Fraga, Fuggitti, Francesco, De Giacomo, Giuseppe
Goal Recognition is the task of discerning the correct intended goal that an agent aims to achieve, given a set of possible goals, a domain model, and a sequence of observations as a sample of the plan being executed in the environment. Existing approaches assume that the possible goals are formalized as a conjunction in deterministic settings. In this paper, we develop a novel approach that is capable of recognizing temporally extended goals in Fully Observable Non-Deterministic (FOND) planning domain models, focusing on goals on finite traces expressed in Linear Temporal Logic (LTLf) and (Pure) Past Linear Temporal Logic (PLTLf). We empirically evaluate our goal recognition approach using different LTLf and PLTLf goals over six common FOND planning domain models, and show that our approach is accurate to recognize temporally extended goals at several levels of observability.
Detecting Label Noise via Leave-One-Out Cross Validation
Tang, Yu-Hang, Zhu, Yuanran, de Jong, Wibe A.
We present a simple algorithm for identifying and correcting real-valued noisy labels from a mixture of clean and corrupted samples using Gaussian process regression. A heteroscedastic noise model is employed, in which additive Gaussian noise terms with independent variances are associated with each and all of the observed labels. Thus, the method effectively applies a sample-specific Tikhonov regularization term, generalizing the uniform regularization prevalent in standard Gaussian process regression. Optimizing the noise model using maximum likelihood estimation leads to the containment of the GPR model's predictive error by the posterior standard deviation in leave-one-out cross-validation. A multiplicative update scheme is proposed for solving the maximum likelihood estimation problem under non-negative constraints. While we provide a proof of monotonic convergence for certain special cases, the multiplicative scheme has empirically demonstrated monotonic convergence behavior in virtually all our numerical experiments. We show that the presented method can pinpoint corrupted samples and lead to better regression models when trained on synthetic and real-world scientific data sets.
Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation
Carrington, André M., Manuel, Douglas G., Fieguth, Paul W., Ramsay, Tim, Osmani, Venet, Wernly, Bernhard, Bennett, Carol, Hawken, Steven, McInnes, Matthew, Magwood, Olivia, Sheikh, Yusuf, Holzinger, Andreas
Optimal performance is critical for decision-making tasks from medicine to autonomous driving, however common performance measures may be too general or too specific. For binary classifiers, diagnostic tests or prognosis at a timepoint, measures such as the area under the receiver operating characteristic curve, or the area under the precision recall curve, are too general because they include unrealistic decision thresholds. On the other hand, measures such as accuracy, sensitivity or the F1 score are measures at a single threshold that reflect an individual single probability or predicted risk, rather than a range of individuals or risk. We propose a method in between, deep ROC analysis, that examines groups of probabilities or predicted risks for more insightful analysis. We translate esoteric measures into familiar terms: AUC and the normalized concordant partial AUC are balanced average accuracy (a new finding); the normalized partial AUC is average sensitivity; and the normalized horizontal partial AUC is average specificity. Along with post-test measures, we provide a method that can improve model selection in some cases and provide interpretation and assurance for patients in each risk group. We demonstrate deep ROC analysis in two case studies and provide a toolkit in Python.
Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges
Rudin, Cynthia, Chen, Chaofan, Chen, Zhi, Huang, Haiyang, Semenova, Lesia, Zhong, Chudi
Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the "Rashomon set" of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.
Predicting Hard Drive Failure with Machine Learning - Datto Engineering Blog
We've all had a hard drive fail on us, and often it's as sudden as booting your machine and realizing you can't access a bunch of your files. It's especially not fun when you have an entire data center full of drives that are all important to keeping your business running. What if we could predict when one of those drives would fail, and get ahead of it by preemptively replacing the hardware before the data is lost? This is where the history of predictive drive failure at Datto begins. First and foremost, to make a prediction you need data. Hard drives have a built-in utility called SMART (Self-Monitoring, Analysis and Reporting Technology) that reports an array of statistics about how the drive is functioning. Here's an abbreviated view of what that looks like: Datto collects a report like this from each hard drive in its storage servers once per day. Each attribute in the report has three important numbers associated with it: value, thresh, and worst. Each attribute also has a feature named raw_value, but this is discarded due to inconsistent reporting standards between drive manufacturers. The value reflects how well the drive is operating with respect to the attribute, with 1 being the worst and 253 being the best. The initial value is arbitrarily determined by the manufacturer, and can vary by drive model. Thresh: A threshold below which the value should not fall in normal operation.
Empirical Analysis of Machine Learning Configurations for Prediction of Multiple Organ Failure in Trauma Patients
Wang, Yuqing, Zhao, Yun, Callcut, Rachael, Petzold, Linda
Multiple organ failure (MOF) is a life-threatening condition. Due to its urgency and high mortality rate, early detection is critical for clinicians to provide appropriate treatment. In this paper, we perform quantitative analysis on early MOF prediction with comprehensive machine learning (ML) configurations, including data preprocessing (missing value treatment, label balancing, feature scaling), feature selection, classifier choice, and hyperparameter tuning. Results show that classifier choice impacts both the performance improvement and variation most among all the configurations. In general, complex classifiers including ensemble methods can provide better performance than simple classifiers. However, blindly pursuing complex classifiers is unwise as it also brings the risk of greater performance variation.
The Case for High-Accuracy Classification: Think Small, Think Many!
Hosseini, Mohammad, Hasan, Mahmudul
To facilitate implementation of high-accuracy deep neural networks especially on resource-constrained devices, maintaining low computation requirements is crucial. Using very deep models for classification purposes not only decreases the neural network training speed and increases the inference time, but also need more data for higher prediction accuracy and to mitigate false positives. In this paper, we propose an efficient and lightweight deep classification ensemble structure based on a combination of simple color features, which is particularly designed for "high-accuracy" image classifications with low false positives. We designed, implemented, and evaluated our approach for explosion detection use-case applied to images and videos. Our evaluation results based on a large test test show considerable improvements on the prediction accuracy compared to the popular ResNet-50 model, while benefiting from 7.64x faster inference and lower computation cost. While we applied our approach to explosion detection, our approach is general and can be applied to other similar classification use cases as well. Given the insight gained from our experiments, we hence propose a "think small, think many" philosophy in classification scenarios: that transforming a single, large, monolithic deep model into a verification-based step model ensemble of multiple small, simple, lightweight models with narrowed-down color spaces can possibly lead to predictions with higher accuracy.
Evaluation of soccer team defense based on prediction models of ball recovery and being attacked
Toda, Kosuke, Teranishi, Masakiyo, Kushiro, Keisuke, Fujii, Keisuke
With the development of measurement technology, data on the movements of actual games in various sports are available and are expected to be used for planning and evaluating the tactics and strategy. In particular, defense in team sports is generally difficult to be evaluated because of the lack of statistical data. Conventional evaluation methods based on predictions of scores are considered unreliable and predict rare events throughout the entire game, and it is difficult to evaluate various plays leading up to a score. On the other hand, evaluation methods based on certain plays that lead to scoring and dominant regions are sometimes unsuitable to evaluate the performance (e.g., goals scored) of players and teams. In this study, we propose a method to evaluate team defense from a comprehensive perspective related to team performance based on the prediction of ball recovery and being attacked, which occur more frequently than goals, using player actions and positional data of all players and the ball. Using data from 45 soccer matches, we examined the relationship between the proposed index and team performance in actual matches and throughout a season. Results show that the proposed classifiers more accurately predicted the true events than the existing classifiers which were based on rare events (i.e., goals). Also, the proposed index had a moderate correlation with the long-term outcomes of the season. These results suggest that the proposed index might be a more reliable indicator rather than winning or losing with the inclusion of accidental factors.
Different Data Splitting Cross-Validation Strategies with Python
In this article, we will cover the cross-validation methods to split the data set uniformly to get good performance on prediction. We see how our data is splitting into the training set and testing set in our machine learning algorithm. But, if you ever tried to think that these two sets are enough to build the production model. From my point of view, we should include the validation set before we predict the test set. It is important because if the model gets overfit then we can tuning the hyperparameters after checking with the validation set and set the good parameter for our test set.