Performance Analysis
Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification
Complex classification performance metrics such as the F${}_\beta$-measure and Jaccard index are often used to handle class-imbalanced cases such as information retrieval and image segmentation. These performance metrics are not decomposable, that is, they cannot be expressed in a per-example manner, which hinders a straightforward application of the M-estimation widely used in supervised learning. In this paper, we consider \emph{linear-fractional metrics}, a family of classification performance metrics that encompasses many standard metrics such as the F${}_\beta$-measure and Jaccard index, and propose methods to directly maximize performance under those metrics. A key to tackling their direct optimization is a \emph{calibrated surrogate utility}, which is a tractable lower bound of the true utility function representing a given metric. We characterize sufficient conditions which make the surrogate maximization coincide with the maximization of the true utility. To the best of our knowledge, this is the first surrogate calibration analysis for the linear-fractional metrics. We also propose gradient-based optimization algorithms and show their practical usefulness in experiments.
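To make the term "linear-fractional" concrete, the sketch below writes the F${}_\beta$-measure and the Jaccard index as ratios whose numerator and denominator are both linear in the confusion counts. The function names and the toy labels are illustrative assumptions, and the surrogate utility proposed in the paper is not reproduced here.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives (labels are 0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F_beta as a ratio of two expressions that are linear in (tp, fp, fn)."""
    b2 = beta ** 2
    numerator = (1 + b2) * tp
    denominator = (1 + b2) * tp + b2 * fn + fp
    return numerator / denominator if denominator else 0.0


def jaccard(tp, fp, fn):
    """Jaccard index, again a ratio of linear expressions in the counts."""
    denominator = tp + fp + fn
    return tp / denominator if denominator else 0.0


# Toy, made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn = confusion_counts(y_true, y_pred)
print(f_beta(tp, fp, fn, beta=1.0), jaccard(tp, fp, fn))
```

Because the counts are sums over the whole sample and they sit inside a ratio, neither metric splits into an average of per-example terms, which is the non-decomposability the abstract refers to.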
Graph DNA: Deep Neighborhood Aware Graph Encoding for Collaborative Filtering
Wu, Liwei, Yu, Hsiang-Fu, Rao, Nikhil, Sharpnack, James, Hsieh, Cho-Jui
In this paper, we consider recommender systems with side information in the form of graphs. Existing collaborative filtering algorithms mainly utilize only immediate neighborhood information and have a hard time taking advantage of deeper neighborhoods beyond 1-2 hops. The main obstacle to exploiting deeper graph information is the rapidly growing time and space complexity when incorporating information from these neighborhoods. We propose Graph DNA, a novel Deep Neighborhood Aware graph encoding algorithm, for exploiting deeper neighborhood information. DNA encoding computes approximate deep neighborhood information in linear time using Bloom filters, a space-efficient probabilistic data structure, and results in a per-node encoding that is logarithmic in the number of nodes in the graph. It can be used in conjunction with both feature-based and graph-regularization-based collaborative filtering algorithms. Graph DNA has the advantages of being memory and time efficient and providing additional regularization when compared to directly using higher-order graph information. We conduct experiments on real-world datasets, showing that Graph DNA can be easily used with four popular collaborative filtering algorithms and consistently leads to a performance boost with little computational and memory overhead.
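The idea of encoding a multi-hop neighborhood in a Bloom filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each node starts with a filter containing only itself, and at every hop a node's filter is OR-ed with the filters of its direct neighbors. The names `neighborhood_encoding`, `num_bits` and `num_hashes`, and the choice of SHA-256 for hashing, are all hypothetical.

```python
import hashlib


def bloom_positions(item, num_bits, num_hashes):
    """Map an item to num_hashes bit positions using seeded SHA-256 digests."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def bloom_add(bits, item, num_bits, num_hashes):
    """Return the filter (stored as a Python int) with the item's bits set."""
    for pos in bloom_positions(item, num_bits, num_hashes):
        bits |= 1 << pos
    return bits


def bloom_contains(bits, item, num_bits, num_hashes):
    """True if the item may be in the filter; false positives are possible."""
    return all((bits >> pos) & 1 for pos in bloom_positions(item, num_bits, num_hashes))


def neighborhood_encoding(adjacency, depth, num_bits=64, num_hashes=3):
    """Approximate depth-hop neighborhoods: one fixed-size filter per node."""
    filters = {node: bloom_add(0, node, num_bits, num_hashes) for node in adjacency}
    for _ in range(depth):
        updated = {}
        for node, neighbors in adjacency.items():
            merged = filters[node]
            for neighbor in neighbors:
                merged |= filters[neighbor]  # union of Bloom filters is a bitwise OR
            updated[node] = merged
        filters = updated
    return filters


# Path graph 0 - 1 - 2 - 3 - 4 (toy example).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
codes = neighborhood_encoding(path, depth=2)
print(bloom_contains(codes[0], 2, 64, 3))  # node 2 is within two hops of node 0
```

Each hop touches every edge a constant number of times, so the cost per hop is linear in the number of edges, and the per-node storage stays at `num_bits` regardless of how large the true neighborhood grows.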
Ultimate Power of Inference Attacks: Privacy Risks of High-Dimensional Models
Murakonda, Sasi Kumar, Shokri, Reza, Theodorakopoulos, George
Models leak information about their training data. This enables attackers to infer sensitive information about their training sets, notably to determine whether a data sample was part of the model's training set. Existing works empirically show the possibility of these tracing (membership inference) attacks against complex models with a large number of parameters. However, the attack results are dependent on the specific training data, can be obtained only after the tedious process of training the model and performing the attack, and lack any measure of the confidence and unused potential power of the attack. A model designer is interested in identifying which model structures leak more information, how adding new parameters to the model increases its privacy risk, and how much adding new data points decreases the overall information leakage. The privacy analysis should also enable designing the most powerful inference attack. In this paper, we design a theoretical framework to analyze the maximum power of tracing attacks against high-dimensional models, with a focus on probabilistic graphical models. We provide a tight upper bound on the power (true positive rate) of these attacks, with respect to their error (false positive rate). The bound, as it should be, is independent of the knowledge and algorithm of any specific attack, as well as the values of particular samples in the training set. It provides a measure of the potential leakage of a model given its structure, as a function of the structure complexity and the size of the training set.
Vector-Valued Graph Trend Filtering with Non-Convex Penalties
Varma, Rohan, Lee, Harlin, Kovačević, Jelena, Chi, Yuejie
We study the denoising of piecewise smooth graph signals that exhibit inhomogeneous levels of smoothness over a graph, where the value at each node can be vector-valued. We extend the graph trend filtering framework to denoising vector-valued graph signals with a family of non-convex regularizers that exhibit superior recovery performance over existing convex regularizers. We establish the statistical error rates of first-order stationary points of the proposed non-convex method for generic graphs using oracle inequalities. We further present an ADMM-based algorithm to solve the proposed method and analyze its convergence. We present numerical experiments on both synthetic and real-world data for denoising, support recovery, and semi-supervised classification.
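As a rough illustration of why a non-convex penalty can help, the sketch below evaluates a first-order, vector-valued trend filtering objective with the minimax concave penalty (MCP). The abstract does not name its regularizers, so MCP, the use of plain edge differences, and every function name here are assumptions made for illustration only.

```python
import math


def mcp(x, lam, gamma):
    """Minimax concave penalty: behaves like lam*|x| near zero, then flattens."""
    a = abs(x)
    if a <= gamma * lam:
        return lam * a - a * a / (2 * gamma)
    return gamma * lam * lam / 2


def edge_difference_norms(edges, signal):
    """Euclidean norm of the difference between the vectors at each edge's endpoints."""
    return [
        math.sqrt(sum((u - v) ** 2 for u, v in zip(signal[i], signal[j])))
        for i, j in edges
    ]


def objective(edges, noisy, estimate, lam, gamma):
    """Squared-error fit plus an MCP penalty on the edge differences of the estimate."""
    fit = 0.5 * sum(
        (y - b) ** 2
        for y_vec, b_vec in zip(noisy, estimate)
        for y, b in zip(y_vec, b_vec)
    )
    penalty = sum(mcp(d, lam, gamma) for d in edge_difference_norms(edges, estimate))
    return fit + penalty


# Toy path graph with one sharp jump between nodes 1 and 2.
edges = [(0, 1), (1, 2)]
signal = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
print(objective(edges, signal, signal, lam=1.0, gamma=2.0))
```

With these toy values the jump of size 5 costs 1.0 under MCP, whereas an $\ell_1$ penalty with the same `lam` would charge 5.0; the bounded cost of large differences is what lets such penalties preserve sharp boundaries with less shrinkage.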
Pre-training Graph Neural Networks
Hu, Weihua, Liu, Bowen, Gomes, Joseph, Zitnik, Marinka, Liang, Percy, Pande, Vijay, Leskovec, Jure
Many applications of machine learning in science and medicine, including molecular property and protein function prediction, can be cast as problems of predicting some properties of graphs, where having good graph representations is critical. However, two key challenges in these domains are (1) extreme scarcity of labeled data due to expensive lab experiments, and (2) the need to extrapolate to test graphs that are structurally different from those seen during training. In this paper, we explore pre-training to address both of these challenges. In particular, working with Graph Neural Networks (GNNs) for representation learning of graphs, we wish to obtain node representations that (1) capture similarity of nodes' network neighborhood structure, (2) can be composed to give accurate graph-level representations, and (3) capture domain knowledge. To achieve these goals, we propose a series of methods to pre-train GNNs at both the node level and the graph level, using both unlabeled data and labeled data from related auxiliary supervised tasks. We perform extensive evaluation on two applications, molecular property and protein function prediction. We observe that performing only graph-level supervised pre-training often leads to marginal performance gain or can even worsen performance compared to non-pre-trained models. On the other hand, effectively combining both node- and graph-level pre-training techniques significantly improves generalization to out-of-distribution graphs, consistently outperforming non-pre-trained GNNs across 8 datasets in molecular property prediction (resp. 40 tasks in protein function prediction), with an average ROC-AUC improvement of 7.2% (resp. 11.7%).
Fairness and Missing Values
Martínez-Plumed, Fernando, Ferri, Cèsar, Nieves, David, Hernández-Orallo, José
The causes underlying unfair decision making are complex, being internalised in different ways by decision makers, other actors dealing with data and models, and ultimately by the individuals being affected by these decisions. One frequent manifestation of all these latent causes arises in the form of missing values: protected groups are more reluctant to give information that could be used against them, delicate information for some groups can be erased by human operators, or data acquisition may simply be less complete and systematic for minority groups. As a result, missing values and bias in data are two phenomena that are tightly coupled. However, most recent techniques, libraries and experimental results dealing with fairness in machine learning have simply ignored missing data. In this paper, we claim that fairness research should not miss the opportunity to deal properly with missing data. To support this claim, (1) we analyse the sources of missing data and bias, and we map the common causes, (2) we find that rows containing missing values are usually fairer than the rest, and so should not be treated as uncomfortable, ugly data that different techniques and libraries get rid of at the first occasion, and (3) we study the trade-off between performance and fairness when the rows with missing values are used (either because the technique deals with them directly or through imputation methods). We end the paper with a series of recommended procedures about what to do with missing data when aiming for fair decision making.
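One simple way to check how rows with missing values compare with complete rows is to compute a group fairness measure on each subset separately. The sketch below uses statistical parity difference as that measure; the abstract does not say which fairness metric the authors use, so the metric, the helper names, and the toy records (made-up values, not results from the paper) are all assumptions.

```python
def has_missing(row):
    """A row counts as incomplete if any of its fields is None."""
    return any(value is None for value in row.values())


def positive_rate(rows, outcome_key):
    """Fraction of rows with a positive (1) outcome; NaN for an empty subset."""
    if not rows:
        return float("nan")
    return sum(row[outcome_key] for row in rows) / len(rows)


def statistical_parity_difference(rows, group_key, outcome_key, privileged):
    """Positive rate of the unprivileged group minus that of the privileged group."""
    priv = [row for row in rows if row[group_key] == privileged]
    unpriv = [row for row in rows if row[group_key] != privileged]
    return positive_rate(unpriv, outcome_key) - positive_rate(priv, outcome_key)


# Toy records, invented for illustration; group and label are assumed observed.
rows = [
    {"group": "a", "income": 10, "label": 1},
    {"group": "a", "income": 12, "label": 1},
    {"group": "b", "income": 9, "label": 0},
    {"group": "b", "income": 11, "label": 1},
    {"group": "a", "income": None, "label": 1},
    {"group": "b", "income": None, "label": 1},
]
complete = [row for row in rows if not has_missing(row)]
incomplete = [row for row in rows if has_missing(row)]
print(statistical_parity_difference(complete, "group", "label", privileged="a"))
print(statistical_parity_difference(incomplete, "group", "label", privileged="a"))
```

A value closer to zero indicates more similar positive rates across groups. Dropping incomplete rows before training removes whatever the second subset contributes, which is the practice the abstract argues against.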
Harnessing Slow Dynamics in Neuromorphic Computation
Neuromorphic computing is a nascent research field in which models and devices are designed to process information by emulating biological neural systems. Thanks to their superior energy efficiency, analog neuromorphic systems are highly promising for embedded, wearable, and implantable systems. However, optimizing neural networks deployed on these systems is challenging. One main challenge is the so-called timescale mismatch: the dynamics of analog circuits tend to be too fast to process real-time sensory inputs. In this thesis, we propose a few working solutions to slow down the dynamics of on-chip spiking neural networks. We empirically show that, by harnessing slow dynamics, spiking neural networks on analog neuromorphic systems can gain non-trivial performance boosts on a battery of real-time signal processing tasks.
Importance of user inputs while using incremental learning to personalize human activity recognition models
Siirtola, Pekka, Koskimäki, Heli, Röning, Juha
In this study, the importance of user inputs is studied in the context of personalizing human activity recognition models using incremental learning. Inertial sensor data from three body positions are used, and the classification is based on the Learn++ ensemble method. Three different approaches to updating models are compared: non-supervised, semi-supervised and supervised. The non-supervised approach relies fully on predicted labels, the supervised approach relies fully on user-labeled data, and the proposed semi-supervised method is a combination of the two. Our experiments show that by relying on predicted labels with high confidence, and asking the user to label only uncertain observations (from 12% to 26% of the observations, depending on the base classifier used), error rates almost as low as those of the supervised approach can be achieved: the difference was less than 2 percentage points. Moreover, unlike the non-supervised approach, the semi-supervised approach does not suffer from drastic concept drift; as a result, the error rate of the non-supervised approach is over 5 percentage points higher than that of the semi-supervised approach.
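The semi-supervised update described above can be sketched as a routing rule: keep the predicted label when the model is confident, and ask the user otherwise. The threshold value, the `ask_user` callback and the toy predictions below are hypothetical; the abstract does not specify how confidence is measured or which cut-off is used.

```python
def choose_labels(predictions, confidence_threshold, ask_user):
    """Pick a training label for each observation.

    predictions: list of (predicted_label, confidence) pairs.
    ask_user: callable taking the observation index and returning the user's label.
    Returns the chosen labels and the fraction of observations sent to the user.
    """
    labels = []
    asked = 0
    for index, (predicted_label, confidence) in enumerate(predictions):
        if confidence >= confidence_threshold:
            labels.append(predicted_label)  # trust the model's own prediction
        else:
            labels.append(ask_user(index))  # uncertain: request a user label
            asked += 1
    fraction_asked = asked / len(predictions) if predictions else 0.0
    return labels, fraction_asked


# Toy example: the "user" is simulated by a list of true activities.
true_activities = ["walk", "walk", "walk", "cycle"]
predictions = [("walk", 0.95), ("run", 0.55), ("walk", 0.90), ("cycle", 0.60)]
labels, fraction_asked = choose_labels(
    predictions, confidence_threshold=0.8, ask_user=lambda i: true_activities[i]
)
print(labels, fraction_asked)
```

Setting the threshold to 0 reproduces the non-supervised approach (no questions asked), and setting it above the maximum confidence reproduces the supervised one (every observation labeled by the user).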
The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial
Ghojogh, Benyamin, Crowley, Mark
In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances, where we make use of Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples, $K$-fold and leave-one-out cross validation. We briefly introduce generalized cross validation and then move on to regularization, where we use SURE again. We work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and explained as both an additive model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting.
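The two cross validation schemes named above differ only in the number of folds, which the following sketch makes explicit by generating index splits. It is a minimal illustration with hypothetical function names; it uses contiguous folds without shuffling, which is an assumption rather than something the tutorial prescribes.

```python
def k_fold_indices(n_samples, k):
    """Return k (training_indices, validation_indices) pairs over contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((training, validation))
        start += size
    return splits


def leave_one_out_indices(n_samples):
    """Leave-one-out is K-fold with K equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)


for training, validation in k_fold_indices(5, 2):
    print(training, validation)
```

Every sample lands in exactly one validation fold, so averaging the validation error over the folds uses each observation once for evaluation and K-1 times for training.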
Evaluating time series forecasting models: An empirical study on performance estimation methods
Cerqueira, Vitor, Torgo, Luis, Mozetic, Igor
Performance estimation aims at estimating the loss that a predictive model will incur on unseen data. These procedures are part of the pipeline in every machine learning project and are used for assessing the overall generalisation ability of predictive models. In this paper we address the application of these methods to time series forecasting tasks. For independent and identically distributed data the most common approach is cross-validation. However, the dependency among observations in time series raises some caveats about the most appropriate way to estimate performance in this type of data, and currently there is no settled way to do so. We compare different variants of cross-validation and of out-of-sample approaches using two case studies: one with 62 real-world time series and another with three synthetic time series. Results show noticeable differences in the performance estimation methods in the two scenarios. In particular, empirical experiments suggest that cross-validation approaches can be applied to stationary time series. However, in real-world scenarios, when different sources of non-stationary variation are at play, the most accurate estimates are produced by out-of-sample methods that preserve the temporal order of observations.
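Out-of-sample estimation that preserves temporal order can be sketched with two common splitting schemes: a single holdout at the end of the series and a rolling-origin evaluation with a growing training window. The function names and parameters are hypothetical, and the paper compares more variants than the two shown here.

```python
def holdout_split(n_samples, test_fraction=0.3):
    """Train on the first part of the series, test on the final part."""
    n_test = max(1, int(round(n_samples * test_fraction)))
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))


def rolling_origin_splits(n_samples, initial_train, horizon):
    """Repeatedly train on everything up to an origin and test on the next horizon steps."""
    splits = []
    origin = initial_train
    while origin + horizon <= n_samples:
        training = list(range(origin))                   # all observations before the origin
        testing = list(range(origin, origin + horizon))  # the block right after it
        splits.append((training, testing))
        origin += horizon
    return splits


print(holdout_split(10, test_fraction=0.3))
for training, testing in rolling_origin_splits(6, initial_train=2, horizon=2):
    print(training, testing)
```

In both schemes every training index precedes every test index, so the model is never evaluated on observations older than those it was fitted on, unlike standard cross-validation with shuffled folds.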