Support Vector Machines
Random Warping Series: A Random Features Method for Time-Series Embedding
Wu, Lingfei, Yen, Ian En-Hsu, Yi, Jinfeng, Xu, Fangli, Lei, Qi, Witbrock, Michael
Time series data analytics has been a problem of substantial interests for decades, and Dynamic Time Warping (DTW) has been the most widely adopted technique to measure dissimilarity between time series. A number of global-alignment kernels have since been proposed in the spirit of DTW to extend its use to kernel-based estimation method such as support vector machine. However, those kernels suffer from diagonal dominance of the Gram matrix and a quadratic complexity w.r.t. the sample size. In this work, we study a family of alignment-aware positive definite (p.d.) kernels, with its feature embedding given by a distribution of \emph{Random Warping Series (RWS)}. The proposed kernel does not suffer from the issue of diagonal dominance while naturally enjoys a \emph{Random Features} (RF) approximation, which reduces the computational complexity of existing DTW-based techniques from quadratic to linear in terms of both the number and the length of time-series. We also study the convergence of the RF approximation for the domain of time series of unbounded length. Our extensive experiments on 16 benchmark datasets demonstrate that RWS outperforms or matches state-of-the-art classification and clustering methods in both accuracy and computational time. Our code and data is available at { \url{https://github.com/IBM/RandomWarpingSeries}}.
Efficient Structured Surrogate Loss and Regularization in Structured Prediction
In this dissertation, we focus on several important problems in structured prediction. In structured prediction, the label has a rich intrinsic substructure, and the loss varies with respect to the predicted label and the true label pair. Structured SVM is an extension of binary SVM to adapt to such structured tasks. In the first part of the dissertation, we study the surrogate losses and its efficient methods. To minimize the empirical risk, a surrogate loss which upper bounds the loss, is used as a proxy to minimize the actual loss. Since the objective function is written in terms of the surrogate loss, the choice of the surrogate loss is important, and the performance depends on it. Another issue regarding the surrogate loss is the efficiency of the argmax label inference for the surrogate loss. Efficient inference is necessary for the optimization since it is often the most time-consuming step. We present a new class of surrogate losses named bi-criteria surrogate loss, which is a generalization of the popular surrogate losses. We first investigate an efficient method for a slack rescaling formulation as a starting point utilizing decomposability of the model. Then, we extend the algorithm to the bi-criteria surrogate loss, which is very efficient and also shows performance improvements. In the second part of the dissertation, another important issue of regularization is studied. Specifically, we investigate a problem of regularization in hierarchical classification when a structural imbalance exists in the label structure. We present a method to normalize the structure, as well as a new norm, namely shared Frobenius norm. It is suitable for hierarchical classification that adapts to the data in addition to the label structure.
But How Does It Work in Theory? Linear SVM with Random Features
Gilbert, Anna, Tewari, Ambuj, Sun, Yitong
We prove that, under low noise assumptions, the support vector machine with $N\ll m$ random features (RFSVM) can achieve the learning rate faster than $O(1/\sqrt{m})$ on a training set with $m$ samples when an optimized feature map is used. Our work extends the previous fast rate analysis of random features method from least square loss to 0-1 loss. We also show that the reweighted feature selection method, which approximates the optimized feature map, helps improve the performance of RFSVM in experiments on a synthetic data set.
Does Your Phone Know Your Touch?
Peruzzi, John, Wingard, Phillip Andrew, Zucker, David
In this paper, we consider the problem of distinguishing between authorized and unauthorized touchscreen and smart phone device users by leveraging a learned gesture classification profile combined with gesture anomaly detection. As touchscreen devices become more ubiquitous and the information stored within them becomes increasingly personal and valuable, the incentive and reward for circumventing existing security mechanisms has increased substantially. Given known vulnerabilities in existing biometric and non-biometric authentication methods such as fingerprint scanners, facial recognition, tokens, and pass codes; the development of an effective authentication approach that goes beyond the'something you know / have / are' paradigm is needed. In this paper, we propose models that accurately predict users based on touchscreen gesture patterns and detect anomalies in these patterns as a versatile approach to augment existing security methods and provide a method of continuous authentication. Touchscreen gesture are collected from a set of users from a capacitive sensor array to simulate a smart phone. Features include the pressure measured at the two dimensional (X,Y) coordinates on the sensor for each gesture, velocity at different instances of the gesture, and the duration of the gesture. We then demonstrate how logistic regression, support vector machines (SVM), and multiple Gaussian processes can be used to classify and predict the user creating the gesture. Our intent is to determine the extent to which supervised and unsupervised learning approaches can be successfully leveraged across multiple domains to limit the impact of unauthorized touchscreen device usage, quantify touchscreen security weaknesses and vulnerabilities, and potentially inform touchscreen device security design. Scenarios where our analysis may be useful include high security use cases where continuous authentication is required in the finance, transportation, public safety sectors.
An Efficient ADMM-Based Algorithm to Nonconvex Penalized Support Vector Machines
Guan, Lei, Qiao, Linbo, Li, Dongsheng, Sun, Tao, Ge, Keshi, Lu, Xicheng
Support vector machines (SVMs) with sparsity-inducing nonconvex penalties have received considerable attentions for the characteristics of automatic classification and variable selection. However, it is quite challenging to solve the nonconvex penalized SVMs due to their nondifferentiability, nonsmoothness and nonconvexity. In this paper, we propose an efficient ADMM-based algorithm to the nonconvex penalized SVMs. The proposed algorithm covers a large class of commonly used nonconvex regularization terms including the smooth clipped absolute deviation (SCAD) penalty, minimax concave penalty (MCP), log-sum penalty (LSP) and capped-$\ell_1$ penalty. The computational complexity analysis shows that the proposed algorithm enjoys low computational cost. Moreover, the convergence of the proposed algorithm is guaranteed. Extensive experimental evaluations on five benchmark datasets demonstrate the superior performance of the proposed algorithm to other three state-of-the-art approaches.
One-shot Learning for iEEG Seizure Detection Using End-to-end Binary Operations: Local Binary Patterns with Hyperdimensional Computing
Burrello, Alessio, Schindler, Kaspar, Benini, Luca, Rahimi, Abbas
This paper presents an efficient binarized algorithm for both learning and classification of human epileptic seizures from intracranial electroencephalography (iEEG). The algorithm combines local binary patterns with brain-inspired hyperdimensional computing to enable end-to-end learning and inference with binary operations. The algorithm first transforms iEEG time series from each electrode into local binary pattern codes. Then atomic high-dimensional binary vectors are used to construct composite representations of seizures across all electrodes. For the majority of our patients (10 out of 16), the algorithm quickly learns from one or two seizures (i.e., one-/few-shot learning) and perfectly generalizes on 27 further seizures. For other patients, the algorithm requires three to six seizures for learning. Overall, our algorithm surpasses the state-of-the-art methods for detecting 65 novel seizures with higher specificity and sensitivity, and lower memory footprint.
Large Scale Learning with Kre\u{\i}n Kernels
We extend the Nystr\"om method for low-rank approximation of positive definite Mercer kernels to approximation of indefinite kernel matrices. Our result is the first derivation of the approach that does not require the positive definiteness of the kernel function. Building on this result, we then devise highly scalable methods for learning in reproducing kernel Kre\u{\i}n spaces. The main motivation for our work comes from problems with structured representations (e.g., graphs, strings, time-series), where it is relatively easy to devise a pairwise (dis)similarity function based on intuition/knowledge of a domain expert. Such pairwise functions are typically not positive definite and it is often well beyond the expertise of practitioners to verify this condition. The proposed large scale approaches for learning in reproducing kernel Kre\u{\i}n spaces provide principled and theoretically well-founded means to tackle this class of problems. The effectiveness of the approaches is evaluated empirically using kernels defined on structured and vectorial data representations.
Knowledge Integrated Classifier Design Based on Utility Optimization
This paper proposes a systematic framework to design a classification model that yields a classifier which optimizes a utility function based on prior knowledge. Specifically, as the data size grows, we prove that the produced classifier asymptotically converges to the optimal classifier, an extended version of the Bayes rule, which maximizes the utility function. Therefore, we provide a meaningful theoretical interpretation for modeling with the knowledge incorporated. Our knowledge incorporation method allows domain experts to guide the classifier towards correctly classifying data that they think to be more significant.
An Analysis of Hierarchical Text Classification Using Word Embeddings
Stein, Roger A., Jaques, Patricia A., Valiati, Joao F.
Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not been assessed for the hierarchical text classification (HTC) yet. This study investigates the application of those models and algorithms on this specific problem by means of experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations---fastText, XGBoost, SVM, and Keras' CNN---and noticeable word embeddings generation methods---GloVe, word2vec, and fastText---with publicly available data and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an ${}_{LCA}F_1$ of 0.893 on a single-labeled version of the RCV1 dataset. An analysis indicates that using word embeddings and its flavors is a very promising approach for HTC.
Automated bird sound recognition in realistic settings
Papadopoulos, Timos, Roberts, Stephen J., Willis, Katherine J.
We evaluated the effectiveness of an automated bird sound identification system in a situation that emulates a realistic, typical application. We trained classification algorithms on a crowd-sourced collection of bird audio recording data and restricted our training methods to be completely free of manual intervention. The approach is hence directly applicable to the analysis of multiple species collections, with labelling provided by crowd-sourced collection. We evaluated the performance of the bird sound recognition system on a realistic number of candidate classes, corresponding to real conditions. We investigated the use of two canonical classification methods, chosen due to their widespread use and ease of interpretation, namely a k Nearest Neighbour (kNN) classifier with histogram-based features and a Support Vector Machine (SVM) with time-summarisation features. We further investigated the use of a certainty measure, derived from the output probabilities of the classifiers, to enhance the interpretability and reliability of the class decisions. Our results demonstrate that both identification methods achieved similar performance, but we argue that the use of the kNN classifier offers somewhat more flexibility. Furthermore, we show that employing an outcome certainty measure provides a valuable and consistent indicator of the reliability of classification results. Our use of generic training data and our investigation of probabilistic classification methodologies that can flexibly address the variable number of candidate species/classes that are expected to be encountered in the field, directly contribute to the development of a practical bird sound identification system with potentially global application. Further, we show that certainty measures associated with identification outcomes can significantly contribute to the practical usability of the overall system.