Statistical Learning
Model-based functional mixture discriminant analysis with hidden process regression for curve classification
Chamroukhi, Faicel, Glotin, Hervé, Samé, Allou
In this paper, we study the modeling and the classification of functional data presenting regime changes over time. We propose a new model-based functional mixture discriminant analysis approach based on a specific hidden process regression model that governs the regime changes over time. Our approach is particularly adapted to handle the problem of complex-shaped classes of curves, where each class is potentially composed of several sub-classes, and to deal with the regime changes within each homogeneous sub-class. The proposed model explicitly integrates the heterogeneity of each class of curves via a mixture model formulation, and the regime changes within each sub-class through a hidden logistic process. Each class of complex-shaped curves is modeled by a finite number of homogeneous clusters, each of them being decomposed into several regimes. The model parameters of each class are learned by maximizing the observed-data log-likelihood by using a dedicated expectation-maximization (EM) algorithm. Comparisons are performed with alternative curve classification approaches, including functional linear discriminant analysis and functional mixture discriminant analysis with polynomial regression mixtures and spline regression mixtures. Results obtained on simulated data and real data show that the proposed approach outperforms the alternative approaches in terms of discrimination, and significantly improves the curves approximation.
Joint segmentation of multivariate time series with hidden process regression for human activity recognition
Chamroukhi, Faicel, Mohammed, Samer, Trabelsi, Dorra, Oukhellou, Latifa, Amirat, Yacine
The problem of human activity recognition is central for understanding and predicting the human behavior, in particular in a prospective of assistive services to humans, such as health monitoring, well being, security, etc. There is therefore a growing need to build accurate models which can take into account the variability of the human activities over time (dynamic models) rather than static ones which can have some limitations in such a dynamic context. In this paper, the problem of activity recognition is analyzed through the segmentation of the multidimensional time series of the acceleration data measured in the 3-d space using body-worn accelerometers. The proposed model for automatic temporal segmentation is a specific statistical latent process model which assumes that the observed acceleration sequence is governed by sequence of hidden (unobserved) activities. More specifically, the proposed approach is based on a specific multiple regression model incorporating a hidden discrete logistic process which governs the switching from one activity to another over time. The model is learned in an unsupervised context by maximizing the observed-data log-likelihood via a dedicated expectation-maximization (EM) algorithm. We applied it on a real-world automatic human activity recognition problem and its performance was assessed by performing comparisons with alternative approaches, including well-known supervised static classifiers and the standard hidden Markov model (HMM). The obtained results are very encouraging and show that the proposed approach is quite competitive even it works in an entirely unsupervised way and does not requires a feature extraction preprocessing step.
Clustering for high-dimension, low-sample size data using distance vectors
In high-dimension, low-sample size (HDLSS) data, it is not always true that closeness of two objects reflects a hidden cluster structure. We point out the important fact that it is not the closeness, but the "values" of distance that contain information of the cluster structure in high-dimensional space. Based on this fact, we propose an efficient and simple clustering approach, called distance vector clustering, for HDLSS data. Under the assumptions given in the work of Hall et al. (2005), we show the proposed approach provides a true cluster label under milder conditions when the dimension tends to infinity with the sample size fixed. The effectiveness of the distance vector clustering approach is illustrated through a numerical experiment and real data analysis.
Robust EM algorithm for model-based curve clustering
Model-based clustering approaches concern the paradigm of exploratory data analysis relying on the finite mixture model to automatically find a latent structure governing observed data. They are one of the most popular and successful approaches in cluster analysis. The mixture density estimation is generally performed by maximizing the observed-data log-likelihood by using the expectation-maximization (EM) algorithm. However, it is well-known that the EM algorithm initialization is crucial. In addition, the standard EM algorithm requires the number of clusters to be known a priori. Some solutions have been provided in [31, 12] for model-based clustering with Gaussian mixture models for multivariate data. In this paper we focus on model-based curve clustering approaches, when the data are curves rather than vectorial data, based on regression mixtures. We propose a new robust EM algorithm for clustering curves. We extend the model-based clustering approach presented in [31] for Gaussian mixture models, to the case of curve clustering by regression mixtures, including polynomial regression mixtures as well as spline or B-spline regressions mixtures. Our approach both handles the problem of initialization and the one of choosing the optimal number of clusters as the EM learning proceeds, rather than in a two-fold scheme. This is achieved by optimizing a penalized log-likelihood criterion. A simulation study confirms the potential benefit of the proposed algorithm in terms of robustness regarding initialization and funding the actual number of clusters.
The Influence of Global Constraints on Similarity Measures for Time-Series Databases
Kurbalija, Vladimir, Radovanović, Miloš, Geler, Zoltan, Ivanović, Mirjana
A time series consists of a series of values or events obtained over repeated measurements in time. Analysis of time series represents and important tool in many application areas, such as stock market analysis, process and quality control, observation of natural phenomena, medical treatments, etc. A vital component in many types of time-series analysis is the choice of an appropriate distance/similarity measure. Numerous measures have been proposed to date, with the most successful ones based on dynamic programming. Being of quadratic time complexity, however, global constraints are often employed to limit the search space in the matrix during the dynamic programming procedure, in order to speed up computation. Furthermore, it has been reported that such constrained measures can also achieve better accuracy. In this paper, we investigate two representative time-series distance/similarity measures based on dynamic programming, Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), and the effects of global constraints on them. Through extensive experiments on a large number of time-series data sets, we demonstrate how global constrains can significantly reduce the computation time of DTW and LCS. We also show that, if the constraint parameter is tight enough (less than 10-15% of time-series length), the constrained measure becomes significantly different from its unconstrained counterpart, in the sense of producing qualitatively different 1-nearest neighbor graphs. This observation explains the potential for accuracy gains when using constrained measures, highlighting the need for careful tuning of constraint parameters in order to achieve a good trade-off between speed and accuracy.
Outlier robust system identification: a Bayesian kernel-based approach
Bottegal, Giulio, Aravkin, Aleksandr Y., Hjalmarsson, Hakan, Pillonetto, Gianluigi
In this paper, we propose an outlier-robust regularized kernel-based method for linear system identification. The unknown impulse response is modeled as a zero-mean Gaussian process whose covariance (kernel) is given by the recently proposed stable spline kernel, which encodes information on regularity and exponential stability. To build robustness to outliers, we model the measurement noise as realizations of independent Laplacian random variables. The identification problem is cast in a Bayesian framework, and solved by a new Markov Chain Monte Carlo (MCMC) scheme. In particular, exploiting the representation of the Laplacian random variables as scale mixtures of Gaussians, we design a Gibbs sampler which quickly converges to the target distribution. Numerical simulations show a substantial improvement in the accuracy of the estimates over state-of-the-art kernel-based methods.
A central limit theorem for scaled eigenvectors of random dot product graphs
Athreya, Avanti, Lyzinski, Vince, Marchette, David J., Priebe, Carey E., Sussman, Daniel L., Tang, Minh
Spectral analysis of the adjacency and Laplacian matrices for graphs is of both theoretical (Chung, 1997) and practical (Luxburg, 2007) significance. For instance, the spectrum can be used to characterize the number of connected components in a graph and various properties of random walks on graphs, and the eigenvector corresponding to the second smallest eigenvalue of the Laplacian is used in the solution to a relaxed version of the min-cut problem (Fiedler, 1973). In our current work, we investigate the second-order properties of the eigenvectors corresponding to the largest eigenvalues of the adjacency matrix of a random graph. In particular, we show that under the random dot product graph model (Young and Scheinerman, 2007), the components of the eigenvectors are asymptotically normal and centered around the true latent positions (see Section 4).
Using Latent Binary Variables for Online Reconstruction of Large Scale Systems
Martin, Victorin, Lasgouttes, Jean-Marc, Furtlehner, Cyril
We propose a probabilistic graphical model realizing a minimal encoding of real variables dependencies based on possibly incomplete observation and an empirical cumulative distribution function per variable. The target application is a large scale partially observed system, like e.g. a traffic network, where a small proportion of real valued variables are observed, and the other variables have to be predicted. Our design objective is therefore to have good scalability in a real-time setting. Instead of attempting to encode the dependencies of the system directly in the description space, we propose a way to encode them in a latent space of binary variables, reflecting a rough perception of the observable (congested/non-congested for a traffic road). The method relies in part on message passing algorithms, i.e. belief propagation, but the core of the work concerns the definition of meaningful latent variables associated to the variables of interest and their pairwise dependencies. Numerical experiments demonstrate the applicability of the method in practice.
Covariance Estimation in High Dimensions via Kronecker Product Expansions
Tsiligkaridis, Theodoros, Hero, Alfred O. III
This paper presents a new method for estimating high dimensional covariance matrices. The method, permuted rank-penalized least-squares (PRLS), is based on a Kronecker product series expansion of the true covariance matrix. Assuming an i.i.d. Gaussian random sample, we establish high dimensional rates of convergence to the true covariance as both the number of samples and the number of variables go to infinity. For covariance matrices of low separation rank, our results establish that PRLS has significantly faster convergence than the standard sample covariance matrix (SCM) estimator. The convergence rate captures a fundamental tradeoff between estimation error and approximation error, thus providing a scalable covariance estimation framework in terms of separation rank, similar to low rank approximation of covariance matrices. The MSE convergence rates generalize the high dimensional rates recently obtained for the ML Flip-flop algorithm for Kronecker product covariance estimation. We show that a class of block Toeplitz covariance matrices is approximatable by low separation rank and give bounds on the minimal separation rank $r$ that ensures a given level of bias. Simulations are presented to validate the theoretical bounds. As a real world application, we illustrate the utility of the proposed Kronecker covariance estimator for spatio-temporal linear least squares prediction of multivariate wind speed measurements.
Parallel architectures for fuzzy triadic similarity learning
Alouane-Ksouri, Sonia, Sassi-Hidri, Minyar, Barkaoui, Kamel
In a context of document co-clustering, we define a new similarity measure which iteratively computes similarity while combining fuzzy sets in a three-partite graph. The fuzzy triadic similarity (FT-Sim) model can deal with uncertainty offers by the fuzzy sets. Moreover, with the development of the Web and the high availability of storage spaces, more and more documents become accessible. Documents can be provided from multiple sites and make similarity computation an expensive processing. This problem motivated us to use parallel computing. In this paper, we introduce parallel architectures which are able to treat large and multi-source data sets by a sequential, a merging or a splitting-based process. Then, we proceed to a local and a central (or global) computing using the basic FT-Sim measure. The idea behind these architectures is to reduce both time and space complexities thanks to parallel computation.