Estimating the joint probability mass function (PMF) of a set of random variables lies at the heart of statistical learning and signal processing. Without structural assumptions, such as modeling the variables as a Markov chain, tree, or other graphical model, joint PMF estimation is often considered mission impossible: the number of unknowns grows exponentially with the number of variables. But who gives us the structural model? Is there a generic, 'non-parametric' way to control joint PMF complexity without relying on a priori structural assumptions regarding the underlying probability model? Is it possible to discover the operational structure without biasing the analysis up front? If we only observe random subsets of the variables, can we still reliably estimate the joint PMF of all of them? This paper shows, perhaps surprisingly, that if the joint PMF of any three variables can be estimated, then the joint PMF of all the variables can be provably recovered under relatively mild conditions. The result is reminiscent of Kolmogorov's extension theorem: consistent specification of lower-order distributions induces a unique probability measure for the entire process. The difference is that for processes of limited complexity (measured by the rank of the high-order PMF) it is possible to obtain a complete characterization from only third-order distributions. In fact, not all third-order PMFs are needed, and under more stringent conditions even second-order PMFs will do. Exploiting multilinear (tensor) algebra, this paper proves that such higher-order PMF completion can be guaranteed, and several pertinent identifiability results are derived. It also provides a practical and efficient algorithm to carry out the recovery task. Judiciously designed simulations and real-data experiments on movie recommendation and data classification are presented to showcase the effectiveness of the approach.
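The link between low-order marginals and a low-rank joint PMF can be checked numerically. The sketch below assumes a rank-F model in which the joint PMF is a mixture of F rank-one terms with column-stochastic factor matrices; the sizes N, I, F and all variable names are hypothetical and chosen only for illustration. It verifies that a three-variable marginal is a third-order tensor built from the same factors. It does not implement the recovery algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, I, F = 4, 3, 2  # hypothetical: 4 variables, 3 states each, rank 2

# Mixture weights (sum to 1) and one I x F factor per variable,
# with every column a valid conditional PMF (columns sum to 1).
lam = rng.dirichlet(np.ones(F))
A = [rng.dirichlet(np.ones(I), size=F).T for _ in range(N)]

# Full joint PMF: P[a,b,c,d] = sum_f lam[f] * A0[a,f] * A1[b,f] * A2[c,f] * A3[d,f]
joint = np.einsum("f,af,bf,cf,df->abcd", lam, *A)

# Marginalize out the fourth variable.
marginal_012 = joint.sum(axis=3)

# The same third-order tensor written directly from the shared factors.
model_012 = np.einsum("f,af,bf,cf->abc", lam, A[0], A[1], A[2])

max_gap = float(np.abs(marginal_012 - model_012).max())
print("largest entrywise gap:", max_gap)
```

Because each factor column sums to one, summing over a variable simply drops its factor, which is why every triple of variables shares factors with the full joint PMF under this assumed model.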
Tensor completion is the problem of filling in the missing or unobserved entries of partially observed tensors. Because tensors naturally describe complex multidimensional datasets, tensor completion algorithms and their applications have received wide attention in data mining, computer vision, signal processing, neuroscience, and other fields. In this survey, we provide a modern overview of recent advances in tensor completion algorithms from the perspective of big data analytics, which is characterized by diverse variety, large volume, and high velocity. To aid comprehension and comparison of the many existing advances, we categorize them into four groups: general tensor completion algorithms, tensor completion with auxiliary information (variety), scalable tensor completion algorithms (volume), and dynamic tensor completion algorithms (velocity). We also introduce their applications to real-world data-driven problems and present an open-source package covering several widely used tensor decomposition and completion algorithms. Our goal is to summarize these popular methods for researchers, to promote research in this field, and to provide a usable repository for practitioners. Finally, we discuss some challenges and promising research directions for future exploration.
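As a toy illustration of why low-rank structure makes completion possible, the sketch below assumes an exactly rank-one 2 x 2 x 2 tensor with positive entries and one hidden entry. The helper name fill_rank_one_corner is hypothetical, and the closed-form rule applies only to this rank-one special case; it is not one of the surveyed algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-one tensor T[i,j,k] = a[i] * b[j] * c[k] with strictly positive factors.
a, b, c = (rng.uniform(0.5, 2.0, size=2) for _ in range(3))
full = np.einsum("i,j,k->ijk", a, b, c)

# Hide one entry to mimic a partially observed tensor.
observed = full.copy()
observed[1, 1, 1] = np.nan

def fill_rank_one_corner(t):
    """Recover t[1,1,1] of a rank-one tensor from four observed entries.

    t[1,0,0] * t[0,1,0] * t[0,0,1] equals t[1,1,1] * t[0,0,0]**2
    when t is exactly rank one, so dividing by t[0,0,0]**2 gives the answer.
    """
    return t[1, 0, 0] * t[0, 1, 0] * t[0, 0, 1] / t[0, 0, 0] ** 2

estimate = fill_rank_one_corner(observed)
print("recovered:", estimate, "true:", full[1, 1, 1])
```

Practical completion methods replace this closed-form rule with an optimization over low-rank factors fitted to the observed entries, which also copes with noise and higher ranks.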
There has recently been considerable interest in completing a low-rank matrix or tensor given only a small fraction (or a few linear combinations) of its entries. Related approaches have found considerable success in recommender systems within machine learning. From a statistical estimation point of view, the gold standard is to have access to the joint probability distribution of all pertinent random variables, from which any desired optimal estimator can be readily derived. In practice, high-dimensional joint distributions are very hard to estimate, and only estimates of low-dimensional projections may be available. We show that it is possible to identify higher-order joint PMFs from lower-order marginalized PMFs using coupled low-rank tensor factorization. Our approach features guaranteed identifiability when the full joint PMF is of low-enough rank, and effective approximation otherwise. We provide an algorithmic approach to compute the sought factors, and illustrate the merits of our approach using rating prediction as an example.
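To make concrete how an estimator follows once a joint PMF is in hand, the sketch below assumes a hypothetical three-item rating setting with ratings 1 to 5 and a randomly generated joint PMF standing in for an estimated one. The function name predict_third is illustrative. It conditions on two observed ratings and returns the conditional PMF, the conditional-mean (minimum mean-square error) prediction, and the most probable rating.

```python
import numpy as np

ratings = np.arange(1, 6)  # possible rating values 1..5

# Stand-in for an estimated joint PMF over three items (random, for illustration only).
rng = np.random.default_rng(1)
joint = rng.random((5, 5, 5))
joint /= joint.sum()

def predict_third(joint, x0, x1):
    """Predict item 3's rating given zero-based rating indices x0, x1 of items 1 and 2."""
    fiber = joint[x0, x1, :]           # unnormalized P(X3 | X1 = x0, X2 = x1)
    cond = fiber / fiber.sum()         # conditional PMF
    mmse = float(ratings @ cond)       # conditional mean
    map_rating = int(ratings[np.argmax(cond)])  # most probable rating
    return cond, mmse, map_rating

# Items 1 and 2 were rated 5 and 3 (indices 4 and 2).
cond, mmse, map_rating = predict_third(joint, 4, 2)
print("conditional PMF:", cond)
print("conditional-mean prediction:", mmse, "most probable rating:", map_rating)
```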
Directed networks are pervasive in both natural and engineered systems, often underlying the complex behavior observed in biological systems, microblogs and social interactions over the web, as well as global financial markets. Since their structures are often unobservable, network analytics generally relies on approaches that use measurable nodal processes to infer the unknown topology. Structural equation models (SEMs) are capable of incorporating exogenous inputs to resolve inherent directional ambiguities. However, conventional SEMs assume full knowledge of exogenous inputs, which may not be readily available in some practical settings. The present paper advocates a novel SEM-based topology inference approach that entails factorization of a three-way tensor, constructed from the observed nodal data, using the well-known parallel factor (PARAFAC) decomposition. It turns out that second-order piecewise stationary statistics of exogenous variables suffice to identify the hidden topology. Capitalizing on the uniqueness properties inherent to high-order tensor factorizations, it is shown that topology identification is possible under reasonably mild conditions. In addition, to facilitate real-time operation and inference of time-varying networks, an adaptive PARAFAC tensor decomposition scheme that tracks the topology-revealing tensor factors is developed. Extensive tests on simulated and real stock quote data demonstrate the merits of the novel tensor-based approach.
Streaming tensor factorization is a powerful tool for processing high-volume and multi-way temporal data in Internet networks, recommender systems and image/video data analysis. Existing streaming tensor factorization algorithms rely on least-squares data fitting and do not possess a mechanism for tensor rank determination. This leaves them susceptible to outliers and vulnerable to over-fitting. This paper presents a Bayesian robust streaming tensor factorization model to identify sparse outliers, automatically determine the underlying tensor rank, and accurately fit the low-rank structure. We implement our model in MATLAB and compare it with existing algorithms on tensor datasets generated from dynamic MRI and Internet traffic.