Collaborating Authors


Statistical embedding: Beyond principal components Machine Learning

There has been an intense recent activity in embedding of very high dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. In the first part we cover nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph based methods and kernel based methods. The second part is concerned with topological embedding methods, in particular mapping topological properties into persistence diagrams. Another type of data sets with a tremendous growth is very high-dimensional network data. The task considered in part three is how to embed such data in a vector space of moderate dimension to make the data amenable to traditional techniques such as cluster and classification techniques. The final part of the survey deals with embedding in $\mathbb{R}^2$, which is visualization. Three methods are presented: $t$-SNE, UMAP and LargeVis based on methods in parts one, two and three, respectively. The methods are illustrated and compared on two simulated data sets; one consisting of a triple of noisy Ranunculoid curves, and one consisting of networks of increasing complexity and with two types of nodes.

Tensor decomposition for learning Gaussian mixtures from moments Machine Learning

In data processing and machine learning, an important challenge is to recover and exploit models that can represent accurately the data. We consider the problem of recovering Gaussian mixture models from datasets. We investigate symmetric tensor decomposition methods for tackling this problem, where the tensor is built from empirical moments of the data distribution. We consider identifiable tensors, which have a unique decomposition, showing that moment tensors built from spherical Gaussian mixtures have this property. We prove that symmetric tensors with interpolation degree strictly less than half their order are identifiable and we present an algorithm, based on simple linear algebra operations, to compute their decomposition. Illustrative experimentations show the impact of the tensor decomposition method for recovering Gaussian mixtures, in comparison with other state-of-the-art approaches.

Review of Low-Voltage Load Forecasting: Methods, Applications, and Recommendations Machine Learning

The increased digitalisation and monitoring of the energy system opens up numerous opportunities % and solutions which can help to decarbonise the energy system. Applications on low voltage (LV), localised networks, such as community energy markets and smart storage will facilitate decarbonisation, but they will require advanced control and management. Reliable forecasting will be a necessary component of many of these systems to anticipate key features and uncertainties. Despite this urgent need, there has not yet been an extensive investigation into the current state-of-the-art of low voltage level forecasts, other than at the smart meter level. This paper aims to provide a comprehensive overview of the landscape, current approaches, core applications, challenges and recommendations. Another aim of this paper is to facilitate the continued improvement and advancement in this area. To this end, the paper also surveys some of the most relevant and promising trends. It establishes an open, community-driven list of the known LV level open datasets to encourage further research and development.

Spectral Methods for Data Science: A Statistical Perspective Machine Learning

Spectral methods have emerged as a simple yet surprisingly effective approach for extracting information from massive, noisy and incomplete data. In a nutshell, spectral methods refer to a collection of algorithms built upon the eigenvalues (resp. singular values) and eigenvectors (resp. singular vectors) of some properly designed matrices constructed from data. A diverse array of applications have been found in machine learning, data science, and signal processing. Due to their simplicity and effectiveness, spectral methods are not only used as a stand-alone estimator, but also frequently employed to initialize other more sophisticated algorithms to improve performance. While the studies of spectral methods can be traced back to classical matrix perturbation theory and methods of moments, the past decade has witnessed tremendous theoretical advances in demystifying their efficacy through the lens of statistical modeling, with the aid of non-asymptotic random matrix theory. This monograph aims to present a systematic, comprehensive, yet accessible introduction to spectral methods from a modern statistical perspective, highlighting their algorithmic implications in diverse large-scale applications. In particular, our exposition gravitates around several central questions that span various applications: how to characterize the sample efficiency of spectral methods in reaching a target level of statistical accuracy, and how to assess their stability in the face of random noise, missing data, and adversarial corruptions? In addition to conventional $\ell_2$ perturbation analysis, we present a systematic $\ell_{\infty}$ and $\ell_{2,\infty}$ perturbation theory for eigenspace and singular subspaces, which has only recently become available owing to a powerful "leave-one-out" analysis framework.

Spectral clustering via adaptive layer aggregation for multi-layer networks Machine Learning

One of the fundamental problems in network analysis is detecting community structure in multi-layer networks, of which each layer represents one type of edge information among the nodes. We propose integrative spectral clustering approaches based on effective convex layer aggregations. Our aggregation methods are strongly motivated by a delicate asymptotic analysis of the spectral embedding of weighted adjacency matrices and the downstream $k$-means clustering, in a challenging regime where community detection consistency is impossible. In fact, the methods are shown to estimate the optimal convex aggregation, which minimizes the mis-clustering error under some specialized multi-layer network models. Our analysis further suggests that clustering using Gaussian mixture models is generally superior to the commonly used $k$-means in spectral clustering. Extensive numerical studies demonstrate that our adaptive aggregation techniques, together with Gaussian mixture model clustering, make the new spectral clustering remarkably competitive compared to several popularly used methods.

New Algorithms And Fast Implementations To Approximate Stochastic Processes Machine Learning

We present new algorithms and fast implementations to find efficient approximations for modelling stochastic processes. For many numerical computations it is essential to develop finite approximations for stochastic processes. While the goal is always to find a finite model, which represents a given knowledge about the real data process as accurate as possible, the ways of estimating the discrete approximating model may be quite different: (i) if the stochastic model is known as a solution of a stochastic differential equation, e.g., one may generate the scenario tree directly from the specified model; (ii) if a simulation algorithm is available, which allows simulating trajectories from all conditional distributions, a scenario tree can be generated by stochastic approximation; (iii) if only some observed trajectories of the scenario process are available, the construction of the approximating process can be based on non-parametric conditional density estimates.

A contribution to Optimal Transport on incomparable spaces Machine Learning

Optimal Transport is a theory that allows to define geometrical notions of distance between probability distributions and to find correspondences, relationships, between sets of points. Many machine learning applications are derived from this theory, at the frontier between mathematics and optimization. This thesis proposes to study the complex scenario in which the different data belong to incomparable spaces. In particular we address the following questions: how to define and apply Optimal Transport between graphs, between structured data? How can it be adapted when the data are varied and not embedded in the same metric space? This thesis proposes a set of Optimal Transport tools for these different cases. An important part is notably devoted to the study of the Gromov-Wasserstein distance whose properties allow to define interesting transport problems on incomparable spaces. More broadly, we analyze the mathematical properties of the various proposed tools, we establish algorithmic solutions to compute them and we study their applicability in numerous machine learning scenarii which cover, in particular, classification, simplification, partitioning of structured data, as well as heterogeneous domain adaptation.

Kernel Smoothing, Mean Shift, and Their Learning Theory with Directional Data Machine Learning

Directional data consist of observations distributed on a (hyper)sphere, and appear in many applied fields, such as astronomy, ecology, and environmental science. This paper studies both statistical and computational problems of kernel smoothing for directional data. We generalize the classical mean shift algorithm to directional data, which allows us to identify local modes of the directional kernel density estimator (KDE). The statistical convergence rates of the directional KDE and its derivatives are derived, and the problem of mode estimation is examined. We also prove the ascending property of our directional mean shift algorithm and investigate a general problem of gradient ascent on the unit hypersphere. To demonstrate the applicability of our proposed algorithm, we evaluate it as a mode clustering method on both simulated and real-world datasets.

Unsupervised learning for vascular heterogeneity assessment of glioblastoma based on magnetic resonance imaging: The Hemodynamic Tissue Signature Artificial Intelligence

This thesis focuses on the research and development of the Hemodynamic Tissue Signature (HTS) method: an unsupervised machine learning approach to describe the vascular heterogeneity of glioblastomas by means of perfusion MRI analysis. The HTS builds on the concept of habitats. An habitat is defined as a sub-region of the lesion with a particular MRI profile describing a specific physiological behavior. The HTS method delineates four habitats within the glioblastoma: the High Angiogenic Tumor (HAT) habitat, as the most perfused region of the enhancing tumor; the Low Angiogenic Tumor (LAT) habitat, as the region of the enhancing tumor with a lower angiogenic profile; the potentially Infiltrated Peripheral Edema (IPE) habitat, as the non-enhancing region adjacent to the tumor with elevated perfusion indexes; and the Vasogenic Peripheral Edema (VPE) habitat, as the remaining edema of the lesion with the lowest perfusion profile. The results of this thesis have been published in ten scientific contributions, including top-ranked journals and conferences in the areas of Medical Informatics, Statistics and Probability, Radiology & Nuclear Medicine, Machine Learning and Data Mining and Biomedical Engineering. An industrial patent registered in Spain (ES201431289A), Europe (EP3190542A1) and EEUU (US20170287133A1) was also issued, summarizing the efforts of the thesis to generate tangible assets besides the academic revenue obtained from research publications. Finally, the methods, technologies and original ideas conceived in this thesis led to the foundation of ONCOANALYTICS CDX, a company framed into the business model of companion diagnostics for pharmaceutical compounds, thought as a vehicle to facilitate the industrialization of the ONCOhabitats technology.

InClass Nets: Independent Classifier Networks for Nonparametric Estimation of Conditional Independence Mixture Models and Unsupervised Classification Machine Learning

We introduce a new machine-learning-based approach, which we call the Independent Classifier networks (InClass nets) technique, for the nonparameteric estimation of conditional independence mixture models (CIMMs). We approach the estimation of a CIMM as a multi-class classification problem, since dividing the dataset into different categories naturally leads to the estimation of the mixture model. InClass nets consist of multiple independent classifier neural networks (NNs), each of which handles one of the variates of the CIMM. Fitting the CIMM to the data is performed by simultaneously training the individual NNs using suitable cost functions. The ability of NNs to approximate arbitrary functions makes our technique nonparametric. Further leveraging the power of NNs, we allow the conditionally independent variates of the model to be individually high-dimensional, which is the main advantage of our technique over existing non-machine-learning-based approaches. We derive some new results on the nonparametric identifiability of bivariate CIMMs, in the form of a necessary and a (different) sufficient condition for a bivariate CIMM to be identifiable. We provide a public implementation of InClass nets as a Python package called RainDancesVI and validate our InClass nets technique with several worked out examples. Our method also has applications in unsupervised and semi-supervised classification problems.