 Regression


Synergy between Observation Systems Oceanic in Turbulent Regions

arXiv.org Artificial Intelligence

Ocean dynamics are a source of uncertainty in determining the ocean's role in complex climatic phenomena. Current observation systems have difficulty achieving sufficient statistical precision for three-dimensional oceanic data, which is crucial for describing the behavior of internal ocean structures. We present a data-driven approach that explores latent class regression and deep neural networks for modeling ocean dynamics in the extensions of the Gulf Stream and Kuroshio currents. The results show that data-driven methods are a promising direction for understanding the ocean's characteristics (salinity, temperature) in both spatial and temporal dimensions in these turbulent regions. Our source code is publicly available at https://github.com/v18nguye/gulfstream-lrm and at https://github.com/sagudelor/Kuroshio.


Random Planted Forest: a directly interpretable tree ensemble

arXiv.org Machine Learning

We introduce a novel interpretable, tree-based algorithm for prediction in a regression setting, in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional ANOVA decomposition perspective, where each tree corresponds to a function within that decomposition. Therefore, planted trees are limited in the number of interaction terms. The maximal order of approximation in the ANOVA decomposition can be specified or left unlimited. If a first-order approximation is chosen, the result is an additive model. At the other extreme, if the order of approximation is not limited, the resulting model puts no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in the case of an underlying additive model. We show that in the additive case, the idealised version achieves, up to a logarithmic factor, asymptotically optimal one-dimensional convergence rates of order $n^{-2/5}$.
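For orientation, a functional ANOVA decomposition can be written as below. The notation ($m$ for the regression function, $d$ for the number of features, $m_k$ and $m_{kl}$ for the component functions) is assumed here for illustration and is not taken from the abstract.

```latex
% Functional ANOVA decomposition of a regression function m (notation assumed)
m(x) = m_0 + \sum_{k=1}^{d} m_k(x_k) + \sum_{k<l} m_{kl}(x_k, x_l) + \cdots + m_{1 \cdots d}(x_1, \dots, x_d)

% First-order approximation: only main effects are kept, giving an additive model
m(x) = m_0 + \sum_{k=1}^{d} m_k(x_k)
```

Limiting the maximal order of approximation corresponds to truncating the first sum after the terms of that order.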


Multivariate Smoothing via the Fourier Integral Theorem and Fourier Kernel

arXiv.org Machine Learning

Rates of convergence are established and, in many cases, provide superior rates to current standard estimators such as those based on kernels, including kernel density estimators and kernel regression functions.


Adaptive Graph-based Generalized Regression Model for Unsupervised Feature Selection

arXiv.org Machine Learning

Unsupervised feature selection is an important method for reducing the dimensionality of high-dimensional data without labels. It helps avoid the "curse of dimensionality" and improves the performance of subsequent machine learning tasks such as clustering and retrieval. The key problem of unsupervised feature selection is how to select features that are both uncorrelated and discriminative. Many proposed methods select features that are strongly discriminative but highly redundant, or vice versa, and thus satisfy only one of these two criteria. Other existing methods choose discriminative features with low redundancy by constructing the graph matrix on the original feature space. Since the original feature space usually contains redundancy and noise, this degrades the performance of feature selection. To address these issues, we first present a novel generalized regression model with an uncorrelated constraint and $\ell_{2,1}$-norm regularization. It simultaneously selects uncorrelated and discriminative features and reduces the variance of data points belonging to the same neighborhood, which helps the clustering task. Furthermore, the local intrinsic structure of the data is constructed on the reduced-dimensional space by learning the similarity-induced graph adaptively. The learning of the graph structure and of the indicator matrix based on spectral analysis is then integrated into the generalized regression model. Finally, we develop an alternating iterative optimization algorithm to solve the objective function. A series of experiments is carried out on nine real-world data sets to demonstrate the effectiveness of the proposed method in comparison with other competing approaches.
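As a small illustration of the $\ell_{2,1}$-norm mentioned above, the sketch below computes it under the usual row-wise convention (the sum of the Euclidean norms of the rows of a matrix). The matrix `W`, the helper names, and the idea of ranking features by row norm are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of W (W is a list of rows)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """l2,1-norm under the row-wise convention: sum of the row norms."""
    return sum(row_norms(W))


def rank_features(W):
    """Feature indices sorted by decreasing row norm (hypothetical selection rule)."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)


# Hypothetical projection matrix: one row per feature, one column per latent dimension.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]

norm_value = l21_norm(W)
ranking = rank_features(W)
```

Because the penalty sums unsquared row norms, it tends to drive entire rows to zero, which is why it is often paired with feature selection: a feature whose row vanishes no longer contributes to the model.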


Social media data reveals signal for public consumer perceptions

arXiv.org Artificial Intelligence

Researchers have used social media data to estimate various macroeconomic indicators about public behaviors, mostly as a way to reduce surveying costs. One of the most widely cited economic indicators is the consumer confidence index (CCI). Numerous past studies have focused on using social media, especially Twitter data, to predict CCI. However, according to a recent comprehensive survey, the strong correlations disappeared when those models were tested with newer data. In this work, we revisit the problem of assessing the true potential of social media data for measuring CCI by proposing a robust non-parametric Bayesian modeling framework grounded in Gaussian Process Regression (which provides both an estimate and its associated uncertainty). Integral to our framework is a principled experimentation methodology that demonstrates how digital data can be employed to reduce the frequency of surveys, so that periodic polling would be needed only to calibrate our model. Through extensive experimentation, we show how micro-decisions, such as the smoothing interval and various types of lags, have an important bearing on the results. Using decadal data (2008-2019) from Reddit, we show that both monthly and daily values of CCI can indeed be reliably estimated at least several months in advance, and that our model estimates are far superior to those generated by existing methods.


Learning Structures in Earth Observation Data with Gaussian Processes

arXiv.org Machine Learning

Gaussian Processes (GPs) have experienced tremendous success in geoscience in general, and in bio-geophysical parameter retrieval in particular, in recent years. GPs constitute a solid Bayesian framework to formulate many function approximation problems consistently. This paper reviews the main theoretical GP developments in the field. We review new algorithms that respect the signal and noise characteristics, that provide feature rankings automatically, and that allow applicability of associated uncertainty intervals to transport GP models in space and time. All these developments are illustrated in the field of geoscience and remote sensing at local and global scales through a set of illustrative examples.


Global Models for Time Series Forecasting: A Simulation Study

arXiv.org Machine Learning

In the current context of Big Data, the nature of many forecasting problems has changed from predicting isolated time series to predicting many time series from similar sources. This has opened up the opportunity to develop competitive global forecasting models that simultaneously learn from many time series. However, it remains unclear when global forecasting models can outperform univariate benchmarks, especially along the dimensions of the homogeneity/heterogeneity of series, the complexity of patterns in the series, the complexity of forecasting models, and the lengths/number of series. Our study attempts to address this problem by investigating the effect of these factors, simulating a number of datasets that have controllable time series characteristics. Specifically, we simulate time series from simple data generating processes (DGPs), such as Auto Regressive (AR) and Seasonal AR, to complex DGPs, such as Chaotic Logistic Map, Self-Exciting Threshold Auto-Regressive, and Mackey-Glass Equations. Data heterogeneity is introduced by mixing time series generated from several DGPs into a single dataset. The lengths and the number of series in the dataset are varied in different scenarios. We perform experiments on these datasets using global forecasting models including Recurrent Neural Networks (RNN), Feed-Forward Neural Networks, Pooled Regression (PR) models and Light Gradient Boosting Models (LGBM), and compare their performance against standard statistical univariate forecasting techniques. Our experiments demonstrate that when trained as global forecasting models, techniques such as RNNs and LGBMs, which have complex non-linear modelling capabilities, are competitive methods in general under challenging forecasting scenarios such as short series, datasets with heterogeneous series, and minimal prior knowledge of the patterns of the series.
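The simulation setup described above can be sketched roughly as follows: one simple DGP (an AR(1) process), one complex DGP (the logistic map), and a dataset that mixes the two to introduce heterogeneity. All function names and parameter values (`phi`, `sigma`, `r`, the starting points, the alternating mixing rule) are assumptions chosen for illustration, not the settings used in the study.

```python
import random


def simulate_ar1(n, phi=0.7, sigma=1.0, seed=0):
    """AR(1) series: x_t = phi * x_{t-1} + Gaussian noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x


def simulate_logistic_map(n, r=3.9, x0=0.2):
    """Logistic map: x_t = r * x_{t-1} * (1 - x_{t-1}); r = 3.9 is a value commonly used for chaotic behavior."""
    x = [x0]
    for _ in range(n - 1):
        x.append(r * x[-1] * (1.0 - x[-1]))
    return x


def make_heterogeneous_dataset(n_series, length, seed=0):
    """Alternate between the two DGPs so the dataset mixes series of different kinds."""
    dataset = []
    for i in range(n_series):
        if i % 2 == 0:
            dataset.append(simulate_ar1(length, seed=seed + i))
        else:
            dataset.append(simulate_logistic_map(length, x0=0.1 + 0.01 * i))
    return dataset


# Vary these two arguments to change the number and the lengths of the series.
dataset = make_heterogeneous_dataset(n_series=4, length=50)
```

A homogeneous dataset would draw every series from a single DGP; changing `n_series` and `length` covers the remaining two dimensions the study varies.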


Can I Still Trust You?: Understanding the Impact of Distribution Shifts on Algorithmic Recourses

arXiv.org Artificial Intelligence

As predictive models are increasingly deployed to make a variety of consequential decisions, ranging from hiring decisions to loan approvals, there is growing emphasis on designing algorithms that can provide reliable recourses to affected individuals. To this end, several recourse generation algorithms have been proposed in recent literature. However, there is little to no work on systematically assessing whether these algorithms actually generate reliable recourses. In this work, we assess the reliability of algorithmic recourses through the lens of distribution shifts, i.e., we empirically and theoretically study whether, and which kinds of, recourses generated by state-of-the-art algorithms are robust to distribution shifts. To the best of our knowledge, this work makes the first attempt at addressing this critical question. We experiment with multiple synthetic and real-world datasets capturing different kinds of distribution shifts, including temporal shifts, geospatial shifts, and shifts due to data corrections. Our results demonstrate that all of these distribution shifts could potentially invalidate the recourses generated by state-of-the-art algorithms. In addition, we find that recourse interventions themselves may cause distribution shifts, which in turn invalidate previously prescribed recourses. Our theoretical results establish that recourses (counterfactuals) that are close to the model decision boundary are more likely to be invalidated when the model is updated. However, state-of-the-art algorithms tend to prefer exactly these recourses because their cost functions penalize recourses (counterfactuals) that require large modifications to the original instance. Our findings not only expose fundamental flaws in recourse finding strategies but also pave the way for rethinking the design and development of recourse generation algorithms.
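The point about recourses near the decision boundary can be illustrated with a toy linear classifier. This is a minimal sketch under stated assumptions: the weights, the instance `x`, the two margins, and the "updated" model are all hypothetical values, and the recourse rule (moving along the weight vector) is simply the minimal Euclidean change for a linear model, not any specific algorithm from the paper.

```python
def score(w, b, x):
    """Linear decision score; a positive score means a favorable decision."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def recourse(w, b, x, margin):
    """Move x along w until its score equals `margin` (minimal L2 change for a linear model)."""
    norm_sq = sum(wi * wi for wi in w)
    step = (margin - score(w, b, x)) / norm_sq
    return [xi + step * wi for xi, wi in zip(x, w)]


# Original model and a rejected individual (score is -1).
w, b = [1.0, 1.0], -2.0
x = [0.5, 0.5]

# Hypothetical updated model after a distribution shift: the threshold moved slightly.
w_new, b_new = [1.0, 1.0], -2.3

cheap = recourse(w, b, x, margin=0.1)   # low cost, lands just past the boundary
robust = recourse(w, b, x, margin=0.5)  # higher cost, lands further from the boundary

cheap_still_valid = score(w_new, b_new, cheap) > 0
robust_still_valid = score(w_new, b_new, robust) > 0
```

Both recourses are valid under the original model, but only the costlier one survives the update, which mirrors the tension the abstract describes between low-cost recourses and robustness.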


Ordered Counterfactual Explanation by Mixed-Integer Linear Optimization

arXiv.org Machine Learning

Post-hoc explanation methods for machine learning models have been widely used to support decision-making. One of the popular methods is Counterfactual Explanation (CE), which provides a user with a perturbation vector of features that alters the prediction result. Given a perturbation vector, a user can interpret it as an "action" for obtaining one's desired decision result. In practice, however, showing only a perturbation vector is often insufficient for users to execute the action. The reason is that if there is an asymmetric interaction among features, such as causality, the total cost of the action is expected to depend on the order of changing features. Therefore, practical CE methods are required to provide an appropriate order of changing features in addition to a perturbation vector. For this purpose, we propose a new framework called Ordered Counterfactual Explanation (OrdCE). We introduce a new objective function that evaluates a pair of an action and an order based on feature interaction. To extract an optimal pair, we propose a mixed-integer linear optimization approach with our objective function. Numerical experiments on real datasets demonstrated the effectiveness of our OrdCE in comparison with unordered CE methods.


(Decision and regression) tree ensemble based kernels for regression and classification

arXiv.org Machine Learning

Tree-based ensembles such as Breiman's random forest (RF) and Gradient Boosted Trees (GBT) can be interpreted as implicit kernel generators, where the ensuing proximity matrix represents the data-driven tree ensemble kernel. The kernel perspective on the RF has been used to develop a principled framework for theoretical investigation of its statistical properties. Recently, it has been shown that the kernel interpretation is germane to other tree-based ensembles, e.g. GBTs. However, the practical utility of the links between kernels and tree ensembles has not been widely explored or systematically evaluated. The focus of our work is the investigation of the interplay between kernel methods and tree-based ensembles, including the RF and GBT. We elucidate the performance and properties of the RF- and GBT-based kernels in a comprehensive simulation study comprising continuous and binary targets. We show that for continuous targets, the RF/GBT kernels are competitive with their respective ensembles in higher-dimensional scenarios, particularly in cases with a larger number of noisy features. For binary targets, the RF/GBT kernels and their respective ensembles exhibit comparable performance. We provide results from real-life data sets for regression and classification to show how these insights may be leveraged in practice. Overall, our results support tree ensemble based kernels as a valuable addition to the practitioner's toolbox. Finally, we discuss extensions of tree ensemble based kernels to survival targets and to interpretable prototype and landmarking classification and regression. We outline a future line of research on kernels furnished by Bayesian counterparts of the frequentist tree ensembles.
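One common way to read a proximity matrix off a tree ensemble is to count, for each pair of samples, the fraction of trees in which both fall into the same leaf. The sketch below implements that definition on a hand-written table of leaf indices; the table `leaves` and the function name are hypothetical, and the abstract does not state which proximity variant the authors use. In practice the leaf indices would come from a fitted ensemble (for example, scikit-learn forests expose an `apply` method that returns them).

```python
def proximity_kernel(leaves):
    """Kernel matrix K where K[i][j] is the fraction of trees in which
    samples i and j land in the same leaf.

    `leaves[i][t]` is the leaf index of sample i in tree t.
    """
    n_samples = len(leaves)
    n_trees = len(leaves[0])
    K = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            K[i][j] = shared / n_trees
    return K


# Hypothetical leaf assignments: 3 samples, 3 trees.
leaves = [
    [0, 2, 1],
    [0, 2, 3],
    [4, 5, 3],
]

K = proximity_kernel(leaves)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed to any method that accepts a precomputed kernel.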