Goto

Collaborating Authors

 Réda, Clémence


Fast Iterative and Task-Specific Imputation with Online Learning

arXiv.org Artificial Intelligence

Missing feature values are a significant hurdle for downstream machine-learning tasks such as classification and regression. However, they are pervasive in multiple real-life use cases, for instance, in drug discovery research. Moreover, imputation methods might be time-consuming and offer few guarantees on the imputation quality, especially for not-missing-at-random mechanisms. We propose an imputation approach named F3I based on the iterative improvement of a K-nearest neighbor imputation that learns the weights for each neighbor of a data point, optimizing for the most likely distribution of points over data points. This algorithm can also be jointly trained with a downstream task on the imputed values. We provide a theoretical analysis of the imputation quality by F3I for several types of missing mechanisms. We also demonstrate the performance of F3I on both synthetic data sets and real-life drug repurposing and handwritten-digit recognition data.


Multivariate Functional Linear Discriminant Analysis for the Classification of Short Time Series with Missing Data

arXiv.org Artificial Intelligence

Functional linear discriminant analysis (FLDA) is a powerful tool that extends LDA-mediated multiclass classification and dimension reduction to univariate time-series functions. However, in the age of large multivariate and incomplete data, statistical dependencies between features must be estimated in a computationally tractable way, while also dealing with missing data. There is a need for a computationally tractable approach that considers the statistical dependencies between features and can handle missing values. We here develop a multivariate version of FLDA (MUDRA) to tackle this issue and describe an efficient expectation/conditional-maximization (ECM) algorithm to infer its parameters. We assess its predictive power on the "Articulary Word Recognition" data set and show its improvement over the state-of-the-art, especially in the case of missing data. MUDRA allows interpretable classification of data sets with large proportions of missing data, which will be particularly useful for medical or psychological data sets.


An Anytime Algorithm for Good Arm Identification

arXiv.org Machine Learning

In good arm identification (GAI), the goal is to identify one arm whose average performance exceeds a given threshold, referred to as good arm, if it exists. Few works have studied GAI in the fixed-budget setting, when the sampling budget is fixed beforehand, or the anytime setting, when a recommendation can be asked at any time. We propose APGAI, an anytime and parameter-free sampling rule for GAI in stochastic bandits. APGAI can be straightforwardly used in fixed-confidence and fixed-budget settings. First, we derive upper bounds on its probability of error at any time. They show that adaptive strategies are more efficient in detecting the absence of good arms than uniform sampling. Second, when APGAI is combined with a stopping rule, we prove upper bounds on the expected sampling complexity, holding at any confidence level. Finally, we show good empirical performance of APGAI on synthetic and real-world data. Our work offers an extensive overview of the GAI problem in all settings.


Dealing With Misspecification In Fixed-Confidence Linear Top-m Identification

arXiv.org Artificial Intelligence

We study the problem of the identification of m arms with largest means under a fixed error rate $\delta$ (fixed-confidence Top-m identification), for misspecified linear bandit models. This problem is motivated by practical applications, especially in medicine and recommendation systems, where linear models are popular due to their simplicity and the existence of efficient algorithms, but in which data inevitably deviates from linearity. In this work, we first derive a tractable lower bound on the sample complexity of any $\delta$-correct algorithm for the general Top-m identification problem. We show that knowing the scale of the deviation from linearity is necessary to exploit the structure of the problem. We then describe the first algorithm for this setting, which is both practical and adapts to the amount of misspecification. We derive an upper bound to its sample complexity which confirms this adaptivity and that matches the lower bound when $\delta$ $\rightarrow$ 0. Finally, we evaluate our algorithm on both synthetic and real-world data, showing competitive performance with respect to existing baselines.


Top-m identification for linear bandits

arXiv.org Artificial Intelligence

Motivated by an application to drug repurposing, we propose the first algorithms to tackle the identification of the m $\ge$ 1 arms with largest means in a linear bandit model, in the fixed-confidence setting. These algorithms belong to the generic family of Gap-Index Focused Algorithms (GIFA) that we introduce for Top-m identification in linear bandits. We propose a unified analysis of these algorithms, which shows how the use of features might decrease the sample complexity. We further validate these algorithms empirically on simulated data and on a simple drug repurposing task.