When do random forests fail?

Neural Information Processing Systems

Random forests are learning algorithms that build large collections of random trees and make predictions by averaging the individual tree predictions. In this paper, we consider various tree constructions and examine how the choice of parameters affects the generalization error of the resulting random forests as the sample size goes to infinity. We show that subsampling of data points during the tree construction phase is important: forests can become inconsistent with either no subsampling or too severe subsampling. As a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent. As a second consequence, we show that trees that perform well in nearest-neighbor search can be a poor choice for random forests.
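The subsampling effect described above can be illustrated with a minimal numpy sketch. This is not the paper's construction: it uses maximally randomized depth-1 trees ("stumps") on a 1-D toy problem, with a `subsample` parameter controlling how many points each tree sees (all names here are hypothetical).

```python
import numpy as np

def fit_stump(x, y, rng):
    # A maximally randomized tree of depth 1: the split point is drawn
    # uniformly from the subsample's inputs, not optimized against y.
    t = rng.choice(x)
    left, right = y[x <= t], y[x > t]
    lm = left.mean() if left.size else y.mean()
    rm = right.mean() if right.size else y.mean()
    return t, lm, rm

def forest_predict(x_train, y_train, x_test, n_trees=200, subsample=None, rng=None):
    # Average the predictions of n_trees stumps, each fit on a random
    # subsample (without replacement); subsample=None means no subsampling.
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(x_train)
    s = subsample if subsample is not None else n
    preds = np.zeros_like(x_test, dtype=float)
    for _ in range(n_trees):
        idx = rng.choice(n, size=s, replace=False)
        t, lm, rm = fit_stump(x_train[idx], y_train[idx], rng)
        preds += np.where(x_test <= t, lm, rm)
    return preds / n_trees

# Toy 1-D regression target: a step function at 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = (x > 0.5).astype(float)
x_test = np.array([0.05, 0.95])

# Moderate subsampling, e.g. s = n^0.7, one of the regimes the paper's
# consistency/inconsistency dichotomy is about.
pred = forest_predict(x, y, x_test, subsample=int(len(x) ** 0.7))
```

Averaging many randomized stumps smooths the step, and the forest still separates the two sides of the jump; varying `subsample` between `len(x)` (no subsampling) and very small values is a quick way to see how the averaging behavior changes.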


Pragmatic by design: Engineering AI for the real world

MIT Technology Review

In physical systems where errors carry tangible consequences, AI creates value through reliability and first-time-right performance. The impact of artificial intelligence extends far beyond the digital world and into our everyday lives, across the cars we drive, the appliances in our homes, and the medical devices that keep people alive. More and more, product engineers are turning to AI to enhance, validate, and streamline the design of the items that furnish our world. The use of AI in product engineering follows a disciplined and pragmatic trajectory. A significant majority of engineering organizations are increasing their AI investment, according to our survey, but they are doing so in a measured way. This approach reflects the priorities typical of product engineers.


Low-degree lower bounds for clustering in moderate dimension

Carpentier, Alexandra, Verzelen, Nicolas

arXiv.org Machine Learning

We study the fundamental problem of clustering $n$ points into $K$ groups drawn from a mixture of isotropic Gaussians in $\mathbb{R}^d$. Specifically, we investigate the requisite minimal distance $\Delta$ between mean vectors to partially recover the underlying partition. While the minimax-optimal threshold for $\Delta$ is well-established, a significant gap exists between this information-theoretic limit and the performance of known polynomial-time procedures. Although this gap was recently characterized in the high-dimensional regime ($n \leq dK$), it remains largely unexplored in the moderate-dimensional regime ($n \geq dK$). In this manuscript, we address this regime by establishing a new low-degree polynomial lower bound for the moderate-dimensional case when $d \geq K$. We show that while the difficulty of clustering for $n \leq dK$ is primarily driven by dimension reduction and spectral methods, the moderate-dimensional regime involves more delicate phenomena leading to a "non-parametric rate". We provide a novel non-spectral algorithm matching this rate, shedding new light on the computational limits of the clustering problem in moderate dimension.
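The setting above can be made concrete with a short numpy sketch (not the paper's algorithm): draw $n$ points from an isotropic Gaussian mixture in the moderate-dimensional regime $n \geq dK$, compute the separation $\Delta$ between the means, and check that recovery is easy when $\Delta$ is large. The specific means and sizes below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, d = 600, 3, 10  # moderate-dimensional regime: n >= d*K

# Isotropic Gaussian mixture: mean of cluster k is 10*e_k, unit-variance noise.
means = 10.0 * np.eye(K, d)
labels = rng.integers(0, K, size=n)
X = means[labels] + rng.standard_normal((n, d))

# Minimal separation Delta between the mean vectors (here 10*sqrt(2)).
Delta = min(np.linalg.norm(means[i] - means[j])
            for i in range(K) for j in range(i + 1, K))

# With Delta this large, even nearest-mean assignment (oracle means)
# recovers the partition almost perfectly; the hard question studied in
# the paper is how small Delta can be while polynomial-time recovery
# remains possible.
dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
accuracy = (dists.argmin(1) == labels).mean()
```

The misclassification probability of nearest-mean assignment scales like the Gaussian tail at $\Delta/2$, which is astronomically small here; the interesting regimes are those where $\Delta$ approaches the computational threshold.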



Theoretical guarantees in KL for Diffusion Flow Matching

Neural Information Processing Systems

A central task in statistics and machine learning is to generate samples from a target distribution that is accessible only through a dataset.




Leveraging the two-timescale regime to demonstrate convergence of neural networks

Neural Information Processing Systems

Artificial neural networks are among the most successful modern machine learning methods, in particular because their non-linear parametrization provides a flexible way to implement feature learning (see, e.g., Goodfellow et al., 2016, chapter 15).
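The two-timescale regime named in the title can be sketched in a few lines of numpy (a generic illustration, not the paper's analysis): train a two-layer network by gradient descent with a much larger step size on the output weights than on the inner-layer weights, so the outer layer equilibrates quickly relative to the slowly moving features. The learning rates and architecture below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and a two-layer ReLU network f(x) = a^T relu(W x).
n, d, m = 200, 2, 20
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])  # simple nonlinear target

W = rng.standard_normal((m, d))  # inner-layer weights (slow timescale)
a = np.zeros(m)                  # output weights (fast timescale)

eta_fast, eta_slow = 0.05, 0.001  # two timescales: a moves much faster than W

def loss():
    h = np.maximum(X @ W.T, 0.0)
    return 0.5 * np.mean((h @ a - y) ** 2)

loss_init = loss()
for _ in range(500):
    h = np.maximum(X @ W.T, 0.0)   # (n, m) hidden activations
    r = h @ a - y                  # residuals
    grad_a = h.T @ r / n
    grad_W = ((r[:, None] * (h > 0)) * a).T @ X / n  # (m, d)
    a -= eta_fast * grad_a
    W -= eta_slow * grad_W
loss_final = loss()
```

On the fast timescale the output layer approximately solves a least-squares problem for the current features; the slow inner-layer dynamics then drive the feature learning, which is the structure such convergence analyses exploit.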