random partition
An RKHS Perspective on Tree Ensembles
Dagdoug, Mehdi, Dombry, Clement, Duchamps, Jean-Jil
Random Forests and Gradient Boosting are among the most effective algorithms for supervised learning on tabular data. Both belong to the class of tree-based ensemble methods, where predictions are obtained by aggregating many randomized regression trees. In this paper, we develop a theoretical framework for analyzing such methods through Reproducing Kernel Hilbert Spaces (RKHSs) constructed on tree ensembles – more precisely, on the random partitions generated by randomized regression trees. We establish fundamental analytical properties of the resulting Random Forest kernel, including boundedness, continuity, and universality, and show that a Random Forest predictor can be characterized as the unique minimizer of a penalized empirical risk functional in this RKHS, providing a variational interpretation of ensemble learning. We further extend this perspective to the continuous-time formulation of Gradient Boosting introduced by Dombry and Duchamps (2024a,b), and demonstrate that it corresponds to a gradient flow on a Hilbert manifold induced by the Random Forest RKHS. A key feature of this framework is that both the kernel and the RKHS geometry are data-dependent, offering a theoretical explanation for the strong empirical performance of tree-based ensembles. Finally, we illustrate the practical potential of this approach by introducing a kernel principal component analysis built on the Random Forest kernel, which enhances the interpretability of ensemble models, as well as GVI, a new geometric variable importance criterion.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > France (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
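The Random Forest kernel described in this abstract is commonly defined as the fraction of trees whose induced partition places two points in the same leaf. A minimal sketch, assuming scikit-learn's `RandomForestRegressor` (the paper's exact construction may differ, e.g. in how trees are randomized or weighted); the `random_forest_kernel` helper is hypothetical:

```python
# Sketch: K(x, x') = fraction of trees in the ensemble whose partition
# puts x and x' in the same leaf. Uses RandomForestRegressor.apply,
# which returns each sample's leaf index in every tree.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def random_forest_kernel(forest, X, Z):
    """Gram matrix K[i, j] = fraction of trees where X[i] and Z[j] share a leaf."""
    leaves_X = forest.apply(X)  # shape (n_X, n_trees)
    leaves_Z = forest.apply(Z)  # shape (n_Z, n_trees)
    # Compare leaf indices tree by tree, then average over trees.
    return (leaves_X[:, None, :] == leaves_Z[None, :, :]).mean(axis=2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

K = random_forest_kernel(forest, X[:5], X[:5])
# By construction the kernel is bounded in [0, 1] and K(x, x) = 1,
# consistent with the boundedness property the abstract establishes.
```

Note how data-dependence enters: the partitions, and hence the kernel, are learned from the training sample, which is the feature the abstract highlights.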
Large-scale entity resolution via microclustering Ewens–Pitman random partitions
Beraha, Mario, Favaro, Stefano
We introduce the microclustering Ewens–Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens–Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens–Pitman random partition and the Pitman–Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
- North America > United States (0.14)
- Asia > Middle East > Jordan (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.81)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)
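The scaling described in this abstract can be illustrated with the standard sequential (Chinese-restaurant-style) sampler for the two-parameter Ewens–Pitman(σ, θ) partition. A minimal sketch: the linear scaling θ = 2.0·n is an arbitrary illustrative choice, and the paper's variational inference machinery is not reproduced here:

```python
# Sequential sampler for the Ewens–Pitman(sigma, theta) random partition:
# observation i+1 joins an existing cluster of size n_j with probability
# (n_j - sigma) / (i + theta), or starts a new cluster with probability
# (theta + sigma * k) / (i + theta), where k is the current cluster count.
import numpy as np

def ewens_pitman_partition(n, sigma, theta, rng):
    sizes = [1]  # the first observation starts the first cluster
    for i in range(1, n):
        probs = np.array(sizes + [theta + sigma * len(sizes)], dtype=float)
        probs[:-1] -= sigma     # existing cluster j has weight n_j - sigma
        probs /= i + theta      # weights sum to i + theta
        j = rng.choice(len(probs), p=probs)
        if j == len(sizes):
            sizes.append(1)
        else:
            sizes[j] += 1
    return np.array(sizes)

rng = np.random.default_rng(42)
n = 5000
# Microclustering regime: the strength parameter grows linearly with n.
sizes = ewens_pitman_partition(n, sigma=0.5, theta=2.0 * n, rng=rng)
print(len(sizes), sizes.max())  # many clusters, all of them small
```

With θ proportional to n, the probability of opening a new cluster stays bounded away from zero at every step, so the number of clusters grows linearly while each individual cluster stays small, matching the microclustering property the abstract states.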
Predictive Coresets
We propose a construction of coresets based on a predictive view of Bayesian posterior inference (Fong et al., 2024; Fortini and Petrone, 2012). The main attraction of the approach is its model-agnostic nature: the method is valid under any inference model and is independent of the specific inference goals, making it highly adaptable to a wide range of applications. Such adaptability is particularly valuable for the large-scale datasets now commonplace in fields like genomics and astronomy. While this explosion of data offers remarkable opportunities for discovery, it also brings significant computational challenges. Tasks that were once straightforward, such as repeated likelihood evaluations, have become increasingly difficult, making traditional data-processing methods impractical. These obstacles have frequently pushed practitioners toward simpler statistical models that may not capture the full complexity of the data, forgoing the expressiveness and flexibility that rich hierarchical and nonparametric models can offer.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
ca46c1b9512a7a8315fa3c5a946e8265-Reviews.html
Alternatively, if the algorithm performs well against any of the existing AC ones, then it would be good to see its performance in various settings, along with a detailed explanation of its properties. The latter would be useful in its own right, whether or not it is related to summarizing a posterior distribution's characteristics. A few specific comments: Line 19: It is correct to say that the MAP might not be good in situations where the posterior is diffuse; I guess you mean uniform-like? It might also be multimodal, skewed (mode and mean differ), or high-variance, for example, so the MAP is not necessarily a good choice, although this depends on the (underlying) loss function. Line 52: Since a Dirichlet process is a discrete random probability measure, sampling from it induces a random partition (the ties will belong to the same cluster).
Flexible Models for Microclustering with Application to Entity Resolution
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
- Asia > Middle East > Syria (0.14)
- North America > United States (0.14)
- Europe > Italy (0.05)
- (2 more...)
- Government (0.68)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.84)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Representation Learning via Consistent Assignment of Views over Random Partitions
Silva, Thalles, Rivera, Adín Ramírez
CARP learns prototypes in an end-to-end online fashion using gradient descent, without additional non-differentiable modules to solve the cluster assignment problem. CARP optimizes a new pretext task based on random partitions of prototypes that regularizes the model and enforces consistency between views' assignments. Additionally, our method improves training stability and prevents collapsed solutions in joint-embedding training. Through an extensive evaluation, we demonstrate that CARP's representations are suitable for learning downstream tasks. We evaluate the capabilities of CARP's representations on 17 datasets across many standard protocols, including linear evaluation, few-shot classification, k-NN, k-means, image retrieval, and copy detection. We compare CARP's performance to that of 11 existing self-supervised methods. We extensively ablate our method and demonstrate that the proposed random partition pretext task improves the quality of the learned representations by devising multiple random classification tasks. In transfer learning tasks, CARP achieves the best performance on average against many SSL methods trained for longer.
- North America > Canada > Ontario > Toronto (0.14)
- South America > Brazil (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
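The random-partition pretext task described in this abstract can be sketched in a few lines: prototype indices are randomly split into blocks, each view's assignment is computed independently within every block, and a cross-entropy term enforces consistency between the two views. A minimal NumPy sketch assuming this reading; the paper's actual loss, block sizes, target sharpening, and stop-gradient details may differ, and `random_partition_consistency_loss` is a hypothetical helper name:

```python
# Sketch of a random-partition consistency loss: restrict the prototype
# assignment problem to random blocks of prototypes, then penalize
# disagreement between two augmented views inside each block.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def random_partition_consistency_loss(logits_a, logits_b, n_blocks, rng):
    """Symmetric cross-entropy between the two views' assignments,
    computed independently inside each random block of prototypes."""
    n_protos = logits_a.shape[1]
    blocks = np.array_split(rng.permutation(n_protos), n_blocks)
    loss = 0.0
    for block in blocks:
        p = softmax(logits_a[:, block])  # view 1 assignment over the block
        q = softmax(logits_b[:, block])  # view 2 assignment over the block
        # Treat each view in turn as the target for the other.
        loss += -0.5 * np.mean(np.sum(p * np.log(q + 1e-12), axis=1))
        loss += -0.5 * np.mean(np.sum(q * np.log(p + 1e-12), axis=1))
    return loss / n_blocks

rng = np.random.default_rng(0)
logits_a = rng.normal(size=(8, 64))                    # 8 samples, 64 prototypes
logits_b = logits_a + 0.01 * rng.normal(size=(8, 64))  # slightly perturbed view
loss = random_partition_consistency_loss(logits_a, logits_b, n_blocks=4, rng=rng)
```

Each draw of the partition yields a different small classification problem over a subset of prototypes, which is one way to read the abstract's claim that the task "devis[es] multiple random classification tasks" while remaining fully differentiable.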