Large-scale entity resolution via microclustering Ewens--Pitman random partitions
Beraha, Mario, Favaro, Stefano
We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
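The scaling described in the abstract can be illustrated with a small simulation. The sketch below draws a partition from the standard sequential (Chinese restaurant) construction of the Ewens--Pitman model with discount `alpha` and strength `theta`, and then sets `theta = lam * n`. The names `alpha`, `lam`, `sample_ewens_pitman`, and `microclustering_partition` are assumptions made for illustration, and the parametrization `theta = lam * n` is one reading of "scaling the strength parameter linearly with the sample size". This is a forward sampler only; it is not the variational inference scheme developed in the paper.

```python
import random


def sample_ewens_pitman(n, alpha, theta, rng):
    """Return the cluster sizes of one Ewens--Pitman partition of n items.

    Uses the sequential two-parameter Chinese restaurant construction:
    with i items seated in k clusters, the next item opens a new cluster
    with probability (theta + alpha * k) / (theta + i), and otherwise joins
    cluster j with probability proportional to (size_j - alpha).
    Assumes 0 <= alpha < 1 and theta > 0.
    """
    sizes = []
    for i in range(n):  # i items are already seated
        k = len(sizes)
        p_new = (theta + alpha * k) / (theta + i)
        if rng.random() < p_new:
            sizes.append(1)
            continue
        # Pick an existing cluster with weight (size - alpha).
        u = rng.random() * (i - alpha * k)
        acc = 0.0
        for j, s in enumerate(sizes):
            acc += s - alpha
            if u < acc:
                sizes[j] += 1
                break
        else:
            # Guard against floating-point round-off at the upper edge.
            sizes[-1] += 1
    return sizes


def microclustering_partition(n, alpha=0.25, lam=1.0, seed=0):
    """Hypothetical helper: strength parameter scaled as theta = lam * n."""
    rng = random.Random(seed)
    return sample_ewens_pitman(n, alpha, lam * n, rng)


if __name__ == "__main__":
    # Report the cluster-count and largest-cluster fractions as n grows.
    for n in (250, 500, 1000, 2000):
        sizes = microclustering_partition(n)
        print(n, len(sizes) / n, max(sizes) / n)
```

Under these assumptions, the ratio of the number of clusters to `n` should stay roughly constant as `n` grows, while the ratio of the largest cluster size to `n` should shrink, which is the behavior the microclustering property describes.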
Jul-25-2025
- Country:
- Asia > Middle East
- Jordan (0.14)
- Europe
- Italy
- Lombardy > Milan (0.04)
- Piedmont > Turin Province
- Turin (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America > United States (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Health & Medicine (1.00)