Information-Theoretic Generative Clustering of Documents
–arXiv.org Artificial Intelligence
We present generative clustering (GC), which clusters a set of documents $\mathrm{X}$ using texts $\mathrm{Y}$ generated by large language models (LLMs) rather than the original documents $\mathrm{X}$ themselves. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm based on importance sampling. We show that GC achieves state-of-the-art performance, outperforming previous clustering methods, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering, and our method improves retrieval accuracy.
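To make the KL-divergence and importance-sampling idea concrete, here is a minimal sketch of a self-normalized importance-sampling estimate of KL(p || q). It is not the paper's algorithm: the function name `kl_importance_sampling`, the three-outcome toy distributions, and the mixture proposal are all assumptions made for illustration, with small categorical distributions standing in for the LLM's distributions over generated texts.

```python
import numpy as np


def kl_importance_sampling(log_p, log_q, log_proposal):
    """Estimate KL(p || q) from samples drawn from a proposal distribution.

    All three arguments are 1-D arrays evaluated at the same samples y_j:
    log p(y_j), log q(y_j), and the log-probability of y_j under the proposal.
    (Hypothetical helper for illustration only.)
    """
    log_w = log_p - log_proposal      # unnormalized log importance weights
    log_w = log_w - log_w.max()       # shift for numerical stability
    w = np.exp(log_w)
    w = w / w.sum()                   # self-normalize the weights
    return float(np.sum(w * (log_p - log_q)))


# Toy stand-ins for two documents' distributions over generated texts.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
proposal = 0.5 * (p + q)              # assumed proposal: an even mixture

rng = np.random.default_rng(0)
samples = rng.choice(3, size=20000, p=proposal)

estimate = kl_importance_sampling(
    np.log(p[samples]), np.log(q[samples]), np.log(proposal[samples])
)
exact = float(np.sum(p * np.log(p / q)))  # closed form for comparison
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

On this toy problem the estimate can be checked against the closed-form value. With an LLM, the exact sum over all possible texts is intractable, which is why a sampling-based estimate is needed at all.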
Dec-18-2024
- Country:
- North America > United States
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Asia
- Middle East > Jordan (0.05)
- Japan > Honshū
- Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Genre:
- Research Report > New Finding (1.00)
- Technology: