Unsupervised Document and Template Clustering using Multimodal Embeddings
Sampaio, Phillipe R., Maxcici, Helene
arXiv.org Artificial Intelligence
We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
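The $k$-NN assignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a density-based clusterer has already produced labels in which unclustered points carry the label -1 (the convention HDBSCAN implementations commonly use), and that each such point is given the majority label of its $k$ nearest clustered neighbours under Euclidean distance. The function name `assign_noise_by_knn`, the toy vectors, and the choice of distance and voting rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import dist

NOISE = -1  # assumed label for points the density-based clusterer left unclustered


def assign_noise_by_knn(vectors, labels, k=3):
    """Return a copy of `labels` where each noise point takes the majority
    label of its k nearest already-clustered neighbours (Euclidean distance)."""
    clustered = [(v, l) for v, l in zip(vectors, labels) if l != NOISE]
    if not clustered:
        return list(labels)  # no clustered points to vote with; leave unchanged
    result = []
    for v, l in zip(vectors, labels):
        if l != NOISE:
            result.append(l)  # keep labels the clusterer already assigned
            continue
        # Sort clustered points by distance to the noise point and keep the k closest.
        neighbours = sorted(clustered, key=lambda item: dist(v, item[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        result.append(votes.most_common(1)[0][0])
    return result


# Toy 2-D "document vectors": two tight clusters plus two points left as noise.
vectors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster 0
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # cluster 1
    (0.2, 0.2), (4.8, 4.9),               # unclustered
]
labels = [0, 0, 0, 1, 1, 1, NOISE, NOISE]
full_labels = assign_noise_by_knn(vectors, labels, k=3)
print(full_labels)
```

In practice the vectors would be the pooled document embeddings and the initial labels would come from the clustering stage; the brute-force neighbour search here is only suitable for small examples.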
Oct-28-2025
- Country:
- Asia > Myanmar
- Tanintharyi Region > Dawei (0.04)
- Europe
- France > Île-de-France
- Hauts-de-Seine > Nanterre (0.04)
- Paris > Paris (0.04)
- Switzerland (0.04)
- North America > United States
- North Dakota > McKenzie County (0.04)
- South America > Brazil
- Rio Grande do Sul > Porto Alegre (0.04)
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning
- Neural Networks (1.00)
- Statistical Learning > Clustering (1.00)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Data Science > Data Mining (1.00)