Quantifying the Dissimilarity of Texts
Benjamin Shade, Eduardo G. Altmann
arXiv.org Artificial Intelligence
Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures $D$ using three different representations of texts -- vocabularies, word frequency distributions, and vector embeddings -- and three simple tasks -- clustering texts by author, subject, and time period. Using the Project Gutenberg database, we found that the generalised Jensen--Shannon divergence applied to word frequencies performed strongly across all tasks, that $D$'s based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different $D$'s when the two texts varied in length by a factor $h$. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the $h$-dependency of the bias of the estimator of the generalised Jensen--Shannon divergence applied to word frequencies. We also found numerically that the Jensen--Shannon divergence and embedding-based approaches were robust to changes in $h$, while the Jaccard distance was not.
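As a rough illustration of two of the measures named in the abstract, the sketch below computes a Jaccard distance between vocabularies and a Jensen--Shannon divergence between word frequency distributions for two token lists. It is not the authors' code. The function names are hypothetical, and the order-$\alpha$ entropy used for the generalised divergence, $H_\alpha(p) = \frac{1}{1-\alpha}\left(\sum_i p_i^\alpha - 1\right)$, is an assumption about the intended generalisation, since the abstract does not state the definition. Setting `alpha=1` gives the standard Jensen--Shannon divergence in nats. Both texts are assumed to be non-empty.

```python
from collections import Counter
import math


def jaccard_distance(tokens_a, tokens_b):
    """Jaccard distance between the vocabularies (sets of word types)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def entropy(probs, alpha=1.0):
    """Order-alpha entropy (assumed form); alpha = 1 is Shannon entropy in nats."""
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (sum(p ** alpha for p in probs if p > 0) - 1.0) / (1.0 - alpha)


def js_divergence(tokens_a, tokens_b, alpha=1.0):
    """Jensen-Shannon divergence between the word frequencies of two texts."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    words = set(freq_a) | set(freq_b)
    # Relative frequencies over the joint vocabulary (Counter gives 0 if absent).
    p = [freq_a[w] / n_a for w in words]
    q = [freq_b[w] / n_b for w in words]
    # Equal-weight mixture of the two distributions.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return entropy(m, alpha) - 0.5 * entropy(p, alpha) - 0.5 * entropy(q, alpha)
```

Both functions take plain lists of tokens, so a call such as `js_divergence(text_a.split(), text_b.split())` is enough for a quick comparison. The Jaccard distance depends only on which word types occur, while the divergence also uses how often each one occurs.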
May-3-2023