Gamma Mixture Modeling for Cosine Similarity in Small Language Models

Oct-8-2025–arXiv.org Artificial Intelligence

We study the cosine similarities of sentence transformer embeddings and observe that they are well modeled by gamma mixtures. From a fixed corpus, we measure the similarity between every document embedding and a reference query embedding. Empirically, we find that these distributions are often well captured by a gamma distribution shifted and truncated to [-1, 1] and, in many cases, by a gamma mixture. We propose a heuristic model in which a hierarchical clustering of topics naturally leads to a gamma-mixture structure in the similarity scores. Finally, we outline an expectation-maximization algorithm for fitting shifted gamma mixtures, which provides a practical tool for modeling similarity distributions.
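The measurement step described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it fits a single shifted gamma by the method of moments rather than a mixture by expectation-maximization, it ignores the truncation at 1, and it assumes the shift places the gamma's origin at -1 (the abstract does not state the shift). The random vectors stand in for real sentence transformer embeddings, and the names `cosine_similarity` and `fit_shifted_gamma_moments` are hypothetical.

```python
import math
import random
import statistics


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_shifted_gamma_moments(similarities, shift=-1.0):
    """Method-of-moments fit of Gamma(shape, scale) to (s - shift).

    Assumption: the gamma's origin sits at `shift` (here -1), and the
    truncation at 1 is ignored. This is a simple stand-in, not EM.
    """
    x = [s - shift for s in similarities]
    mean = statistics.fmean(x)
    var = statistics.pvariance(x, mu=mean)
    shape = mean * mean / var   # k = mean^2 / variance
    scale = var / mean          # theta = variance / mean
    return shape, scale


# Placeholder data: random Gaussian vectors instead of real embeddings.
rng = random.Random(0)
dim = 16
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
docs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(500)]

# Similarity of every document embedding to the single reference query.
sims = [cosine_similarity(query, d) for d in docs]
shape, scale = fit_shifted_gamma_moments(sims)
print(f"shape={shape:.3f}, scale={scale:.3f}")
```

By construction, the fitted `shape * scale` equals the mean of the shifted similarities, which gives a quick sanity check before moving on to a mixture fit.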

machine learning, natural language, public release and unlimited distribution
