Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics

Valentini, Francisco; Sosa, Juan Cruz; Slezak, Diego Fernandez; Altszyler, Edgar

Oct-19-2023–arXiv.org Artificial Intelligence

Recent research has shown that static word embeddings can encode word frequency information. However, this phenomenon and its effects on downstream tasks have received little study. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled. This proves that the patterns found are not due to real semantic associations present in the texts, but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly impact the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment showing that biases can even change sign or reverse their order when word frequencies are manipulated.
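The two ideas in the abstract, the shuffled-text control and an embedding-based bias measurement, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state which bias metric or shuffling procedure the authors use, so the sketch uses a common difference-of-mean-cosine-similarities score and a plain token-level shuffle. The function names (`cosine`, `bias_score`, `shuffle_corpus`) and the toy 2-d vectors are hypothetical and not taken from the paper.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bias_score(word_vec, group_a, group_b):
    """Mean cosine similarity to group A minus mean similarity to group B.

    Assumed metric for illustration; positive values mean the word is
    closer to group A, negative values mean it is closer to group B.
    """
    sim_a = sum(cosine(word_vec, a) for a in group_a) / len(group_a)
    sim_b = sum(cosine(word_vec, b) for b in group_b) / len(group_b)
    return sim_a - sim_b


def shuffle_corpus(tokens, seed=0):
    """Return a shuffled copy of the token list.

    Shuffling destroys co-occurrence structure but leaves every word's
    frequency unchanged, which is what makes it usable as a control.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


# Shuffled control: word frequencies are identical before and after.
tokens = "she is a nurse and he is a doctor".split()
shuffled = shuffle_corpus(tokens)
same_counts = Counter(tokens) == Counter(shuffled)

# Toy bias measurement with made-up 2-d vectors (not real embeddings).
female_vecs = [[1.0, 0.0]]
male_vecs = [[0.0, 1.0]]
target_vec = [3.0, 4.0]
score = bias_score(target_vec, female_vecs, male_vecs)
```

In a real experiment the vectors would come from embeddings trained once on the original corpus and once on the shuffled corpus. If the similarity patterns persist after shuffling, they cannot reflect genuine semantic associations, because the shuffle preserves only frequencies.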


Country:
- South America > Argentina
  - Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- Europe
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Middle East > Malta
    - Port Region > Southern Harbour District > Valletta (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Japan > Honshū
    - Kansai > Osaka Prefecture > Osaka (0.04)

Genre:
- Research Report
  - Experimental Study (0.87)
  - New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Text Processing (0.89)
