Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

Ignat, Oana, Bai, Longju, Nwatu, Joan, Mihalcea, Rada

Mar-12-2024–arXiv.org Artificial Intelligence

Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs.
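The two selection steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the embedding model or the similarity measure, so this example assumes precomputed image embeddings per country for a single topic and uses cosine similarity between mean embeddings. The function names, the toy vectors, and the placeholder country labels "A", "B", "C" are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_embedding(vectors):
    """Average a set of image embeddings and L2-normalise the result."""
    m = np.mean(np.asarray(vectors, dtype=float), axis=0)
    return m / np.linalg.norm(m)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_distinct(country_embs, reference_embs, k=1):
    """Step 1 (assumed form): countries whose images of a topic are
    least similar to the reference (existing training) data."""
    ref = mean_embedding(reference_embs)
    scores = {c: cosine(mean_embedding(v), ref) for c, v in country_embs.items()}
    return sorted(scores, key=scores.get)[:k]

def most_similar(target, country_embs, k=1):
    """Step 2 (assumed form): countries whose images of the same topic
    are most similar to the target country's images."""
    t = mean_embedding(country_embs[target])
    scores = {
        c: cosine(mean_embedding(v), t)
        for c, v in country_embs.items()
        if c != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy 2-D embeddings for one topic; purely illustrative values.
toy = {
    "A": [[1.0, 0.0], [0.9, 0.1]],
    "B": [[0.0, 1.0], [0.1, 0.9]],
    "C": [[0.2, 0.8], [0.3, 0.7]],
}
reference = [[1.0, 0.0], [0.8, 0.2]]

distinct = most_distinct(toy, reference, k=1)
neighbours = most_similar(distinct[0], toy, k=1)
```

In this toy setup, the country returned by `most_distinct` would be a candidate for new annotation, and the countries returned by `most_similar` would be candidates for supplementing its training data at lower annotation cost.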

dataset, representation, similarity

Country:
- South America
  - Brazil (0.05)
  - Argentina (0.05)
  - Colombia (0.05)
  - Bolivia (0.05)
  - Peru (0.05)
  - Venezuela (0.04)
- North America
  - Haiti (0.05)
  - Mexico (0.05)
  - United States
    - Michigan (0.04)
    - New York > New York County
      - New York City (0.04)
  - Canada
    - Quebec > Capitale-Nationale Region
      - Québec (0.04)
      - Quebec City (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Austria (0.05)
  - Spain (0.05)
  - Netherlands (0.05)
  - Italy (0.05)
  - Serbia (0.05)
  - Romania (0.05)
  - Czechia (0.05)
  - Ukraine (0.05)
  - France (0.05)
  - United Kingdom (0.04)
  - Bulgaria (0.04)
  - Denmark (0.04)
  - Western Europe (0.04)
  - Switzerland > Zürich
    - Zürich (0.14)
- Asia
  - Nepal (0.05)
  - Pakistan (0.05)
  - India (0.05)
  - Japan (0.05)
  - Myanmar (0.05)
  - Philippines (0.05)
  - Thailand (0.05)
  - South Korea (0.05)
  - Bangladesh (0.05)
  - Vietnam (0.05)
  - Cambodia (0.05)
  - China > Hong Kong (0.05)
  - East Asia (0.04)
  - Indonesia > Bali (0.04)
  - Middle East
    - UAE (0.14)
    - Jordan (0.05)
    - Saudi Arabia (0.05)
    - Republic of Türkiye (0.05)
    - Iran (0.05)
- Africa
  - Burundi (0.05)
  - Malawi (0.05)
  - Nigeria (0.05)
  - South Africa (0.05)
  - Burkina Faso (0.05)
  - Rwanda (0.05)
  - Zimbabwe (0.05)
  - Côte d'Ivoire (0.05)
  - Cameroon (0.05)
  - Tanzania (0.05)
  - Kenya (0.05)
  - Ethiopia (0.05)
  - Togo (0.05)
  - Middle East
    - Egypt (0.05)
    - Tunisia (0.05)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning
    - Statistical Learning (0.93)
    - Neural Networks > Deep Learning (0.46)
