Data Representativity for Machine Learning and AI Systems

Clemmensen, Line H., Kjærsgaard, Rune D.

Feb-3-2023–arXiv.org Artificial Intelligence

These automated decision frameworks have demonstrated various unwanted consequences as a result of biased data [11, 66-68, 84, 86, 109]. Such systems are often trained on samples (datasets) drawn from a larger population. Biased results can arise if the sample does not accurately represent the target population, or if subgroups are insufficiently represented in the data. While the literature on data bias in machine learning and artificial intelligence (AI) systems is rich [99], there is only limited work on the connections between data representativity and AI systems. Terms like "representative sample" are used ubiquitously in the literature, often without specifying what this representativity entails or what effects it has. This paper surveys and analyzes data representativity in the scientific literature on machine learning and AI systems by investigating how different notions of representativity are used and what effects adhering to each notion has on appropriate inference. The term "representative sample" is overloaded, and a generally accepted definition of what constitutes a representative sample (subset of observations) is hard to find in the literature. A few examples demonstrate that at least a couple of definitions exist. The most general definition we found is from D'Excelle (2014), which states: "'Representative sampling' is a type of statistical sampling that allows us to use data from a sample to make conclusions that are representative for the population from which the sample is taken."
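One of the notions mentioned above, that a sample should reflect the subgroup composition of the target population, can be illustrated with a minimal sketch. This is not a method from the paper: the function names, the subgroup labels, and all numeric shares below are hypothetical and chosen only for illustration. The sketch compares each subgroup's share in a sample with its assumed share in the population.

```python
from collections import Counter


def subgroup_proportions(labels):
    """Return the share of each subgroup label in a non-empty list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def representation_gaps(sample_labels, population_shares):
    """Return sample share minus population share for every subgroup.

    A positive gap means the subgroup is over-represented in the sample,
    a negative gap means it is under-represented.
    """
    sample_shares = subgroup_proportions(sample_labels)
    groups = set(sample_shares) | set(population_shares)
    return {
        group: sample_shares.get(group, 0.0) - population_shares.get(group, 0.0)
        for group in sorted(groups)
    }


# Hypothetical population shares and a hypothetical sample (illustrative only).
population_shares = {"A": 0.5, "B": 0.3, "C": 0.2}
sample_labels = ["A"] * 70 + ["B"] * 25 + ["C"] * 5

gaps = representation_gaps(sample_labels, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

In this made-up example, subgroup A is over-represented and subgroup C is under-represented relative to the assumed population. Matching proportions is only one possible reading of "representative"; a sample can match these shares and still be too small within a subgroup to support reliable inference.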

artificial intelligence, machine learning, representativity

Country:
- North America
  - Puerto Rico (0.04)
  - United States
    - California (0.06)
    - Massachusetts (0.04)
    - New York (0.04)
    - Alaska (0.04)
    - Utah (0.04)
    - Illinois (0.04)
- Europe
  - France (0.04)
  - Norway (0.04)
  - Germany (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Denmark > Capital Region
    - Kongens Lyngby (0.04)

Genre:
- Research Report > Experimental Study (1.00)
- Overview (1.00)

Industry:
- Health & Medicine (1.00)
- Information Technology (0.92)
- Government > Regional Government
  - North America Government > United States Government (0.87)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Statistical Learning (0.93)
  - Performance Analysis > Accuracy (0.68)
