Data Representativity for Machine Learning and AI Systems

Clemmensen, Line H., Kjærsgaard, Rune D.

arXiv.org Artificial Intelligence 

Automated decision frameworks have produced various unwanted consequences as a result of biased data [11, 66-68, 84, 86, 109]. These systems are often trained on samples (datasets) drawn from a larger population. Biased results can arise if the sample does not accurately represent the target population, or if subgroups are insufficiently represented in the data. While the literature on data bias in machine learning and artificial intelligence (AI) systems is rich [99], there is only limited work on the connections between data representativity and AI systems. Terms like "representative sample" are used ubiquitously in the literature, often without specifying what this representativity means or what effects it has. This paper surveys and analyzes data representativity in the scientific literature on machine learning and AI systems, investigating how different notions of representativity are used and how adhering to each notion affects the validity of inference. The term "representative sample" is overloaded, and a generally accepted definition of what constitutes a representative sample (subset of observations) is hard to find in the literature. A few examples demonstrate that at least a couple of definitions of "representative sample" exist. The most general definition we found is from D'Excelle (2014) and states: "'Representative sampling' is a type of statistical sampling that allows us to use data from a sample to make conclusions that are representative for the population from which the sample is taken."
