Learning from Disagreement: A Survey

Uma, Alexandra N.; Fornaciari, Tommaso; Hovy, Dirk; Paun, Silviu; Plank, Barbara; Poesio, Massimo

Journal of Artificial Intelligence Research, Dec-27-2021

Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. 
All datasets and models employed in this paper are freely available as supplementary materials.
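To make the distinction between aggregated (hard) labels and soft labels concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a soft label is the normalised distribution of annotator votes for an item, that aggregation is a simple majority vote, and that cross-entropy against the vote distribution is one possible soft evaluation metric. The class names, judgments, and model probabilities are hypothetical values chosen for illustration.

```python
from collections import Counter
import math


def soft_label(judgments, classes):
    """Normalise raw annotator votes into a probability distribution over classes."""
    counts = Counter(judgments)
    total = len(judgments)
    return [counts[c] / total for c in classes]


def majority_label(judgments):
    """Aggregate votes into a single hard label (ties go to the label seen first)."""
    return Counter(judgments).most_common(1)[0][0]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Hypothetical item: five annotators tag the same token.
classes = ["NOUN", "VERB", "ADJ"]
judgments = ["NOUN", "NOUN", "VERB", "NOUN", "ADJ"]

soft = soft_label(judgments, classes)   # vote distribution over classes
hard = majority_label(judgments)        # single aggregated label
one_hot = [1.0 if c == hard else 0.0 for c in classes]

# Hypothetical model output for the same item.
model_probs = [0.7, 0.2, 0.1]

# Hard evaluation: does the model's top class match the aggregated label?
predicted = classes[max(range(len(classes)), key=model_probs.__getitem__)]
hard_correct = predicted == hard

# Soft evaluation: how close is the model's distribution to the annotators'?
soft_loss = cross_entropy(soft, model_probs)
hard_loss = cross_entropy(one_hot, model_probs)
```

Under these assumptions, the same prediction counts as fully correct in the hard evaluation, while the soft loss still penalises it for putting too little probability mass on the minority readings. This is one way to see why the choice of evaluation can change the ranking of training methods.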

dataset, disagreement, gold label

