In Search of Ambiguity: A Three-Stage Workflow Design to Clarify Annotation Guidelines for Crowd Workers
Pradhan, Vivek Krishna, Schaekermann, Mike, Lease, Matthew
–arXiv.org Artificial Intelligence
While crowdsourcing now enables labeled data to be obtained more quickly, cheaply, and easily than ever before (Snow et al., 2008; Alonso, 2015; Sorokin and Forsyth, 2008), ensuring data quality remains something of an art, a challenge, and a perpetual risk. Consider a typical workflow for annotating data on Amazon Mechanical Turk (MTurk): a requester designs an annotation task, asks multiple workers to complete it, and then post-processes labels to induce final consensus labels. Because the annotation work itself is largely opaque, with only submitted labels being observable, the requester typically has little insight into what, if any, problems workers encounter during annotation. While statistical aggregation (Sheshadri and Lease, 2013; Hung et al., 2013; Zheng et al., 2017) and multi-pass iterative refinement (Little et al., 2010a; Goto et al., 2016) methods can be employed to further improve initial labels, there are limits to what can be achieved by post-hoc refinement following label collection. If initial labels are poor because many workers were confused by incomplete, unclear, or ambiguous task instructions, there is a significant risk of "garbage in equals garbage out" (Vidgen and Derczynski, 2020). In contrast, consider a more traditional annotation workflow involving trusted annotators, such as that practiced by the Linguistic Data Consortium (LDC) (Griffitt and Strassel, 2016).
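The post-processing step described above, inducing a final consensus label from multiple workers' labels, is in its simplest form just per-item majority voting (more sophisticated statistical aggregation methods are surveyed in the works cited). A minimal sketch, with hypothetical item and label names:

```python
from collections import Counter

def aggregate_majority(worker_labels):
    """Collapse per-item worker labels into consensus labels by majority vote.

    worker_labels: dict mapping item id -> list of labels from different workers.
    Returns a dict mapping item id -> most common label (ties broken arbitrarily).
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in worker_labels.items()}

# Example: three workers each label two items.
labels = {
    "doc1": ["spam", "spam", "ham"],
    "doc2": ["ham", "ham", "spam"],
}
print(aggregate_majority(labels))  # {'doc1': 'spam', 'doc2': 'ham'}
```

Note that majority voting illustrates the "garbage in, garbage out" risk directly: if ambiguous instructions lead most workers to the same wrong label, the vote confidently ratifies the error, which is why the paper targets guideline clarity rather than post-hoc aggregation.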
Dec-4-2021