Predicting Lexical Complexity in English Texts: The Complex 2.0 Dataset
Matthew Shardlow, Richard Evans, Marcos Zampieri
arXiv.org Artificial Intelligence
Identifying words that may cause difficulty for a reader is an essential step in most lexical text simplification systems, carried out prior to lexical substitution, and can also be used to assess the readability of a text. This task is commonly referred to as Complex Word Identification (CWI) and is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper we analyze previous work on this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use it to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We find that a Likert-scale annotation protocol provides an objective setting that is superior to a binary annotation protocol for identifying the complexity of words. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
Nov-3-2022
- Country:
  - Oceania > Australia (0.04)
  - North America > United States
    - Maryland > Baltimore (0.14)
    - Colorado (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - New Jersey > Bergen County
      - Mahwah (0.04)
    - Rhode Island > Providence County
      - Providence (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Oregon > Multnomah County
      - Portland (0.04)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Washington > King County
      - Seattle (0.04)
    - California
      - San Diego County > San Diego (0.05)
      - Santa Clara County > Palo Alto (0.04)
    - New York > New York County
      - New York City (0.04)
  - Europe
    - Netherlands (0.04)
    - Greece (0.04)
    - Bulgaria
      - Sofia City Province > Sofia (0.04)
      - Varna Province > Varna (0.04)
    - Iceland > Capital Region
      - Reykjavik (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - United Kingdom > England
      - West Midlands > Wolverhampton (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia > Taiwan
    - Taiwan Province > Taipei (0.04)
  - Africa > Middle East
    - Egypt (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Health & Medicine > Therapeutic Area (0.46)
- Technology: