An Inclusive Notion of Text
Kuznetsov, Ilia, Gurevych, Iryna
arXiv.org Artificial Intelligence
Natural language processing (NLP) researchers develop models of grammar, meaning and communication based on written text. Due to task and data differences, what is considered text can vary substantially across studies. A conceptual framework for systematically capturing these differences is lacking. We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. Towards that goal, we propose common terminology to discuss the production and transformation of textual data, and introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling. We apply this taxonomy to survey existing work that extends the notion of text beyond the conservative language-centered view. We outline key desiderata and challenges of the emerging inclusive approach to text in NLP, and suggest community-level reporting as a crucial next step to consolidate the discussion.
May-17-2023
- Country:
  - South America (0.04)
  - Oceania > Australia
    - Victoria > Melbourne (0.04)
    - New South Wales > Sydney (0.04)
  - North America
    - Dominican Republic (0.04)
    - Central America (0.04)
    - United States
      - Washington > King County
        - Seattle (0.04)
      - New York > New York County
        - New York City (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.15)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Colorado > Boulder County
        - Boulder (0.04)
  - Europe
    - Bulgaria (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Germany > Hesse
      - Darmstadt Region > Darmstadt (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - Singapore (0.04)
    - India (0.04)
    - China > Hong Kong (0.04)
    - Middle East
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
  - Africa > Middle East
    - Morocco (0.04)
- Genre:
  - Research Report (0.82)
  - Overview (0.68)
- Technology: