data2lang2vec: Data Driven Typological Features Completion

Amirzadeh, Hamidreza, Jafari, Sadegh, Harju, Anika, van der Goot, Rob

Sep-25-2024–arXiv.org Artificial Intelligence

Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features; we instead propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.
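
To make the notions of "coverage" and "predicting missing values" concrete, here is a minimal Python sketch. It is not the authors' method and does not use the lang2vec API: the toy matrix, the language names, and the helper functions `coverage` and `fill_majority` are all hypothetical. It only illustrates a naive baseline that fills each missing feature with the most common value observed in other languages, assuming binary features stored with `None` for missing entries.

```python
from collections import Counter
from typing import Dict, List, Optional

Matrix = Dict[str, List[Optional[int]]]

# Toy typology matrix (hypothetical values): language -> binary features,
# with None marking a missing entry.
FEATURES: Matrix = {
    "lang_a": [1, 0, None, 1],
    "lang_b": [1, None, None, 0],
    "lang_c": [None, 0, 1, 1],
}


def coverage(matrix: Matrix) -> float:
    """Return the fraction of cells that hold a known (non-None) value."""
    cells = [value for row in matrix.values() for value in row]
    return sum(value is not None for value in cells) / len(cells)


def fill_majority(matrix: Matrix) -> Matrix:
    """Fill each missing cell with the most common known value of that feature.

    A feature with no known value in any language stays None.
    """
    n_features = len(next(iter(matrix.values())))
    majority: List[Optional[int]] = []
    for j in range(n_features):
        known = [row[j] for row in matrix.values() if row[j] is not None]
        majority.append(Counter(known).most_common(1)[0][0] if known else None)
    return {
        lang: [majority[j] if value is None else value for j, value in enumerate(row)]
        for lang, row in matrix.items()
    }


filled = fill_majority(FEATURES)
```

In this toy matrix, 8 of 12 cells are known before filling and all 12 afterwards. A data-driven approach of the kind the abstract describes would replace the majority vote with a classifier trained on features derived from text, which this sketch does not attempt.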

classifier, computational linguistic, target feature, (13 more...)

Country:
- North America > United States
  - Hawaii (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
- Europe
  - Spain > Valencian Community
    - Valencia Province > Valencia (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
- Asia
  - Indonesia > Bali (0.04)
  - Middle East
    - Iran (0.05)
    - Qatar > Ad-Dawhah
      - Doha (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Statistical Learning (1.00)
  - Natural Language > Grammars & Parsing (0.91)
