ThaiCoref: Thai Coreference Resolution Dataset
Trakuekul, Pontakorn; Leong, Wei Qi; Polpanumas, Charin; Sawatphol, Jitkapat; Tjhi, William Chandra; Rutherford, Attapol T.
arXiv.org Artificial Intelligence
While coreference resolution is a well-established research area in Natural Language Processing (NLP), research on the Thai language remains limited due to the lack of large annotated corpora. In this work, we introduce ThaiCoref, a dataset for Thai coreference resolution. Our dataset comprises 777,271 tokens, 44,082 mentions, and 10,429 entities across four text genres: university essays, newspapers, speeches, and Wikipedia. Our annotation scheme builds on the OntoNotes benchmark, with adjustments to address Thai-specific phenomena. Using ThaiCoref, we train models that employ a multilingual encoder and cross-lingual transfer techniques, achieving a best F1 score of 67.88% on the test set. Error analysis reveals challenges posed by Thai's unique linguistic features. To benefit the NLP community, we make the dataset and the model publicly available at http://www.github.com/nlp-chula/thai-coref.
Jun-9-2024
- Country:
  - Asia
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Slovenia (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
    - Switzerland (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Italy > Tuscany
      - Florence (0.04)
  - North America
    - Canada
      - British Columbia > Metro Vancouver Regional District
        - Vancouver (0.04)
      - Ontario > Toronto (0.04)
    - Dominican Republic (0.04)
    - United States
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Maryland > Howard County
        - Columbia (0.04)
      - Massachusetts (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Virginia > Fairfax County
        - Fairfax (0.04)
      - Washington > King County
        - Seattle (0.04)
  - Oceania > Australia (0.04)
  - South America > Brazil (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Government (0.68)
  - Media > News (0.48)
- Technology: