ThaiCoref: Thai Coreference Resolution Dataset

Trakuekul, Pontakorn; Leong, Wei Qi; Polpanumas, Charin; Sawatphol, Jitkapat; Tjhi, William Chandra; Rutherford, Attapol T.

Jun-9-2024–arXiv.org Artificial Intelligence

While coreference resolution is a well-established research area in Natural Language Processing (NLP), research on the Thai language remains limited due to the lack of large annotated corpora. In this work, we introduce ThaiCoref, a dataset for Thai coreference resolution. Our dataset comprises 777,271 tokens, 44,082 mentions, and 10,429 entities across four text genres: university essays, newspapers, speeches, and Wikipedia. Our annotation scheme builds on the OntoNotes benchmark, with adjustments to address Thai-specific phenomena. Using ThaiCoref, we train models that employ a multilingual encoder and cross-lingual transfer techniques, achieving a best F1 score of 67.88% on the test set. Error analysis reveals challenges posed by Thai's unique linguistic features. To benefit the NLP community, we make the dataset and the model publicly available at http://www.github.com/nlp-chula/thai-coref.
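To make the reported counts concrete, the following minimal sketch shows one way mentions and entities might be represented, and derives two density ratios from the figures above. The `Mention` class, the `corpus_stats` helper, and the English toy sentence are hypothetical and chosen for readability; they are not the dataset's actual format or the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A span of tokens that refers to some entity (hypothetical layout)."""
    start: int  # index of the first token, inclusive
    end: int    # index of the last token, inclusive


# Hypothetical toy document, in English for readability.
tokens = ["Somchai", "said", "he", "would", "bring", "his", "book"]

# Each entity is a cluster of mentions that refer to the same thing.
entities = [
    [Mention(0, 0), Mention(2, 2), Mention(5, 5)],  # Somchai / he / his
]


def corpus_stats(n_tokens: int, n_mentions: int, n_entities: int) -> dict:
    """Return simple density ratios for a coreference corpus."""
    return {
        "mentions_per_entity": n_mentions / n_entities,
        "tokens_per_mention": n_tokens / n_mentions,
    }


# Ratios for the toy document.
toy = corpus_stats(len(tokens), sum(len(e) for e in entities), len(entities))

# Ratios for the counts reported in the abstract.
reported = corpus_stats(777_271, 44_082, 10_429)

print(toy)
print({k: round(v, 2) for k, v in reported.items()})
```

With the reported counts, this works out to roughly 4.23 mentions per entity and about 17.63 tokens per mention.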
