A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages

Cardenas, Ronald, Lin, Ying, Ji, Heng, May, Jonathan

Apr-10-2019–arXiv.org Artificial Intelligence

Unsupervised part-of-speech (POS) tagging is often framed as a clustering problem, but practical taggers need to ground their clusters as well. Grounding generally requires reference labeled data, a luxury a low-resource language might not have. In this work, we describe an approach for low-resource unsupervised POS tagging that yields fully grounded output and requires no labeled training data. We find the classic method of Brown et al. (1992) clusters well in our use case and employ a decipherment-based approach to grounding. This approach presumes a sequence of cluster IDs is a 'ciphertext' and seeks a POS tag-to-cluster ID mapping that will reveal the POS sequence. We show intrinsically that, despite the difficulty of the task, we obtain reasonable performance across a variety of languages. We also show extrinsically that incorporating our POS tagger into a name tagger leads to state-of-the-art tagging performance in Sinhalese and Kinyarwanda, two languages with nearly no labeled POS data available. We further demonstrate our tagger's utility by incorporating it into a true 'zero-resource' variant of the MALOPA (Ammar et al., 2016) dependency parser model that removes the current reliance on multilingual resources and gold POS tags.

Figure 1: Overview of our approach to grounded POS tagging. We use an unsupervised clustering method (Section 3.2) then reduce and ground the clusters using a decipherment approach informed by POS tag sequence data from many languages (Section 3.3).
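The decipherment framing described above can be illustrated with a toy sketch. This is not the authors' implementation: it replaces whatever search or training procedure the paper uses with exhaustive enumeration, and the tag set, bigram probabilities, cluster sequences, and the names `sequence_score` and `ground_clusters` are all hypothetical values chosen for illustration. It treats each cluster-ID sequence as a 'ciphertext' and picks the cluster-to-tag mapping under which the decoded tag sequences score highest under a tag bigram model (which, in the setting the abstract describes, would be estimated from POS tag sequences of other languages).

```python
import itertools
import math

# Hypothetical tag bigram model P(next_tag | prev_tag), stored as log-probabilities.
# In a real setting this would be estimated from tagged text in other languages.
BIGRAM_LOGP = {
    ("<s>", "DET"): math.log(0.8),
    ("<s>", "NOUN"): math.log(0.15),
    ("DET", "NOUN"): math.log(0.9),
    ("NOUN", "VERB"): math.log(0.6),
    ("VERB", "DET"): math.log(0.7),
    ("NOUN", "</s>"): math.log(0.3),
    ("VERB", "</s>"): math.log(0.2),
}
FLOOR = math.log(1e-6)  # assumed penalty for bigrams missing from the model


def sequence_score(tags, bigram_logp, floor=FLOOR):
    """Log-probability of a tag sequence under the bigram model."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return sum(bigram_logp.get(pair, floor) for pair in zip(padded, padded[1:]))


def ground_clusters(cluster_seqs, tagset, bigram_logp):
    """Brute-force search for the cluster-to-tag mapping that maximizes
    the total bigram score of the decoded sequences.

    Several clusters may map to the same tag. The search is exponential
    in the number of clusters, so it is only usable on toy inputs.
    """
    clusters = sorted({c for seq in cluster_seqs for c in seq})
    best_mapping, best_score = None, -math.inf
    for assignment in itertools.product(tagset, repeat=len(clusters)):
        mapping = dict(zip(clusters, assignment))
        total = sum(
            sequence_score([mapping[c] for c in seq], bigram_logp)
            for seq in cluster_seqs
        )
        if total > best_score:
            best_mapping, best_score = mapping, total
    return best_mapping


# Toy 'ciphertext': sentences given only as cluster IDs.
cluster_seqs = [[0, 1, 2, 0, 1], [0, 1, 2]]
mapping = ground_clusters(cluster_seqs, ("DET", "NOUN", "VERB"), BIGRAM_LOGP)
decoded = [[mapping[c] for c in seq] for seq in cluster_seqs]
```

On this toy input only one mapping avoids every unseen bigram, so the search recovers it. With realistic numbers of clusters the enumeration is infeasible, which is why an iterative estimation procedure would be needed in practice.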




Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Grammars & Parsing (1.00)
  - Machine Learning > Statistical Learning
    - Clustering (1.00)
