Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora

Jul-26-2018, arXiv.org Machine Learning

Named Entity Recognition (NER) is a crucial part of various Natural Language Processing (NLP) tasks such as entity linking, relation extraction, machine reading and, ultimately, Question Answering (QA). With the recent rise of neural networks, much emphasis has been put on high-resource languages like English or Chinese, leading to fast advancements in many foundational tasks, in particular NER, which in many areas reaches near-human performance for these languages [1], [2]. However, for lower-resource languages like German, neural NER did not attract similar attention from the deep learning community, leaving performance lower by a margin of up to 11% F-score. In this paper, we investigate the reasons for this gap and take steps towards closing it. Using German as an example, we bridge the current gap between the performance of neural NER for different languages and bring the performance to a new state of the art. We report evidence that the inferior quality of German text data and its small size are the major reasons for the observed lack of progress. To tackle this problem, we use a larger corpus for training the foundational word embeddings, namely Leipzig40 [3] (including the whole German Wikipedia up to 2016) combined with the WMT 2010 German monolingual training data [4], and contrast its use with the COW corpus [5], the largest collection of German texts extracted from web documents with over 617 Mio.
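The performance gap above is stated in F-score. As a minimal sketch of how such a score might be computed, the snippet below assumes exact-match, entity-level evaluation (a predicted entity counts only if both its span and its label match the gold annotation). The abstract does not state the evaluation scheme, so this is an assumption; the `Entity` tuple layout and the `entity_f1` function name are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match entity-level F1 (assumed scheme, not taken from the paper)."""
    # A prediction is correct only if span and label both match a gold entity.
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data: the second prediction has the right span but the wrong label.
gold = {(0, 1, "PER"), (4, 4, "LOC"), (7, 8, "ORG")}
pred = {(0, 1, "PER"), (4, 4, "ORG"), (7, 8, "ORG")}
score = entity_f1(gold, pred)
```

In this toy case two of the three predictions match, so precision and recall are both 2/3 and the F-score is 2/3 as well. A margin of 11% F-score corresponds to a difference of 0.11 on this scale.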

artificial intelligence, machine learning, natural language

Country:
- Europe
  - Germany
    - Saxony > Leipzig (0.05)
    - Saarland > Saarbrücken (0.04)
    - Hesse > Darmstadt Region
      - Frankfurt (0.04)
    - Baden-Württemberg > Tübingen Region
      - Tübingen (0.04)
  - France > Île-de-France
    - Paris > Paris (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
