Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora
Sajawel Ahmed, Alexander Mehler
Named Entity Recognition (NER) is a crucial part of various Natural Language Processing (NLP) tasks such as entity linking, relation extraction, machine reading and, ultimately, Question Answering (QA). With the recent rise of neural networks, much emphasis has been put on high-resource languages like English or Chinese, leading to fast advances in many foundational tasks, in particular NER, which in many areas reaches near-human performance for these languages [1], [2]. Neural NER for lower-resource languages like German, however, has not attracted similar attention from the deep learning community, and its performance lags behind by a margin of up to 11% F-score. In this paper, we look for the reasons behind this gap and take steps towards closing it. Using German as an example, we bridge the current gap between the performance of neural NER for different languages and bring the performance to a new state of the art. We report evidence that the inferior quality of German text data and its small size are the major reasons for the observed lack of progress. To tackle this problem, we use a larger corpus for training the foundational word embeddings, namely Leipzig40 [3] (including the whole German Wikipedia up to 2016) combined with the WMT 2010 German monolingual training data [4], and contrast its use with the COW corpus [5], the largest collection of German texts extracted from web documents with over 617 Mio.
Jul-26-2018