Building a Wikipedia Text Corpus for Natural Language Processing


One of the first things required for natural language processing (NLP) tasks is a corpus. In linguistics and NLP, a corpus (literally Latin for "body") is a collection of texts. Such a collection may contain texts in a single language or span multiple languages; there are numerous reasons why multilingual corpora (the plural of corpus) may be useful. Corpora may also consist of themed texts (historical, Biblical, etc.). Corpora are generally used for statistical linguistic analysis and hypothesis testing.

Learning Slowly To Learn Better: Curriculum Learning for Legal Ontology Population

AAAI Conferences

In this paper, we present an ontology population approach for legal ontologies. We exploit Wikipedia as a source of manually annotated examples of legal entities. We align YAGO, a Wikipedia-based ontology, and LKIF, an ontology specifically designed for the legal domain. Through this alignment, we can effectively populate the LKIF ontology, with the aim of obtaining examples to train a Named Entity Recognizer and Classifier to be used for finding and classifying entities in legal texts. Since examples of annotated data in the legal domain are very few, we apply a machine learning strategy called curriculum learning, which aims to overcome problems of overfitting by learning increasingly more complex concepts. We compare the performance of this method at identifying Named Entities against batch learning as well as two other baselines. Results are satisfying and foster further research in this direction.
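
The idea of learning increasingly more complex concepts can be sketched as a sequence of training stages whose labels become progressively finer. The snippet below is only an illustration under assumptions: the label hierarchy, the class names, and the function `curriculum_stages` are hypothetical and are not taken from the paper or from LKIF or YAGO.

```python
from typing import Dict, List, Tuple

# Hypothetical label hierarchy (not the actual LKIF classes): each
# fine-grained label maps to its path from most general to most specific.
HIERARCHY: Dict[str, List[str]] = {
    "Judge": ["Entity", "Person", "Judge"],
    "Court": ["Entity", "Organization", "Court"],
    "Statute": ["Entity", "Document", "Statute"],
}

Example = Tuple[str, str]  # (mention text, fine-grained label)


def curriculum_stages(
    examples: List[Example], hierarchy: Dict[str, List[str]]
) -> List[List[Example]]:
    """Relabel the same examples at each level of the hierarchy.

    Stage 0 uses the most general labels and the last stage uses the
    original fine-grained labels, so a model trained stage by stage
    sees simpler distinctions first.
    """
    depth = max(len(path) for path in hierarchy.values())
    stages = []
    for level in range(depth):
        stage = []
        for text, label in examples:
            path = hierarchy[label]
            # Paths shorter than the current level stay at their finest label.
            stage.append((text, path[min(level, len(path) - 1)]))
        stages.append(stage)
    return stages


examples = [("Ruth Bader Ginsburg", "Judge"), ("Supreme Court", "Court")]
stages = curriculum_stages(examples, HIERARCHY)
for level, stage in enumerate(stages):
    # A real pipeline would continue training the same model on each stage.
    print(level, stage)
```

In such a setup the same model would be trained on each stage in turn, carrying its weights forward, whereas batch learning would train once on the final, fine-grained labels.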

Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources

Artificial Intelligence

Daniel Specht Menezes, Departamento de Informática, PUC-Rio, Rio de Janeiro, Brazil. Abstract: With the recent progress in machine learning, boosted by techniques such as deep learning, many tasks can be successfully solved once a large enough dataset is available for training. Nonetheless, human-annotated datasets are often expensive to produce, especially when labels are fine-grained, as is the case for Named Entity Recognition (NER), a task that operates with labels at the word level. In this paper, we propose a method to automatically generate labeled datasets for NER from public data sources by exploiting links and structured data from DBpedia and Wikipedia. Due to the massive size of these data sources, the resulting dataset, SESAME, is composed of millions of labeled sentences. We detail the method to generate the dataset, report relevant statistics, and design a baseline using a neural network, showing that our dataset helps build better NER predictors.
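
As a rough illustration of how links can yield word-level labels, the sketch below turns a sentence containing wiki-style links into BIO-tagged tokens. It is a minimal sketch under assumptions, not the authors' pipeline: the function `label_sentence`, the `page_types` dictionary (standing in for an entity-type lookup such as one derived from DBpedia), the tag names, and the whitespace tokenization are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches wiki-style links: [[Target]] or [[Target|anchor text]].
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def label_sentence(
    wikitext: str, page_types: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (token, BIO tag) pairs for one sentence.

    page_types is a hypothetical stand-in for a structured-data lookup
    that maps a linked page title to an entity type.
    """
    labeled: List[Tuple[str, str]] = []
    pos = 0
    for match in LINK.finditer(wikitext):
        # Text before the link carries no entity label.
        for token in wikitext[pos:match.start()].split():
            labeled.append((token, "O"))
        target = match.group(1).strip()
        anchor = (match.group(2) or match.group(1)).strip()
        entity_type = page_types.get(target)
        for i, token in enumerate(anchor.split()):
            if entity_type is None:
                # Link target has no known type: treat as plain text.
                labeled.append((token, "O"))
            else:
                prefix = "B-" if i == 0 else "I-"
                labeled.append((token, prefix + entity_type))
        pos = match.end()
    # Remaining text after the last link.
    for token in wikitext[pos:].split():
        labeled.append((token, "O"))
    return labeled


sentence = "[[Rio de Janeiro]] is in [[Brazil|the country]] ."
tags = label_sentence(sentence, {"Rio de Janeiro": "LOC"})
print(tags)
```

Links whose target has no known type are left unlabeled here, which is one simple way to avoid guessing; a real system would need proper tokenization and a policy for unlinked mentions of the same entity.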

Introducing Hypertension FACT: Vital Sign Ontology Annotations in the Florida Annotated Corpus for Translational Science

AAAI Conferences

We introduce the Florida Annotated Corpus for Translational Science (FACTS), which currently consists of 20 case reports about hypertension that we annotated with Vital Sign Ontology (VSO) classes. We describe the annotation method, the annotation results, the interannotator agreement measure, and the availability of the corpus and of supporting tools for annotating corpora with OWL ontologies. We also discuss issues and limitations of VSO for annotating vital sign data in case reports.

Publishing Math Lecture Notes as Linked Data

Artificial Intelligence

We mark up a corpus of LaTeX lecture notes semantically and expose them as Linked Data in XHTML+MathML+RDFa. Our application makes the resulting documents interactively browsable for students. Our ontology helps to answer queries from students and lecturers, and paves the path towards an integration of our corpus with external sites.