Building a Wikipedia Text Corpus for Natural Language Processing


One of the first things required for natural language processing (NLP) tasks is a corpus. In linguistics and NLP, corpus (literally Latin for body) refers to a collection of texts. Such a collection may consist of texts in a single language or span multiple languages; multilingual corpora (the plural of corpus) are useful for numerous reasons. Corpora may also consist of themed texts (historical, Biblical, etc.). Corpora are generally used for statistical linguistic analysis and hypothesis testing.
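As a minimal sketch of what statistical analysis over a corpus can look like, the Python snippet below counts word frequencies across a tiny in-memory corpus. The three sample sentences and the helper names `tokenize` and `word_frequencies` are assumptions made up for illustration; a real corpus, such as one built from Wikipedia, would be far larger and would likely need more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical toy corpus: three short "documents" invented for illustration.
corpus = [
    "The corpus is a collection of texts.",
    "Corpora may span one language or several languages.",
    "A corpus supports statistical analysis of language.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


frequencies = word_frequencies(corpus)

# Show the most frequent tokens in the toy corpus.
print(frequencies.most_common(5))
```

Counting tokens with `collections.Counter` keeps the sketch dependency-free; the same idea carries over to larger corpora, where the documents would typically be streamed from disk rather than held in a list.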

Learning Slowly To Learn Better: Curriculum Learning for Legal Ontology Population

AAAI Conferences

In this paper, we present an ontology population approach for legal ontologies. We exploit Wikipedia as a source of manually annotated examples of legal entities. We align YAGO, a Wikipedia-based ontology, and LKIF, an ontology specifically designed for the legal domain. Through this alignment, we can effectively populate the LKIF ontology, with the aim of obtaining examples to train a Named Entity Recognizer and Classifier to be used for finding and classifying entities in legal texts. Since examples of annotated data in the legal domain are very few, we apply a machine learning strategy called curriculum learning, which aims to overcome problems of overfitting by learning increasingly complex concepts. We compare the performance of this method in identifying Named Entities with batch learning as well as two other baselines. Results are satisfying and foster further research in this direction.

Publishing Math Lecture Notes as Linked Data

Artificial Intelligence

We mark up a corpus of LaTeX lecture notes semantically and expose them as Linked Data in XHTML+MathML+RDFa. Our application makes the resulting documents interactively browsable for students. Our ontology helps to answer queries from students and lecturers, and paves the path towards an integration of our corpus with external sites.

Introducing Hypertension FACT: Vital Sign Ontology Annotations in the Florida Annotated Corpus for Translational Science

AAAI Conferences

We introduce the Florida Annotated Corpus for Translational Science (FACTS), which currently consists of 20 case reports about hypertension that we annotated with Vital Sign Ontology (VSO) classes. We describe the annotation method, the annotation results, the interannotator agreement measure, and the availability of the corpus and supporting tools for annotating corpora with OWL ontologies. We also discuss issues and limitations of VSO for annotating vital sign data in case reports.