Grammars & Parsing
Learning to Predict from Textual Data
Radinsky, K., Davidovich, S., Markovitch, S.
Given a current news event, we tackle the problem of generating plausible predictions of future events it might cause. We present a new methodology for modeling and predicting such future news events using machine learning and data mining techniques. Our Pundit algorithm generalizes examples of causality pairs to infer a causality predictor. To obtain precisely labeled causality examples, we mine 150 years of news articles and apply semantic natural language modeling techniques to headlines containing certain predefined causality patterns. For generalization, the model uses a vast number of world knowledge ontologies. Empirical evaluation on real news articles shows that our Pundit algorithm performs as well as non-expert humans.
Learning Grounded Language through Situated Interactive Instruction
Mohan, Shiwali (University of Michigan) | Mininger, Aaron (University of Michigan) | Kirk, James (University of Michigan) | Laird, John E. (University of Michigan)
We present an approach for learning grounded language from mixed-initiative human-robot interaction. Prior work on learning from human instruction has concentrated on acquisition of task-execution knowledge from domain-specific language. In this work, we demonstrate acquisition of linguistic, semantic, perceptual, and procedural knowledge from mixed-initiative, natural language dialog. Our approach has been instantiated in a cognitive architecture, Soar, and has been deployed on a table-top robotic arm capable of picking up small objects. A preliminary analysis verifies the ability of the robot to acquire diverse knowledge from human-robot interaction.
Global and Local Approach of Part-of-Speech Tagging for Large Corpora
Yu, Shi (University of Chicago) | Grossman, Robert (University of Chicago) | Rzhetsky, Andrey (University of Chicago)
We present Global-Local POS tagging, a framework to train generative stochastic Part-of-Speech models on large corpora. Global Taggers offer several advantages over their counter parts trained on small, curated corpus, including the ability to automatically extend and update their models to new text. Global Taggers also avoid a fundamental limitation of current models, whose performance heavily relies on curated text with manually assigned labels. We illustrate our approach by training several Global Taggers, implemented with generative stochastic models, on two large corpora using high performance computing architecture. We further demonstrate that global taggers can be improved by incorporating models trained on curated text, called Local Taggers, for better tagging performance derived from specific topics.
Notes about the OntoGene Pipeline
Rinaldi, Fabio (University of Zurich) | Clematide, Simon (University of Zurich) | Schneider, Gerold (University of Zurich) | Grigonyte, Gintare (University of Zurich)
In this paper we describe the architecture of the OntoGene Relation mining pipeline and some of its recent applications. With this research overview paper we intend to provide a contribution towards the recently started discussion towards standards for information extraction architectures in the biomedical domain. Our approach delivers domain entities mentioned in each input document, as well as candidate relationships, both ranked according to a confidency score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation.
Automatic Formalization of Clinical Practice Guidelines
Gerber, Matthew (University of Virginia) | Brown, Donald (University of Virginia) | Harrison, James (University of Virginia)
Current efforts aim to incorporate knowledge from clinical practice guidelines (CPGs) into computer systems using sophisticated interchange formats. Due to their complexity, such formats require expensive manual formalization work. This paper presents a preliminary study of using natural language processing (NLP) to automatically formalize CPG recommendations. We developed a CPG representation using concepts from the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED–CT), and manually applied this representation to a sample of CPG recommendations that is representative of multiple medical domains and recommendation types. Using this resource, we trained and evaluated a supervised classification model that formalizes new CPG recommendations according to the SNOMED–CT representation, achieving a precision of 75% and recall of 42% (F1 = 54%). We have identified two important lines of future investigation: (1) feature engineering to address the unique linguistic properties of CPG recommendations, and (2) alternative model formulations that are more robust to processing errors. A third line of investigation – creating additional training data for the NLP model – is shown to be of little utility.
A Tutorial on Dual Decomposition and Lagrangian Relaxation for Inference in Natural Language Processing
Dual decomposition, and more generally Lagrangian relaxation, is a classical method for combinatorial optimization; it has recently been applied to several inference problems in natural language processing (NLP). This tutorial gives an overview of the technique. We describe example algorithms, describe formal guarantees for the method, and describe practical issues in implementing the algorithms. While our examples are predominantly drawn from the NLP literature, the material should be of general relevance to inference problems in machine learning. A central theme of this tutorial is that Lagrangian relaxation is naturally applied in conjunction with a broad class of combinatorial algorithms, allowing inference in models that go significantly beyond previous work on Lagrangian relaxation for inference in graphical models.
Opinion Mining for Relating Subjective Expressions and Annual Earnings in US Financial Statements
Chen, Chien-Liang, Liu, Chao-Lin, Chang, Yuan-Chen, Tsai, Hsiang-Ping
Financial statements contain quantitative information and manager's subjective evaluation of firm's financial status. Using information released in U.S. 10-K filings. Both qualitative and quantitative appraisals are crucial for quality financial decisions. To extract such opinioned statements from the reports, we built tagging models based on the conditional random field (CRF) techniques, considering a variety of combinations of linguistic factors including morphology, orthography, predicate-argument structure, syntax, and simple semantics. Our results show that the CRF models are reasonably effective to find opinion holders in experiments when we adopted the popular MPQA corpus for training and testing. The contribution of our paper is to identify opinion patterns in multiword expressions (MWEs) forms rather than in single word forms. We find that the managers of corporations attempt to use more optimistic words to obfuscate negative financial performance and to accentuate the positive financial performance. Our results also show that decreasing earnings were often accompanied by ambiguous and mild statements in the reporting year and that increasing earnings were stated in assertive and positive way.
Towards Unsupervised Learning of Temporal Relations between Events
Mirroshandel, S.A., Ghassem-Sani, G.
Automatic extraction of temporal relations between event pairs is an important task for several natural language processing applications such as Question Answering, Information Extraction, and Summarization. Since most existing methods are supervised and require large corpora, which for many languages do not exist, we have concentrated our efforts to reduce the need for annotated data as much as possible. This paper presents two different algorithms towards this goal. The first algorithm is a weakly supervised machine learning approach for classification of temporal relations between events. In the first stage, the algorithm learns a general classifier from an annotated corpus. Then, inspired by the hypothesis of "one type of temporal relation per discourse'', it extracts useful information from a cluster of topically related documents. We show that by combining the global information of such a cluster with local decisions of a general classifier, a bootstrapping cross-document classifier can be built to extract temporal relations between events. Our experiments show that without any additional annotated data, the accuracy of the proposed algorithm is higher than that of several previous successful systems. The second proposed method for temporal relation extraction is based on the expectation maximization (EM) algorithm. Within EM, we used different techniques such as a greedy best-first search and integer linear programming for temporal inconsistency removal. We think that the experimental results of our EM based algorithm, as a first step toward a fully unsupervised temporal relation extraction method, is encouraging.
A practical approach to language complexity: a Wikipedia case study
Yasseri, Taha, Kornai, András, Kertész, János
We try to address the issue of language complexity empirically by comparing the simple English Wikipedia (Simple) to comparable samples of the main English Wikipedia (Main). Simple is supposed to use a more simplified language with a limited vocabulary, and editors are explicitly requested to follow this guideline, yet in practice the vocabulary richness of both samples are at the same level. Detailed analysis of longer units (n-grams of words and part of speech tags) shows that the language of Simple is less complex than that of Main primarily due to the use of shorter sentences, as opposed to drastically simplified syntax or vocabulary. Comparing the two language varieties by the Gunning readability index supports this conclusion. We also report on the topical dependence of language complexity, e.g. that the language is more advanced in conceptual articles compared to person-based (biographical) and object-based articles. Finally, we investigate the relation between conflict and language complexity by analyzing the content of the talk pages associated to controversial and peacefully developing articles, concluding that controversy has the effect of reducing language complexity.
Incremental Referent Grounding with NLP-Biased Visual Search
Cantrell, Rehj (Indiana University) | Krause, Evan (Tufts University) | Scheutz, Matthias (Tufts University) | Zillich, Michael (Technische Universitat Wien) | Potapova, Ekaterina (Technische Universitat Wien)
Human-robot interaction poses tight timing requirements on visual as well as natural language processing in order to allow for natural human-robot interaction. In particular, humans expect robots to incrementally resolve spoken references to visually perceivable objects as the referents are verbally described. In this paper, we present an integrated robotic architecture with novel incremental vision and natural language processing and demonstrate that incrementally refining attentional focus using linguistic constraints achieves significantly better performance of the vision system compared to non-incremental visual processing.