Goto

Collaborating Authors

 Grammars & Parsing


Mining Meaning from Wikipedia

arXiv.org Artificial Intelligence

Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.


Sentence Compression as Tree Transduction

Journal of Artificial Intelligence Research

This paper presents a tree-to-tree transduction method for sentence compression. Our model is based on synchronous tree substitution grammar, a formalism that allows local distortion of the tree topology and can thus naturally capture structural mismatches. We describe an algorithm for decoding in this framework and show how the model can be trained discriminatively within a large margin framework. Experimental results on sentence compression bring significant improvements over a state-of-the-art model.


Identification of Pleonastic It Using the Web

Journal of Artificial Intelligence Research

In a significant minority of cases, certain pronouns, especially the pronoun it, can be used without referring to any specific entity. This phenomenon of pleonastic pronoun usage poses serious problems for systems aiming at even a shallow understanding of natural language texts. In this paper, a novel approach is proposed to identify such uses of it: the extrapositional cases are identified using a series of queries against the web, and the cleft cases are identified using a simple set of syntactic rules. The system is evaluated with four sets of news articles containing 679 extrapositional cases as well as 78 cleft constructs. The identification results are comparable to those obtained by human efforts.


Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

arXiv.org Artificial Intelligence

With the development of Natural Language Processing (NLP), more and more systems want to adopt NLP in User Interface Module to process user input, in order to communicate with user in a natural way. However, this raises a speed problem. That is, if NLP module can not process sentences in durable time delay, users will never use the system. As a result, systems which are strict with processing time, such as dialogue systems, web search systems, automatic customer service systems, especially real-time systems, have to abandon NLP module in order to get a faster system response. This paper aims to solve the speed problem. In this paper, at first, the construction of a syntactic parser which is based on corpus machine learning and statistics model is introduced, and then a speed problem analysis is performed on the parser and its algorithms. Based on the analysis, two accelerating methods, Compressed POS Set and Syntactic Patterns Pruning, are proposed, which can effectively improve the time efficiency of parsing in NLP module. To evaluate different parameters in the accelerating algorithms, two new factors, PT and RT, are introduced and explained in detail. Experiments are also completed to prove and test these methods, which will surely contribute to the application of NLP.


A Bayesian LDA-based model for semi-supervised part-of-speech tagging

Neural Information Processing Systems

We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words' distributions over tags, p(t w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset.


Discriminative Log-Linear Grammars with Latent Variables

Neural Information Processing Systems

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure.


Rapid Inference on a Novel AND/OR graph for Object Detection, Segmentation and Parsing

Neural Information Processing Systems

In this paper we formulate a novel AND/OR graph representation capable of describing the different configurations of deformable articulated objects such as horses. The representation makes use of the summarization principle so that lower level nodes in the graph only pass on summary statistics to the higher level nodes. The probability distributions are invariant to position, orientation, and scale. We develop a novel inference algorithm that combined a bottom-up process for proposing configurations for horses together with a top-down process for refining and validating these proposals. The strategy of surround suppression is applied to ensure that the inference time is polynomial in the size of input data. The algorithm was applied to the tasks of detecting, segmenting and parsing horses. We demonstrate that the algorithm is fast and comparable with the state of the art approaches.


A Bayesian LDA-based model for semi-supervised part-of-speech tagging

Neural Information Processing Systems

We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words' distributions over tags, p(t w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset.


Discriminative Log-Linear Grammars with Latent Variables

Neural Information Processing Systems

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure.


Rapid Inference on a Novel AND/OR graph for Object Detection, Segmentation and Parsing

Neural Information Processing Systems

In this paper we formulate a novel AND/OR graph representation capable of describing the different configurations of deformable articulated objects such as horses. The representation makes use of the summarization principle so that lower level nodes in the graph only pass on summary statistics to the higher level nodes. The probability distributions are invariant to position, orientation, and scale. We develop a novel inference algorithm that combined a bottom-up process for proposing configurations for horses together with a top-down process for refining and validating these proposals. The strategy of surround suppression is applied to ensure that the inference time is polynomial in the size of input data. The algorithm was applied to the tasks of detecting, segmenting and parsing horses. We demonstrate that the algorithm is fast and comparable with the state of the art approaches.