Hidden Markov Random Fields Based LSI Text Semi-supervised Clustering

AAAI Conferences

Semi-supervised learning is an active research field. Previous results have shown that incorporating background information into the original unsupervised clustering problem can achieve higher accuracy. In this paper, we explore the cooperation between the pairwise constraints given by the user and the semantic information in natural language. In addition, we reduce the time complexity to make the algorithm feasible for large quantities of data. Experiments on corpora of different scales show the robustness and effectiveness of the proposed algorithm, whose F-measure is 20% higher than that of previous algorithms.


Simplification of Patent Claim Sentences for their Paraphrasing and Summarization

AAAI Conferences

We present an approach to patent claim simplification which segments claim sentences into clausal discourse units, transforms them into complete sentences, establishes coreference relations and builds a discourse structure between discourse units. The four stages are necessary to allow for the syntactic analysis of otherwise unparsable claim sentences and their regeneration using discourse structure and coreference relations in order to ensure the production of a cohesive and coherent paraphrase/summary.


The Role of Knowledge-based Features in Polarity Classification at Sentence Level

AAAI Conferences

Though polarity classification has been extensively explored at document level, there has been little work investigating feature design at sentence level. Due to the small number of words within a sentence, polarity classification at sentence level differs substantially from document-level classification in that the resulting bag-of-words feature vectors tend to be very sparse, resulting in lower classification accuracy. In this paper, we show that performance can be improved by adding features specifically designed for sentence-level polarity classification. We consider both explicit polarity information and various linguistic features. A large proportion of the improvement that can be obtained by using polarity information can also be achieved by using a set of simple domain-independent linguistic features.


c-rater: Automatic Content Scoring for Short Constructed Responses

AAAI Conferences

The education community is moving towards constructed or free-text responses and computer-based assessment. At the same time, progress in natural language processing and knowledge representation has made it possible to consider free-text or constructed responses without having to fully understand the text. c-rater is a technology at Educational Testing Service (ETS) used for automatic content scoring for short, free-text responses. This paper describes some of the major developments made in c-rater recently.


SlidesGen: Automatic Generation of Presentation Slides for a Technical Paper Using Summarization

AAAI Conferences

Presentations are one of the most common and effective ways of communicating the overview of a work to the audience. Given a technical paper, automatic generation of presentation slides reduces the effort of the presenter and helps in creating a structured summary of the paper. In this paper, we propose the framework of a novel system that does this task. Any paper that has an abstract and whose sections can be categorized under introduction, related work, model, experiments and conclusions can be given as input. As documents in LaTeX are rich in structural and semantic information, we used them as input to our system. These documents are first converted to XML format. The XML file is then parsed and the information in it is extracted. A query-specific extractive summarizer is used to generate the slides. All graphical elements from the paper are put to good use by placing them at appropriate locations in the slides. The slides are presented in document order.


Computational Considerations in Correcting User-Language

AAAI Conferences

This study evaluates the robustness of established computational indices used to assess text relatedness in user-language. The original User-Language Paraphrase Corpus (ULPC) was compared to a corrected version, in which each paraphrase was corrected for typographical and grammatical errors. Error correction significantly affected values for each of five computational indices, indicating greater similarity of the target sentence to the corrected paraphrase than to the original paraphrase. Moreover, misspelled target words accounted for a large proportion of the differences. This study also evaluated potential effects on correlations between computational indices and human ratings of paraphrases. The corrections did not yield assessments that were any more or less comparable to trained human raters than were the original paraphrases containing typographical or grammatical errors. The results suggest that although correcting for errors may optimize certain computational indices, the corrections are not necessary for comparing the indices to expert ratings.


Testing Analogical Proportions with Google using Kolmogorov Information Theory

AAAI Conferences

Analogical reasoning is considered one of the main mechanisms underlying creativity. "Thinking out of the box" allows the paradigm shift essential to a creative process. More common is the concept of analogical proportion ("2 is to 4 as 4 is to 8"), which can be described within an algebraic framework. When it comes to concepts ("engine is to the car as heart is to the human"), we need to investigate a new way to understand this analogical ratio. In this paper, we take inspiration from the formal framework of information theory to propose a new approach to the evaluation of analogy between concepts. Using Kolmogorov complexity as a backbone providing a clear semantics, we give a practical interpretation for analogy between words viewed as labeling concepts. Making use of Google as a linguistic resource, we provide an implementation of our definitions: experiments show that the accuracy of our definition is quite acceptable and justify the approach.
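The numeric case mentioned in the abstract ("2 is to 4 as 4 is to 8") can be checked mechanically. The following is a minimal sketch that assumes the proportion is read as an equality of ratios (a geometric proportion). The function name `is_geometric_proportion` is hypothetical, and the sketch does not reproduce the paper's Kolmogorov-based definition for concepts.

```python
from fractions import Fraction


def is_geometric_proportion(a, b, c, d):
    """Return True if a : b :: c : d holds as an equality of ratios.

    Assumption: "a is to b as c is to d" means a/b == c/d. Cross-multiplying
    (a * d == b * c) avoids division by zero, and Fraction keeps the
    arithmetic exact for integer and rational inputs.
    """
    return Fraction(a) * Fraction(d) == Fraction(b) * Fraction(c)


if __name__ == "__main__":
    print(is_geometric_proportion(2, 4, 4, 8))  # the abstract's example
    print(is_geometric_proportion(2, 4, 4, 9))
```

No such closed-form test is available for word concepts, which is the gap the abstract says its information-theoretic approach addresses.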


Computational Replication of Human Paraphrase Assessment

AAAI Conferences

Two sentences are paraphrases if their meanings are equivalent but their words and syntax are different. Paraphrasing can be used to aid comprehension, stimulate prior knowledge, and assist in writing skills development. While automated paraphrase assessment is both commonplace and useful, research has centered solely on artificial, edited paraphrases and has used only binary dimensions (i.e., is or is-not a paraphrase). In this study, we use 1998 natural paraphrases generated by high school students that have been assessed along 10 dimensions of paraphrase (e.g., semantic completeness). This study investigates the components of paraphrase quality emerging from these dimensions, and examines whether computational approaches (e.g., LSA, MED) can simulate those human evaluations. The results suggest that semantic and syntactic evaluations are the primary components of paraphrase quality, and that computationally light systems such as LSA (semantics) and MED (syntax) present promising approaches to simulating human evaluations of paraphrases.
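To illustrate why an edit-distance index counts as computationally light, here is a minimal sketch. It assumes that MED refers to a minimal edit distance, which the abstract does not spell out, and it uses the standard Levenshtein recurrence with unit costs. The function name `min_edit_distance` is hypothetical and is not taken from the study.

```python
def min_edit_distance(source, target):
    """Levenshtein distance between two sequences (strings or token lists).

    Unit cost for insertion, deletion, and substitution. Only one previous
    row of the dynamic-programming table is kept.
    """
    previous = list(range(len(target) + 1))
    for i, src_item in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty target
        for j, tgt_item in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_item != tgt_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Word-level comparison of a target sentence and a paraphrase.
    target = "the cat sat on the mat".split()
    paraphrase = "the cat was on the mat".split()
    print(min_edit_distance(target, paraphrase))
```

Passing token lists rather than raw strings makes the distance sensitive to word order and word choice instead of individual characters.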


Paraphrase Identification Using Weighted Dependencies and Word Semantics

AAAI Conferences

In this paper we propose a novel approach to the task of paraphrase identification. The proposed approach quantifies both the similarity and dissimilarity between two sentences. The similarity and dissimilarity are assessed based on lexico-semantic information, i.e., word semantics, and syntactic information in the form of dependencies, which are explicit syntactic relations between words in a sentence. Word semantics requires mapping words onto concepts in a taxonomy and then using word-to-word similarity metrics to compute their semantic relatedness. Dependencies are obtained using state-of-the-art dependency parsers. One important aspect of our approach is the weighting of missing dependencies, i.e., syntactic relations present in one sentence but not the other. We report experimental results on the Microsoft Paraphrase Corpus, a standard data set for evaluating approaches to paraphrase identification. The experiments showed that the proposed approach offers state-of-the-art results. In particular, our approach offers better precision when compared to other state-of-the-art systems.


CombiTagger: A System for Developing Combined Taggers

AAAI Conferences

The main task of part-of-speech (PoS) tagging is to assign the appropriate morphosyntactic category to each word in a sentence. A combination of different PoS taggers usually results in higher tagging accuracy than that obtained by using a single tagger. We present a new language- and tagset-independent system, CombiTagger, which automatically combines the output of several taggers. The system, which is open source, provides algorithms for simple and weighted voting, but it is extensible so that other combination algorithms can be added easily. We demonstrate the functionality of CombiTagger by using it to develop and evaluate combined taggers for Icelandic. The most accurate individual tagger obtains an accuracy of 91.83%. CombiTagger achieves 93.09%-93.41% accuracy by combining the output of five or six taggers using simple and weighted voting.
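Simple and weighted voting over tagger outputs can be sketched as follows. This is an illustrative example only, not CombiTagger's code or API: the function name `combine_tags`, the tie-breaking rule, and the idea of using per-tagger weights such as held-out accuracies are all assumptions made for the sketch.

```python
from collections import defaultdict


def combine_tags(proposals, weights=None):
    """Pick one tag for a token from the tags proposed by several taggers.

    proposals: non-empty list of tags, one per tagger, in a fixed tagger order.
    weights:   optional list of per-tagger weights (hypothetical, e.g. each
               tagger's accuracy on held-out data). If omitted, every tagger
               gets weight 1.0, which is simple majority voting.
    Ties are broken in favor of the earliest tagger in the list.
    """
    if weights is None:
        weights = [1.0] * len(proposals)
    scores = defaultdict(float)
    for tag, weight in zip(proposals, weights):
        scores[tag] += weight
    best_score = max(scores.values())
    for tag in proposals:
        if scores[tag] == best_score:
            return tag


if __name__ == "__main__":
    # Simple voting: two of three taggers agree on "NN".
    print(combine_tags(["NN", "VB", "NN"]))
    # Weighted voting: one strong tagger outweighs two weaker ones.
    print(combine_tags(["NN", "VB", "VB"], weights=[0.9, 0.3, 0.3]))
```

With weights, a single reliable tagger can override a majority of weaker ones, which is the main practical difference between the two voting schemes.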