Student Speech Act Classification Using Machine Learning

AAAI Conferences

Dialogue-based intelligent tutoring systems use speech act classifiers to categorize student input into answers, questions, and other speech acts. Previous work has primarily focused on question classification. In this paper, we present a complementary speech act classifier that focuses primarily on non-questions and was developed using machine learning techniques. Our results show that an effective speech act classifier can be developed directly from labeled data using decision trees.


Given Bilingual Terminology in Statistical Machine Translation: MWE-Sensitive Word Alignment and Hierarchical Pitman-Yor Process-Based Translation Model Smoothing

AAAI Conferences

This paper considers a scenario in which we are given almost perfect knowledge about the bilingual terminology for a test corpus in Statistical Machine Translation (SMT). When the given terminology is part of a training corpus, one natural strategy in SMT is to use the trained translation model and ignore the given terminology. Two questions arise here. 1) Can a word aligner capture the given terminology? We ask this because, even if the terminology is in the training corpus, the resulting translation model often does not include these terms. 2) Are the probabilities in the translation model correctly calculated? To answer these questions, we conducted experiments introducing a Multi-Word Expression-sensitive (MWE-sensitive) word aligner and hierarchical Pitman-Yor process-based translation model smoothing. Using the 200k JP-EN NTCIR corpus, our experimental results show that introducing the MWE-sensitive word aligner and the new translation model smoothing yields an overall improvement of 1.35 BLEU points absolute and 6.0% relative compared to the case where these two are not introduced.


Dissimilarity Kernels for Paraphrase Identification

AAAI Conferences

We present in this paper a novel solution to the problem of paraphrase identification based on lexical dissimilarity kernels. Lexical kernels in conjunction with Support Vector Machines are preferred over other learning methods, e.g. decision trees, due to their ability to handle a high number of features. Dissimilarity-based kernels emphasize dissimilarities among text fragments and therefore are appropriate for text similarity tasks characterized by high lexical overlap. We conducted experiments with our kernels on the Microsoft Research (MSR) Paraphrase Corpus, a standardized data set used for assessing approaches to paraphrase identification. Our reported accuracy results are competitive and robust when compared to state-of-the-art single-model approaches. The results were obtained using 10-fold cross-validation over the entire corpus. We also report competitive results on the test portion of the MSR Paraphrase Corpus, which is the standard way to report results on this corpus.


The Hierarchy of Detective Fiction: A Gramulator Analysis

AAAI Conferences

Closely related genres have complex interrelations. An antecedent genre can constrain a subsequent genre, but changing rhetorical situations can lead to distinctions between an antecedent and its descendant. In this study, we assess two genres of detective fiction to determine their hierarchical relation to one another. We use the Gramulator, a computational tool that identifies indicative lexical features, to explain the relationship between whodunit fiction and hardboiled fiction. We conclude, based on the indicative lexical features of the expositions in texts, that the two are sibling genres.


Domain Independent Knowledge Base Population from Structured and Unstructured Data Sources

AAAI Conferences

In this paper we introduce a system that is designed to automatically populate a knowledge base from both structured and unstructured text given an ontology. Our system is designed as a modular end-to-end system that takes structured or unstructured data as input, extracts information, maps relevant information to an ontology, and finally disambiguates entities in the knowledge base. The novelty of our approach is that it is domain independent and can easily be adapted to new ontologies and domains. Unlike most knowledge base population systems, ours includes entity detection. This feature allows one to employ very complex ontologies that include events and the entities that are involved in the events.


Simulating Human Ratings on Word Concreteness

AAAI Conferences

A single word in the human language has many complex dimensions such as semantics, parts of speech, lexical type, imageability, concreteness, familiarity, etc. It is important to know the dimensions of words in languages so that we can develop a better theoretical understanding of language and also to build tools that simulate human intelligence and [...]

However, word concreteness is not an attribute that a computer can directly compute. One means of assessing the characteristics of words is by having humans rate them on the dimensions of interest. Humans are proficient in categorizing words into linguistic dimensions, but it is impractical to have humans rating tens of thousands of words that we would need for psycholinguistic research.


Co-Occurrence-Based Error Correction Approach to Word Segmentation

AAAI Conferences

To overcome the problems in Thai word segmentation, a number of word segmentation approaches have been proposed over the years. We propose a novel Thai word segmentation approach called Co-occurrence-Based Error Correction (CBEC). CBEC generates all possible segmentation candidates using the classical maximal matching algorithm and then selects the most accurate segmentation based on co-occurrence and an error correction algorithm. CBEC was trained and evaluated on the BEST 2009 corpus.
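
The candidate-generation step mentioned in this abstract can be illustrated with a minimal sketch. It is not the CBEC implementation: the function names, the toy English lexicon, and the reading of "maximal matching" as "enumerate every dictionary-based segmentation and prefer the one with the fewest words" are all assumptions made for illustration, and the co-occurrence and error-correction steps are not shown.

```python
from functools import lru_cache


def all_segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon` (a set)."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty suffix
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                # Extend every segmentation of the remaining suffix with this word.
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def maximal_matching(text, lexicon):
    """Pick a candidate with the fewest words, or None if no candidate exists."""
    candidates = all_segmentations(text, lexicon)
    return min(candidates, key=len) if candidates else None


if __name__ == "__main__":
    toy_lexicon = {"the", "them", "me", "meat", "eat", "at"}  # hypothetical lexicon
    for candidate in all_segmentations("themeat", toy_lexicon):
        print(candidate)
    print(maximal_matching("themeat", toy_lexicon))
```

In this toy input, two candidates tie at two words each, so the fewest-words criterion alone cannot decide between them. That kind of ambiguity is where an additional selection signal, such as the co-occurrence information the abstract describes, would be needed.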


Automatic Natural Language Processing and the Detection of Reading Skills and Reading Comprehension

AAAI Conferences

The primary goal of this study is to assess two approaches for detecting comprehension processes in R-SAT (Reading Strategy Assessment Tool). One approach is based on Latent Semantic Analysis (LSA) while the other is a combination of literal word matching and soundex. A secondary goal is to assess the potential for detecting specific reading comprehension strategies, either in isolation or combination. Participants typed “think-aloud” protocols while reading texts presented on computers. Human judges rated these protocols for the presence of the various reading comprehension strategies. LSA, word, and combined algorithms were compared and the results showed that a combination of both approaches yielded the best results. However, performance of the combined algorithm varied in terms of the type of processes and the grain size of the human coding system. Lastly, the use of reading strategies (either in isolation or combination) is positively related to students’ Gates–MacGinitie reading comprehension scores, which illustrates the merit of this approach for assessing comprehension skill.


Automatic Reduction of a Document-Derived Noun Vocabulary

AAAI Conferences

We propose and evaluate five related algorithms that automatically derive limited-size noun vocabularies from text documents of 2,000-30,000 words. The proposed algorithms combine Personalized Page Rank and principles of information maximization, and are applied to the WordNet graph for nouns. For the best-performing algorithm the difference between automatically generated reduced noun lexicons and those created by human writers is approximately 1-2 WordNet edges per lexical item. Our results also indicate the importance of performing word-sense disambiguation with sentence-level context information at the earliest stage of analysis.
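
As a rough illustration of the Personalized Page Rank component named in this abstract, the sketch below runs a plain power iteration on a tiny hand-made graph. It is not one of the five proposed algorithms: the graph, the seed set, the damping value, and the function name are assumptions, the real WordNet noun graph is not used, and the information-maximization step is not shown.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for Personalized PageRank.

    `graph` maps each node to a list of neighbours; `seeds` is a set of nodes
    that receive all of the teleport probability.
    """
    nodes = list(graph)
    # Teleport distribution: uniform over the seed nodes, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                # Dangling node: hand its mass back to the seed nodes.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
                continue
            share = damping * rank[n] / len(neighbours)
            for m in neighbours:
                new_rank[m] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph standing in for a fragment of a noun hierarchy.
    toy_graph = {
        "animal": ["dog", "cat"],
        "dog": ["animal", "puppy"],
        "cat": ["animal"],
        "puppy": ["dog"],
        "rock": [],
    }
    scores = personalized_pagerank(toy_graph, seeds={"puppy"})
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{node}: {score:.4f}")
```

Seeding the walk with nodes taken from a document biases the scores toward that document's neighbourhood of the graph, which is the general reason a personalized variant suits document-specific vocabulary selection.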


Commonsense Knowledge Extraction Using Concepts Properties

AAAI Conferences

This paper presents a semantically grounded method for extracting commonsense knowledge. First, commonsense rules are identified, e.g., one cannot see imaginary objects. Second, those rules are combined with a basic semantic representation in order to infer commonsense knowledge facts, e.g. one cannot see a flying carpet. Further combinations of semantic relations with inferred commonsense facts are proposed and analyzed. Results show that this novel method is able to extract thousands of commonsense facts with little human interaction and high accuracy.