Industry
Improving Spoken Dialogue Understanding Using Phonetic Mixture Models
Wang, William Yang (Columbia University) | Artstein, Ron (USC Institute for Creative Technologies) | Leuski, Anton (USC Institute for Creative Technologies) | Traum, David (USC Institute for Creative Technologies)
Augmenting word tokens with a phonetic representation, derived from a dictionary, improves the performance of a Natural Language Understanding component that interprets speech recognizer output: we observed a 5% to 7% reduction in errors across a wide range of response return rates. The best performance comes from mixture models incorporating both word and phone features. Since the phonetic representation is derived from a dictionary, the method can be applied easily without the need for integration with a specific speech recognizer. The method has similarities with autonomous (or bottom-up) psychological models of lexical access, where contextual information is not integrated at the stage of auditory perception but rather later.
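The idea of scoring an utterance on both word and phone features can be illustrated with a minimal sketch. It is not the authors' model: the toy pronunciation dictionary `PRON`, the phone-bigram features, the cosine similarity, and the interpolation weight `lam` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy pronunciation dictionary (ARPAbet-like symbols).
PRON = {
    "right": ["R", "AY", "T"],
    "write": ["R", "AY", "T"],
    "a": ["AH"],
    "letter": ["L", "EH", "T", "ER"],
}

def phones(words):
    """Map words to a flat phone sequence; unknown words stand for themselves."""
    out = []
    for w in words:
        out.extend(PRON.get(w, [w]))
    return out

def bigrams(seq):
    return list(zip(seq, seq[1:]))

def cosine(a, b):
    """Cosine similarity between two bags of features."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mixture_score(hyp, ref, lam=0.5):
    """Interpolate word-level and phone-level similarity (lam weights words)."""
    hw, rw = hyp.lower().split(), ref.lower().split()
    word_sim = cosine(hw, rw)
    phone_sim = cosine(bigrams(phones(hw)), bigrams(phones(rw)))
    return lam * word_sim + (1 - lam) * phone_sim

# A recognizer confusing "write" with "right" loses word overlap,
# but the phone-level features still match.
score = mixture_score("right a letter", "write a letter")
```

Because the phone sequence comes from a dictionary lookup on the recognizer's word output, nothing in this sketch depends on a particular speech recognizer.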
Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization
Villena-Román, Julio (Universidad Carlos III de Madrid) | Collada-Pérez, Sonia (Daedalus - Data, Decisions and Language, S.A.) | Lana-Serrano, Sara (Universidad Politécnica de Madrid) | González-Cristóbal, José Carlos (Universidad Politécnica de Madrid)
This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to train.
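A minimal sketch of the two-stage idea follows, assuming a toy training set, a word-overlap similarity, and a hypothetical rule format; none of these come from the paper. A k-Nearest Neighbor vote proposes a category, then per-category rules remove it when a negative term appears (filtering false positives) or add it when a positive term appears (recovering false negatives).

```python
from collections import Counter

# Hypothetical labeled corpus: (text, category).
TRAIN = [
    ("wheat harvest exports rise", "grain"),
    ("corn and wheat prices fall", "grain"),
    ("crude oil prices climb", "oil"),
    ("opec cuts oil output", "oil"),
]

# Hypothetical rules: lists of (multiword) terms per category.
RULES = {
    "oil": {"positive": ["crude oil", "barrel"], "negative": ["olive oil"]},
}

def overlap(a, b):
    """Number of shared word tokens between two texts."""
    return sum((Counter(a.split()) & Counter(b.split())).values())

def knn_predict(text, k=3):
    """Majority label among the k most similar training texts."""
    scored = [(overlap(text, doc), label) for doc, label in TRAIN]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = Counter(label for score, label in top if score > 0)
    return {votes.most_common(1)[0][0]} if votes else set()

def apply_rules(text, labels):
    """Post-process the base prediction with the rule lists."""
    result = set(labels)
    for category, rule in RULES.items():
        # Plain substring matching keeps the sketch short.
        if any(term in text for term in rule["negative"]):
            result.discard(category)   # filter a false positive
        elif any(term in text for term in rule["positive"]):
            result.add(category)       # recover a false negative
    return result

def categorize(text):
    text = text.lower()
    return apply_rules(text, knn_predict(text))
```

For example, `categorize("olive oil prices climb")` is labeled `oil` by the base classifier and then cleared by the negative term, while `categorize("barrel shipments delayed")` has no similar training text and receives `oil` only through the positive term.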
Disambiguation and Filtering Methods in Using Web Knowledge for Coreference Resolution
Uryupina, Olga (CIMeC, University of Trento) | Poesio, Massimo (CIMeC, University of Trento) | Giuliano, Claudio (Fondazione Bruno Kessler) | Tymoshenko, Kateryna (Fondazione Bruno Kessler)
We investigate two publicly available web knowledge bases, Wikipedia and Yago, in an attempt to leverage semantic information and increase the performance level of a state-of-the-art coreference resolution (CR) engine. We extract semantic compatibility and aliasing information from Wikipedia and Yago, and incorporate it into a CR system. We show that using such knowledge with no disambiguation and filtering does not bring any improvement over the baseline, mirroring the previous findings. We propose, therefore, a number of solutions to reduce the amount of noise coming from web resources: using disambiguation tools for Wikipedia, pruning Yago to eliminate the most generic categories and imposing additional constraints on affected mentions. Our evaluation experiments on the ACE-02 corpus show that the knowledge, extracted from Wikipedia and Yago, improves our system's performance by 2-3 percentage points.
Event Extraction Approach for French Language
Sellmi, Oussama (SOIE, ISG de Tunis)
With the proliferation of news articles from thousands of different sources now available on the Web, summarization of such information is becoming increasingly important. Considering the large number of news sources (for example, BBC, Reuters, CNN…), thousands of articles are produced every day around the world concerning a given event. S. Tenier, A. Napoli, X. Polanco and Y. Toussaint (2006) developed an automatic system for the semantic annotation of Web pages. The objective is to classify pages concerning research teams, in order to determine, for example, who works where, on what and with whom (using an ontology of the domain). It consists, first, of the identification of the syntactic structure characterizing the …
A Linguistic Analysis of Student-Generated Paraphrases
Rus, Vasile (The University of Memphis) | Feng, Shi (The University of Memphis) | Brandon, Russell (The University of Memphis) | Crossley, Scott (Georgia State University) | McNamara, Danielle S. (The University of Memphis)
Paraphrase identification is a core Natural Language Processing task that involves assessing the semantic similarity of two texts. To foster systematic studies of this task, standardized datasets were created on which various approaches could be compared more fairly. However, a better understanding and more precise operational definition of a paraphrase are needed before any further datasets or systematic evaluations of the task of paraphrase identification are proposed. This study develops the concept of paraphrasing as a writing strategy. Six types of paraphrases are defined through the creation of a relatively large corpus of student-generated paraphrases. These paraphrases are analyzed along several dozen linguistic dimensions ranging from cohesion to lexical diversity. The most significant indices from these dimensions were then used to build a prediction model that could identify true and false paraphrases and each of the six paraphrase types.
Fairy Tales and ESL Texts: An Analysis of Linguistic Features Using the Gramulator
Rufenacht, Rachel M. (University of Memphis) | McCarthy, Philip M. (University of Memphis) | Lamkin, Travis A (University of Memphis)
Using the Gramulator, we analyzed the linguistic features of ESL texts and fairy tales. Our goal was to determine if fairy tales had the potential to be used as reading material for English language learners. The results of our analyses suggest that there are significant similarities between fairy tales and ESL texts, but that differences lie in the content of the text types with fairy tales appearing significantly more narrative in style and ESL texts appearing more expository.
Automated Assessment of Paragraph Quality: Introduction, Body, and Conclusion Paragraphs
Roscoe, Rod (University of Memphis) | Crossley, Scott (Georgia State University) | Weston, Jennifer (University of Memphis) | McNamara, Danielle (University of Memphis)
Natural language processing and statistical methods were used to identify linguistic features associated with the quality of student-generated paragraphs. Linguistic features were assessed using Coh-Metrix. The resulting computational models demonstrated small to medium effect sizes for predicting paragraph quality: introduction quality r² = .25, body quality r² = .10, and conclusion quality r² = .11. Although the variance explained was somewhat low, the linguistic features identified were consistent with the rhetorical goals of paragraph types. Avenues for bolstering this approach by considering individual writing styles and techniques are considered.
Student Speech Act Classification Using Machine Learning
Rasor, Travis (University of Memphis) | Olney, Andrew (University of Memphis) | D'Mello, Sidney (University of Memphis)
Dialogue-based intelligent tutoring systems use speech act classifiers to categorize student input into answers, questions, and other speech acts. Previous work has primarily focused on question classification. In this paper, we present a complementary speech act classifier that focuses primarily on non-questions, which was developed using machine learning techniques. Our results show that an effective speech act classifier can be developed directly from labeled data using decision trees.
Automatic Natural Language Processing and the Detection of Reading Skills and Reading Comprehension
Boonthum-Denecke, Chutima (Hampton University) | McCarthy, Philip (University of Memphis) | Lamkin, Travis (University of Memphis) | Jackson, G. Tanner (University of Memphis) | Magliano, Joseph P. (Northern Illinois University) | McNamara, Danielle S. (University of Memphis)
The primary goal of this study is to assess two approaches for detecting comprehension processes in R-SAT (Reading Strategy Assessment Tool). One approach is based on Latent Semantic Analysis (LSA) while the other is a combination of literal word matching and Soundex. A secondary goal is to assess the potential for detecting specific reading comprehension strategies, either in isolation or combination. Participants typed “think-aloud” protocols while reading texts presented on computers. Human judges rated these protocols for the presence of the various reading comprehension strategies. LSA, word, and combined algorithms were compared and the results showed that a combination of both approaches yielded the best results. However, performance of the combined algorithm varied in terms of the type of processes and the grain size of the human coding system. Lastly, the use of reading strategies (either in isolation or combination) is positively related to students’ Gates–MacGinitie reading comprehension scores, which illustrates the merit of this approach for assessing comprehension skill.
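The word-based approach can be sketched as follows, assuming the standard American Soundex coding and a hypothetical scoring function `match_score`; the abstract does not specify how R-SAT combines the two signals, so the combination here is only one plausible reading. A benchmark word counts as matched if it appears literally in the typed protocol or shares a Soundex code with one of its words, which tolerates common misspellings.

```python
# Standard Soundex consonant classes.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue              # h and w do not separate repeated codes
        code = CODES.get(ch, "")  # vowels have no code and reset prev
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def match_score(response, benchmark):
    """Fraction of benchmark words matched literally or by Soundex code.

    Hypothetical scoring function, not the R-SAT algorithm itself.
    """
    resp = response.lower().split()
    resp_words = set(resp)
    resp_codes = {soundex(w) for w in resp}
    bench = benchmark.lower().split()
    if not bench:
        return 0.0
    hits = sum(1 for w in bench
               if w in resp_words or soundex(w) in resp_codes)
    return hits / len(bench)
```

With this sketch, a misspelled protocol such as "the hart pumps blud" still matches every word of the benchmark "heart pumps blood".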
Shared Experiences, Shared Representations, and the Implications for Applied Natural Language Processing
Stent, Amanda J. (AT&T Labs)
When people interact with language-producing agents (other people or computers), they assume that the shared experience leads to shared representations — of the world, the interaction, and the language used in the interaction. This phenomenon occurs even during interaction with systems that give no evidence of building shared representations. The absence of shared representations leads to errors and delays; alternatively, even simple shared representations can lead to reduced error rates and more efficient interaction. In this talk, we present three case studies: a mobile local business search application that builds no interaction representations; a telephone-based recommendation and review system that builds limited representations of the shared language in the interaction; and computer models of coreference that use shared representations to permit both coreference resolution and referring expression generation. We lay out a range of possibilities for shared representations, show that they can be built incrementally as an interaction progresses, and point to possibilities for future work in probabilistic shared representations for interactive systems.