Disambiguation and Filtering Methods in Using Web Knowledge for Coreference Resolution
Uryupina, Olga (CiMEC, University of Trento) | Poesio, Massimo (CiMEC, University of Trento) | Giuliano, Claudio (Fondazione Bruno Kessler) | Tymoshenko, Kateryna (Fondazione Bruno Kessler)
We investigate two publicly available web knowledge bases, Wikipedia and Yago, in an attempt to leverage semantic information and increase the performance level of a state-of-the-art coreference resolution (CR) engine. We extract semantic compatibility and aliasing information from Wikipedia and Yago, and incorporate it into a CR system. We show that using such knowledge with no disambiguation and filtering does not bring any improvement over the baseline, mirroring the previous findings. We propose, therefore, a number of solutions to reduce the amount of noise coming from web resources: using disambiguation tools for Wikipedia, pruning Yago to eliminate the most generic categories and imposing additional constraints on affected mentions. Our evaluation experiments on the ACE-02 corpus show that the knowledge, extracted from Wikipedia and Yago, improves our system's performance by 2-3 percentage points.
Using Centrality Algorithms on Directed Graphs for Synonym Expansion
Sinha, Ravi Som (University of North Texas) | Mihalcea, Rada Flavia (University of North Texas)
This paper presents our explorations in using graph centrality measures to solve the synonym expansion problem. In particular, we use the concept of directional similarity to derive directed graphs on which we apply centrality algorithms to identify the most likely synonyms for a target word in a given context. We show that our method can lead to performance comparable to the state-of-the-art.
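As a rough illustration of the idea in this abstract, the sketch below runs a weighted PageRank, one common centrality measure, over a small directed graph of candidate synonyms. It is not the authors' implementation: the `pagerank` function, the candidate words, and the edge weights (standing in for directional similarity scores) are all hypothetical and chosen only for demonstration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank on a directed graph given as {(src, dst): weight}."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)

    # Total outgoing weight per node, used to normalise each node's votes.
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is spread uniformly.
        dangling = sum(scores[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            new_scores[dst] += damping * scores[src] * weight / out_weight[src]
        scores = new_scores
    return scores


# Hypothetical directional similarities between candidate synonyms of a
# target word in some context (toy values, not taken from the paper).
toy_edges = {
    ("smart", "intelligent"): 0.9,
    ("clever", "intelligent"): 0.9,
    ("clever", "smart"): 0.1,
    ("intelligent", "smart"): 0.6,
    ("shiny", "luminous"): 0.4,
    ("luminous", "shiny"): 0.3,
}

toy_scores = pagerank(toy_edges)
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(ranking)
```

In this toy graph the candidate that receives the most weighted incoming similarity ends up ranked first, which is the intuition behind using centrality to pick the most likely synonyms.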
Event Extraction Approach for French Language
Sellmi, Oussama (SOIE, ISG de Tunis)
With the proliferation of news articles from thousands of different sources now available on the Web, summarization of such information is becoming increasingly important. Considering the large number of news sources (for example, BBC, Reuters, CNN…), every day thousands of articles are produced around the world concerning a given event. S. Tenier, A. Napoli, X. Polanco and Y. Toussaint (2006) developed an automatic Web page semantic annotation system. The objective is to classify pages concerning research teams, in order to determine, for example, who works where, on what and with whom (using an ontology of the domain). It consists, first, of the identification of the syntactic structure characterizing the ...
Evaluating Semantic Metrics on Tasks of Concept Similarity
Schwartz, Hansen Andrew (University of Central Florida) | Gomez, Fernando (University of Central Florida)
This study presents an evaluation of WordNet-based semantic similarity and relatedness measures in tasks focused on concept similarity. Assuming similarity as distinct from relatedness, the goal is to fill a gap within the current body of work in the evaluation of similarity and relatedness measures. Past studies have either focused entirely on relatedness or only evaluated judgments over words rather than concepts. In this study, first, concept similarity measures are evaluated over human judgments by using existing sets of word similarity pairs that we annotated with word senses. Next, an application-oriented study is presented by integrating similarity and relatedness measures into an algorithm which relies on concept similarity. Interestingly, the results find metrics categorized as measuring relatedness to be strongest in correlation with human judgments of concept similarity, though the difference in correlation is small. On the other hand, an information content metric, categorized as measuring similarity, is notably strongest according to the application-oriented evaluation.
A Linguistic Analysis of Student-Generated Paraphrases
Rus, Vasile (The University of Memphis) | Feng, Shi (The University of Memphis) | Brandon, Russell (The University of Memphis) | Crossley, Scott (Georgia State University) | McNamara, Danielle S. (The University of Memphis)
Paraphrase identification is a core Natural Language Processing task that involves assessing the semantic similarity of two texts. To foster systematic studies of this task, standardized datasets were created on which various approaches could be compared more fairly. However, a better understanding and more precise operational definition of a paraphrase are needed before any further datasets or systematic evaluations of the task of paraphrase identification are proposed. This study develops the concept of paraphrasing as a writing strategy. Six types of paraphrases are defined through the creation of a relatively large corpus of student-generated paraphrases. These paraphrases are analyzed along several dozen linguistic dimensions ranging from cohesion to lexical diversity. The most significant indices from these dimensions were then used to build a prediction model that could identify true and false paraphrases and each of the six paraphrase types.
Fairy Tales and ESL Texts: An Analysis of Linguistic Features Using the Gramulator
Rufenacht, Rachel M. (University of Memphis) | McCarthy, Philip M. (University of Memphis) | Lamkin, Travis A (University of Memphis)
Using the Gramulator, we analyzed the linguistic features of ESL texts and fairy tales. Our goal was to determine if fairy tales had the potential to be used as reading material for English language learners. The results of our analyses suggest that there are significant similarities between fairy tales and ESL texts, but that differences lie in the content of the text types with fairy tales appearing significantly more narrative in style and ESL texts appearing more expository.
Student Speech Act Classification Using Machine Learning
Rasor, Travis (University of Memphis) | Olney, Andrew (University of Memphis) | D'Mello, Sidney (University of Memphis)
Dialogue-based intelligent tutoring systems use speech act classifiers to categorize student input into answers, questions, and other speech acts. Previous work has primarily focused on question classification. In this paper, we present a complementary speech act classifier that focuses primarily on non-questions, which was developed using machine learning techniques. Our results show that an effective speech act classifier can be developed directly from labeled data using decision trees.
Given Bilingual Terminology in Statistical Machine Translation: MWE-Sensitive Word Alignment and Hierarchical Pitman-Yor Process-Based Translation Model Smoothing
Okita, Tsuyoshi (Dublin City University) | Way, Andy (Dublin City University)
This paper considers a scenario in which we are given almost perfect knowledge of the bilingual terminology for a test corpus in Statistical Machine Translation (SMT). When the given terminology is part of a training corpus, one natural strategy in SMT is to use the trained translation model and ignore the given terminology. Two questions then arise. 1) Can a word aligner capture the given terminology? We ask this because, even if the terminology appears in the training corpus, the resulting translation model often does not include it. 2) Are the probabilities in a translation model correctly calculated? To answer these questions, we conducted experiments introducing a Multi-Word Expression-sensitive (MWE-sensitive) word aligner and hierarchical Pitman-Yor process-based translation model smoothing. Using a 200k JP-EN NTCIR corpus, our experimental results show that introducing the MWE-sensitive word aligner and the new translation model smoothing yields an overall improvement of 1.35 BLEU points absolute (6.0% relative) over the case where neither is introduced.
Dissimilarity Kernels for Paraphrase Identification
Lintean, Mihai (University of Memphis) | Rus, Vasile (University of Memphis)
We present in this paper a novel solution to the problem of paraphrase identification based on lexical dissimilarity kernels. Lexical kernels in conjunction with Support Vector Machines are preferred over other learning methods, e.g. decision trees, due to their ability to handle a high number of features. Dissimilarity-based kernels emphasize dissimilarities among text fragments and therefore are appropriate for text similarity tasks characterized by high lexical overlap. We conducted experiments with our kernels on the Microsoft Research (MSR) Paraphrase Corpus, a standardized data set used for assessing approaches to paraphrase identification. Our reported accuracy results are competitive and robust when compared to state-of-the-art single-model approaches. The results were obtained using 10-fold cross-validation over the entire corpus. We also report competitive results on the test portion of the MSR Paraphrase Corpus, which is the standard way to report results on this corpus.
The Hierarchy of Detective Fiction: A Gramulator Analysis
Lamkin, Travis Alan (University of Memphis) | McCarthy, Philip (University of Memphis)
Closely related genres have complex interrelations. An antecedent genre can constrain a subsequent genre, but changing rhetorical situations can lead to distinctions between an antecedent and its descendant. In this study, we assess two genres of detective fiction to determine their hierarchical relation to one another. We use the Gramulator, a computational tool that identifies indicative lexical features, to explain the relationship between whodunit fiction and hardboiled fiction. We conclude, based on the indicative lexical features of the expositions in texts, that the two are sibling genres.