Gruhl, Daniel
Symbiotic Cognitive Computing through Iteratively Supervised Lexicon Induction
Alba, Alfredo (IBM Research) | Drews, Clemens (IBM Research) | Gruhl, Daniel (IBM Research) | Lewis, Neal (IBM Research) | Mendes, Pablo N. (IBM Research) | Nagarajan, Meenakshi (IBM Research) | Welch, Steve (IBM Research) | Coden, Anni (IBM Research) | Qadir, Ashequl (University of Utah)
In this paper we address a subset of semantic analysis tasks through symbiotic cognitive computing: the user and the system learn from each other and accomplish the tasks better than either would alone. Our approach starts with a domain expert building a simplified domain model (e.g., semantic lexicons) and annotating documents with that model. The system helps users obtain results more quickly and leads them to refine their understanding of the domain. Meanwhile, through feedback from the user, the system adapts more quickly and produces more accurate results. We believe this virtuous cycle is key to building next-generation, high-quality semantic analysis systems. We present some preliminary findings and discuss our results on four aspects of this virtuous cycle: the intrinsic incompleteness of semantic models, the need for a human in the loop, the benefits of a computer in the loop, and the overall improvements offered by human-computer interaction in the process.
Semantic Lexicon Induction from Twitter with Pattern Relatedness and Flexible Term Length
Qadir, Ashequl (University of Utah) | Mendes, Pablo N. (IBM Research) | Gruhl, Daniel (IBM Research) | Lewis, Neal (IBM Research)
With the rise of social media, learning from informal text has become increasingly important. We present a novel semantic lexicon induction approach that is able to learn new vocabulary from social media. Our method is robust to the idiosyncrasies of informal and open-domain text corpora. Unlike previous work, it does not impose restrictions on the lexical features of candidate terms (e.g., by restricting entries to nouns or noun phrases), while still being able to accurately learn multiword phrases of variable length. Starting with a few seed terms for a semantic category, our method first explores the context around seed terms in a corpus and identifies context patterns that are relevant to the category. These patterns are then used to extract candidate terms, i.e., multiword segments that are further analyzed to ensure meaningful term boundary segmentation. We show that our approach is able to learn high-quality semantic lexicons from informally written Twitter text, and can achieve accuracy as high as 92% in the top 100 learned category members.
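The seed-to-pattern-to-candidate loop described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes whitespace tokenization, treats a context pattern as the pair of words immediately left and right of a term, and scores a candidate by how often seeds occurred in the same pattern. The function names `find_patterns` and `extract_candidates`, the `max_len` parameter, and the toy corpus are all hypothetical, and the paper's pattern relatedness measure and term boundary analysis are not reproduced here.

```python
from collections import Counter


def find_patterns(corpus, seeds, max_len=3):
    """Count (left word, right word) contexts that surround seed terms."""
    seed_tokens = {tuple(s.lower().split()) for s in seeds}
    patterns = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(1, max_len + 1):
            # i starts at 1 and stops early so both neighbours exist.
            for i in range(1, len(tokens) - n):
                if tuple(tokens[i:i + n]) in seed_tokens:
                    patterns[(tokens[i - 1], tokens[i + n])] += 1
    return patterns


def extract_candidates(corpus, patterns, seeds, max_len=3):
    """Score non-seed spans of 1..max_len tokens found in known patterns."""
    seed_set = {s.lower() for s in seeds}
    scores = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(1, max_len + 1):
            for i in range(1, len(tokens) - n):
                context = (tokens[i - 1], tokens[i + n])
                if context in patterns:
                    term = " ".join(tokens[i:i + n])
                    if term not in seed_set:
                        # Weight by how many seed occurrences share the context.
                        scores[term] += patterns[context]
    return scores


# Toy corpus and seeds, invented for illustration only.
corpus = [
    "i love eating pizza with friends",
    "we were eating sushi with chopsticks",
    "she was eating ice cream with sprinkles",
    "he went running with friends",
]
seeds = ["pizza", "sushi"]

patterns = find_patterns(corpus, seeds)
candidates = extract_candidates(corpus, patterns, seeds)
print(candidates.most_common())
```

On this toy input, the only learned pattern is the context "eating ... with", and the only candidate is the two-word term "ice cream", which shows how a variable-length span can be picked up without restricting candidates to a part of speech. A fuller system would iterate, adding accepted candidates to the seed set and re-running both steps.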