Goto

Collaborating Authors

 Information Retrieval


Exemplar-Based Topic Detection in Twitter Streams

AAAI Conferences

Detecting topics in Twitter streams has been gaining an increasing amount of attention. It can be of great support for communities struck by natural disasters, and could assist companies and political parties understand users' opinions and needs. Traditional approaches for topic detection focus on representing topics using terms, are negatively affected by length limitation and the lack of context associated with tweets.In this work, we propose an Exemplar-based approach for topic detection, in which detected topics are represented using a few selected tweets. Using exemplar tweets instead of a set of key words allows for an easy interpretation of the meaning of the detected topics. Experimental evaluation on benchmark Twitter datasets shows that the proposed topic detection approach achieves the best term precision. It does this while maintaining good topic recall and running time compared to other approaches.


Construction of FuzzyFind Dictionary using Golay Coding Transformation for Searching Applications

arXiv.org Artificial Intelligence

Searching through a large volume of data is very critical for companies, scientists, and searching engines applications due to time complexity and memory complexity. In this paper, a new technique of generating FuzzyFind Dictionary for text mining was introduced. We simply mapped the 23 bits of the English alphabet into a FuzzyFind Dictionary or more than 23 bits by using more FuzzyFind Dictionary, and reflecting the presence or absence of particular letters. This representation preserves closeness of word distortions in terms of closeness of the created binary vectors within Hamming distance of 2 deviations. This paper talks about the Golay Coding Transformation Hash Table and how it can be used on a FuzzyFind Dictionary as a new technology for using in searching through big data. This method is introduced by linear time complexity for generating the dictionary and constant time complexity to access the data and update by new data sets, also updating for new data sets is linear time depends on new data points. This technique is based on searching only for letters of English that each segment has 23 bits, and also we have more than 23-bit and also it could work with more segments as reference table.


Named Entity Recognition in Travel-Related Search Queries

AAAI Conferences

This paper addresses the problem of named entity recognition (NER) in travel-related search queries. NER is an important step toward a richer understanding of user-generated inputs in information retrieval systems. NER in queries is challenging due to minimal context and few structural clues. NER in restricted-domain queries is useful in vertical search applications, for example following query classification in general search. This paper describes an efficient machine learning-based solution for the high-quality extraction of semantic entities from query inputs in a restricted-domain information retrieval setting. We apply a conditional random field (CRF) sequence model to travel-domain search queries and achieve high-accuracy results. Our approach yields an overall F1 score of 86.4% on a held-out test set, outperforming a baseline score of 82.0% on a CRF with standard features. The resulting NER classifier is currently in use in a real-life travel search engine.


Learning Word Vectors Efficiently Using Shared Representations and Document Representations

AAAI Conferences

We propose some better word embedding models based on vLBL model and ivLBL model by sharing representations between context and target words and using document representations. Our proposed models are much simpler which have almost half less parameters than the state-of-the-art methods. We achieve better results on word analogy task than the best ones reported before using significantly less training data and computing time.


Lower and Upper Bounds for SPARQL Queries over OWL Ontologies

AAAI Conferences

The paper presents an approach for optimizing the evaluation of SPARQL queries over OWL ontologies using SPARQL's OWL Direct Semantics entailment regime. The approach is based on the computation of lower and upper bounds, but we allow for much more expressive queries than related approaches. In order to optimize the evaluation of possible query answers in the upper but not in the lower bound, we present a query extension approach that uses schema knowledge from the queried ontology to extend the query with additional parts. We show that the resulting query is equivalent to the original one and we use the additional parts that are simple to evaluate for restricting the bounds of subqueries of the initial query. In an empirical evaluation we show that the proposed query extension approach can lead to a significant decrease in the query execution time of up to four orders of magnitude.


Entity Resolution in a Big Data Framework

AAAI Conferences

Entity Resolution (ER) concerns identifying logically equivalent pairs of entities that may be syntactically disparate. Although ER is a long-standing problem in the artificial intelligence community, the growth of Linked Open Data, a collection of semi-structured datasets published and inter-connected on the Web, mandates a new approach. The thesis is that building a viable Entity Resolution solution for serving Big Data needs requires simultaneously resolving challenges of automation, heterogeneity, scalability and domain independence. The dissertation aims to build such a system and evaluate it on real-world datasets published already as Linked Open Data.


Ordering-Sensitive and Semantic-Aware Topic Modeling

AAAI Conferences

Topic modeling of textual corpora is an important and challenging problem. In most previous work, the โ€œbag-of-wordsโ€ assumption is usually made which ignores the ordering of words. This assumption simplifies the computation, but it unrealistically loses the ordering information and the semantic of words in the context. In this paper, we present a Gaussian Mixture Neural Topic Model (GMNTM) which incorporates both the ordering of words and the semantic meaning of sentences into topic modeling. Specifically, we represent each topic as a cluster of multi-dimensional vectors and embed the corpus into a collection of vectors generated by the Gaussian mixture model. Each word is affected not only by its topic, but also by the embedding vector of its surrounding words and the context. The Gaussian mixture components and the topic of documents, sentences and words can be learnt jointly. Extensive experiments show that our model can learn better topics and more accurate word distributions for each topic. Quantitatively, comparing to state-of-the-art topic modeling approaches, GMNTM obtains significantly better performance in terms of perplexity, retrieval accuracy and classification accuracy.


Unsupervised Phrasal Near-Synonym Generation from Text Corpora

AAAI Conferences

Unsupervised discovery of synonymous phrases is useful in a variety of tasks ranging from text mining and search engines to semantic analysis and machine translation. This paper presents an unsupervised corpus-based conditional model: Near-Synonym System (NeSS) for finding phrasal synonyms and near synonyms that requires only a large monolingual corpus. The method is based on maximizing information-theoretic combinations of shared contexts and is parallelizable for large-scale processing. An evaluation framework with crowd-sourced judgments is proposed and results are compared with alternate methods, demonstrating considerably superior results to the literature and to thesaurus look up for multi-word phrases. Moreover, the results show that the statistical scoring functions and overall scalability of the system are more important than language specific NLP tools. The method is language-independent and practically useable due to accuracy and real-time performance via parallel decomposition.


Exploring Key Concept Paraphrasing Based on Pivot Language Translation for Question Retrieval

AAAI Conferences

Question retrieval in current community-based question answering (CQA) services does not, in general, work well for long and complex queries. One of the main difficulties lies in the word mismatch between queries and candidate questions. Existing solutions try to expand the queries at word level, but they usually fail to consider concept level enrichment. In this paper, we explore a pivot language translation based approach to derive the paraphrases of key concepts. We further propose a unified question retrieval model which integrates the keyconcepts and their paraphrases for the query question. Experimental results demonstrate that the paraphrase enhanced retrieval model significantly outperforms the state-of-the-art models in question retrieval.


Mining Query Subtopics from Questions in Community Question Answering

AAAI Conferences

This paper proposes mining query subtopics from questions in community question answering (CQA). The subtopics are represented as a number of clusters of questions with keywords summarizing the clusters. The task is unique in that the subtopics from questions can not only facilitate user browsing in CQA search, but also describe aspects of queries from a question-answering perspective. The challenges of the task include how to group semantically similar questions and how to find keywords capable of summarizing the clusters. We formulate the subtopic mining task as a non-negative matrix factorization (NMF) problem and further extend the model of NMF to incorporate question similarity estimated from metadata of CQA into learning. Compared with existing methods, our method can jointly optimize question clustering and keyword extraction and encourage the former task to enhance the latter. Experimental results on large scale real world CQA datasets show that the proposed method significantly outperforms the existing methods in terms of keyword extraction, while achieving a comparable performance to the state-of-the-art methods for question clustering.