Asia
Mining Query Subtopics from Questions in Community Question Answering
Wu, Yu (Beihang University) | Wu, Wei (Microsoft Research Asia) | Li, Zhoujun (Beihang University) | Zhou, Ming (Microsoft Research Asia)
This paper proposes mining query subtopics from questions in community question answering (CQA). The subtopics are represented as a number of clusters of questions with keywords summarizing the clusters. The task is unique in that the subtopics from questions can not only facilitate user browsing in CQA search, but also describe aspects of queries from a question-answering perspective. The challenges of the task include how to group semantically similar questions and how to find keywords capable of summarizing the clusters. We formulate the subtopic mining task as a non-negative matrix factorization (NMF) problem and further extend the NMF model to incorporate question similarity estimated from CQA metadata into learning. Compared with existing methods, our method can jointly optimize question clustering and keyword extraction and encourage the former task to enhance the latter. Experimental results on large-scale real-world CQA datasets show that the proposed method significantly outperforms existing methods in terms of keyword extraction, while achieving performance comparable to state-of-the-art methods for question clustering.
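As a rough illustration of the NMF formulation mentioned in the abstract above, the sketch below factorizes a toy question-term matrix with plain multiplicative updates, then reads cluster assignments from one factor and keywords from the other. It is not the authors' extended model: the metadata-based question similarity term is omitted, and the toy data, the function name `nmf`, and all parameters are assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Plain NMF via multiplicative updates: X ~ W @ H with W, H >= 0.

    This is the standard Frobenius-norm variant, used here as a stand-in;
    the paper's model additionally uses question similarity from metadata.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps   # question-to-subtopic weights
    H = rng.random((k, m)) + eps   # subtopic-to-term weights
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are questions, columns are term counts.
terms = ["price", "cost", "cheap", "install", "setup", "driver"]
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
], dtype=float)

W, H = nmf(X, k=2)
# Each question joins the subtopic with the largest weight in its row of W.
clusters = W.argmax(axis=1)
# Each subtopic is summarized by its two highest-weighted terms in H.
keywords = [[terms[j] for j in np.argsort(-H[c])[:2]] for c in range(2)]
```

Because a single factorization yields both W and H, clustering and keyword extraction come out of one optimization, which is the joint property the abstract emphasizes.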
Mining User Intents in Twitter: A Semi-Supervised Approach to Inferring Intent Categories for Tweets
Wang, Jinpeng (Peking University) | Cong, Gao (Nanyang Technological University) | Zhao, Xin Wayne (Renmin University of China) | Li, Xiaoming (Peking University)
In this paper, we propose to study the problem of identifying and classifying tweets into intent categories. For example, a tweet "I wanna buy a new car" indicates the user's intent to buy a car. Identifying such intent tweets has great commercial value, among other benefits. In particular, it is important that we can distinguish different types of intent tweets. We propose to classify intent tweets into six categories, namely Food & Drink, Travel, Career & Education, Goods & Services, Event & Activities, and Trifle. We propose a semi-supervised learning approach to categorizing intent tweets into the six categories. We construct a test collection by using a bootstrap method. Our experimental results show that our approach is effective in inferring intent categories for tweets.
Relating Romanized Comments to News Articles by Inferring Multi-Glyphic Topical Correspondence
Tholpadi, Goutham (Indian Institute of Science, Bangalore) | Das, Mrinal Kanti (Indian Institute of Science, Bangalore) | Bansal, Trapit (Indian Institute of Science, Bangalore) | Bhattacharyya, Chiranjib (Indian Institute of Science, Bangalore)
Commenting is a popular facility provided by news sites. Analyzing such user-generated content has recently attracted research interest. However, in multilingual societies such as India, analyzing such user-generated content is hard for several reasons: (1) There are more than 20 official languages, but linguistic resources are available mainly for Hindi. People frequently use romanized text because it is easy and quick to type on an English keyboard, resulting in multi-glyphic comments, where the texts are in the same language but in different scripts. Such romanized texts are almost unexplored in machine learning so far. (2) In many cases, comments are made on a specific part of the article rather than the topic of the entire article. Off-the-shelf methods such as correspondence LDA are insufficient to model such relationships between articles and comments. In this paper, we extend the notion of correspondence to model multi-lingual, multi-script, and inter-lingual topics in a unified probabilistic model called the Multi-glyphic Correspondence Topic Model (MCTM). Using several metrics, we verify our approach and show that it improves over the state-of-the-art.
A Hybrid Approach of Classifier and Clustering for Solving the Missing Node Problem
Sina, Sigal (Bar-Ilan University) | Rosenfeld, Avi (Jerusalem College of Technology) | Kraus, Sarit (Bar-Ilan University) | Akiva, Navot (Bar-Ilan University)
An important area of social network research is identifying missing information which is not explicitly represented in the network or is not visible to all. In this paper, we propose a novel Hybrid Approach of Classifier and Clustering, which we refer to as HACC, to solve the missing node identification problem in social networks. HACC utilizes a classifier as a preprocessing step in order to integrate all known information into one similarity measure and then uses a clustering algorithm to identify missing nodes. Specifically, we used information on the network structure, attributes of known users (nodes), and pictorial information to evaluate HACC and found that it performs significantly better than other missing node algorithms. We also argue that HACC is a general, domain-independent approach that can be easily applied to other domains. We support this claim by evaluating HACC on a second domain, authorship identification, as well.
Question/Answer Matching for CQA System via Combining Lexical and Sequential Information
Shen, Yikang (Beihang University) | Rong, Wenge (Beihang University) | Sun, Zhiwei (Beihang University) | Ouyang, Yuanxin (Beihang University) | Xiong, Zhang (Beihang University)
Community-based Question Answering (CQA) has become popular in knowledge sharing sites since it allows users to get answers to complex, detailed, and personal questions directly from other users. Large archives of historical questions and associated answers have been accumulated. Retrieving relevant historical answers that best match a question is an essential component of a CQA service. Most state-of-the-art approaches are based on bag-of-words models, which have proven successful in a range of text matching tasks, but are insufficient for capturing the important word sequence information in short text matching. In this paper, a new architecture is proposed to more effectively model the complicated matching relations between questions and answers. It utilises a similarity matrix which contains both lexical and sequential information. This information is then fed into a deep architecture to find potentially suitable answers. The experimental study shows its potential for improving the matching accuracy of questions and answers.
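To make the idea of a similarity matrix carrying both lexical and sequential information more concrete, here is a minimal sketch. It assumes exact word match as the similarity function and whitespace tokenization, neither of which is stated in the abstract; the function names are hypothetical, and the deep architecture that consumes the matrix is not shown. Matching cells give the lexical signal, while runs of matches along a diagonal correspond to word sequences shared by question and answer.

```python
def similarity_matrix(question, answer):
    """Word-by-word match matrix: rows are question tokens, columns answer tokens.

    Exact match is an assumed stand-in for whatever word similarity
    the actual system uses.
    """
    q = question.lower().split()
    a = answer.lower().split()
    return [[1.0 if qw == aw else 0.0 for aw in a] for qw in q]

def longest_diagonal_run(M):
    """Length of the longest run of matches along any diagonal of M.

    A run of length r means r consecutive words appear in the same
    order in both texts, which a bag-of-words model cannot see.
    """
    if not M or not M[0]:
        return 0
    rows, cols = len(M), len(M[0])
    run = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if M[i][j] > 0:
                prev = run[i - 1][j - 1] if i > 0 and j > 0 else 0
                run[i][j] = prev + 1
                best = max(best, run[i][j])
    return best

M = similarity_matrix("how do i reset my password",
                      "you can reset my password in settings")
shared = longest_diagonal_run(M)  # "reset my password" is the shared sequence
```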
Approximating Model-Based ABox Revision in DL-Lite: Theory and Practice
Qi, Guilin (Southeast University) | Wang, Zhe (Griffith University) | Wang, Kewen (Griffith University) | Fu, Xuefeng (Southeast University) | Zhuang, Zhiqiang (Griffith University)
Model-based approaches provide a semantically well-justified way to revise ontologies. However, in general, model-based revision operators are limited due to the lack of efficient algorithms and the inexpressibility of the revision results. In this paper, we make both theoretical and practical contributions to the efficient computation of model-based revisions in DL-Lite. Specifically, we show that maximal approximations of two well-known model-based revisions for DL-Lite_R can be computed using a syntactic algorithm. However, such a coincidence of model-based and syntactic approaches does not hold when role functionality axioms are allowed. As a result, we identify conditions that guarantee such a coincidence for DL-Lite_FR. Our result shows that both model-based and syntactic revisions can co-exist seamlessly and the advantages of both approaches can be taken in one revision operator. Based on our theoretical results, we develop a graph-based algorithm for the revision operat
Content-Based Collaborative Filtering for News Topic Recommendation
Lu, Zhongqi (Hong Kong University of Science and Technology) | Dou, Zhicheng (Renmin University of China) | Lian, Jianxun (Microsoft Research) | Xie, Xing (Microsoft Research) | Yang, Qiang (Hong Kong University of Science and Technology)
News recommendation has become a big attraction with which major Web search portals retain their users. Two effective approaches are Content-based Filtering and Collaborative Filtering, each serving a specific recommendation scenario. The Content-based Filtering approaches inspect rich contexts of the recommended items, while the Collaborative Filtering approaches predict the interests of long-tail users by collaboratively learning from interests of related users. We have observed empirically that, for the problem of news topic displaying, both the rich context of news topics and the long-tail users exist. Therefore, in this paper, we propose a Content-based Collaborative Filtering approach (CCF) to bring both Content-based Filtering and Collaborative Filtering approaches together. We found that combining the two is not an easy task, but the benefits of CCF are impressive. On the one hand, CCF makes recommendations based on the rich contexts of the news. On the other hand, CCF collaboratively analyzes the scarce feedback from the long-tail users. We tailored this CCF approach for news topic displaying on the Bing front page and demonstrated great gains in attracting users. In the experiments and analyses part of this paper, we discuss the performance gains and insights in news topic recommendation in Bing.
Multi-Document Summarization Based on Two-Level Sparse Representation Model
Liu, He (Peking University) | Yu, Hongliang (Peking University) | Deng, Zhi-Hong (Peking University)
Multi-document summarization is of great value to many real-world applications since it can help people get the main ideas within a short time. In this paper, we tackle the problem of extracting summary sentences from multi-document sets by applying sparse coding techniques and present a novel framework for this challenging problem. Based on the data reconstruction and sentence denoising assumption, we present a two-level sparse representation model to depict the process of multi-document summarization. Three requisite properties are proposed to form an ideal reconstructable summary: Coverage, Sparsity and Diversity. We then formalize the task of multi-document summarization as an optimization problem according to the above properties, and use a simulated annealing algorithm to solve it. Extensive experiments on the summarization benchmark data sets DUC2006 and DUC2007 show that our proposed model is effective and outperforms the state-of-the-art algorithms.
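The optimization step described in the abstract above can be pictured with a small simulated annealing loop. The sketch below is a simplified stand-in, not the two-level sparse representation model: it assumes a hand-made objective in which coverage is the sum of each sentence's best cosine similarity to a selected sentence, diversity is a penalty on similarity among selected sentences, and sparsity is enforced by fixing the summary size `k`. The vectors, the weight `lam`, the cooling schedule, and all function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score(selected, vectors, lam=0.5):
    """Assumed objective: coverage minus lam times redundancy."""
    coverage = sum(max(cosine(v, vectors[s]) for s in selected) for v in vectors)
    redundancy = sum(cosine(vectors[a], vectors[b])
                     for a in selected for b in selected if a < b)
    return coverage - lam * redundancy

def anneal(vectors, k, steps=500, t0=1.0, seed=0):
    """Pick k sentence indices (k < len(vectors)) by simulated annealing."""
    rng = random.Random(seed)
    n = len(vectors)
    current = rng.sample(range(n), k)
    cur_score = score(current, vectors)
    best, best_score = list(current), cur_score
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-6   # linear cooling schedule
        cand = list(current)
        # Neighbor move: swap one selected sentence for an unselected one.
        cand[rng.randrange(k)] = rng.choice(
            [i for i in range(n) if i not in current])
        cand_score = score(cand, vectors)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_score >= cur_score or rng.random() < math.exp((cand_score - cur_score) / t):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = list(current), cur_score
    return sorted(best), best_score

# Hypothetical sentence vectors: indices 0, 1, 4 share one topic; 2, 3 another.
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [1, 0.1, 0]]
summary, summary_score = anneal(vectors, k=2)
```

With this objective, a two-sentence summary scores highest when it takes one sentence from each topic, since that covers all sentences while keeping the selected pair dissimilar.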
Cross-Modal Image Clustering via Canonical Correlation Analysis
Jin, Cheng (Fudan University) | Mao, Wenhui (Fudan University) | Zhang, Ruiqi (Fudan University) | Zhang, Yuejie (Fudan University) | Xue, Xiangyang (Fudan University)
A new algorithm via Canonical Correlation Analysis (CCA) is developed in this paper to support more effective cross-modal image clustering for large-scale annotated image collections. It can be treated as a bi-media multimodal mapping problem and modeled as a correlation distribution over multimodal feature representations. It integrates the multimodal feature generation with the Locality Linear Coding (LLC) and co-occurrence association network, multimodal feature fusion with CCA, and accelerated hierarchical k-means clustering, which aims to characterize the correlations between the inter-related visual features in images and semantic features in captions, and measure their association degree more precisely. Very positive results were obtained in our experiments using a large quantity of public data.
A Stochastic Model for Detecting Heterogeneous Link Communities in Complex Networks
He, Dongxiao (Tianjin University) | Liu, Dayou (Jilin University) | Jin, Di (Tianjin University) | Zhang, Weixiong (Washington University in Saint Louis)
Discovery of communities in networks is a fundamental data analysis problem. Most of the existing approaches have focused on discovering communities of nodes, while recent studies have shown great advantages and utilities of the knowledge of communities of links. Stochastic models provide a promising class of techniques for the identification of modular structures, but most stochastic models mainly focus on the detection of node communities rather than link communities. We propose a stochastic model, which not only describes the structure of link communities, but also considers the heterogeneous distribution of community sizes, a property which is often ignored by other models. We then learn the model parameters using a method of maximum likelihood based on an expectation-maximization algorithm. To deal with large complex real networks, we extend the method by a strategy of iterative bipartition. The extended method is not only efficient, but is also able to determine the number of communities for a given network. We test our approach on both synthetic benchmarks and real-world networks, including an application to a large biological network, and also compare it with two existing methods. The results demonstrate the superior performance of our approach over the competing methods for detecting link communities.