Goto

Collaborating Authors

 Asia


Diversifying Query Suggestion Results

AAAI Conferences

In order to improve the user search experience, Query Suggestion, a technique for generating alternative queries to Web users, has become an indispensable feature for commercial search engines. However, previous work mainly focuses on suggesting relevant queries to the original query while ignoring the diversity in the suggestions, which will potentially dissatisfy Web users' information needs. In this paper, we present a novel unified method to suggest both semantically relevant and diverse queries to Web users. The proposed approach is based on Markov random walk and hitting time analysis on the query-URL bipartite graph. It can effectively prevent semantically redundant queries from receiving a high rank, hence encouraging diversities in the results. We evaluate our method on a large commercial clickthrough dataset in terms of relevance measurement and diversity measurement. The experimental results show that our method is very effective in generating both relevant and diverse query suggestions.


Sentiment Analysis with Global Topics and Local Dependency

AAAI Conferences

With the development of Web 2.0, sentiment analysis has now become a popular research problem to tackle. Recently, topic models have been introduced for the simultaneous analysis for topics and the sentiment in a document. These studies, which jointly model topic and sentiment, take the advantage of the relationship between topics and sentiment, and are shown to be superior to traditional sentiment analysis tools. However, most of them make the assumption that, given the parameters, the sentiments of the words in the document are all independent. In our observation, in contrast, sentiments are expressed in a coherent way. The local conjunctive words, such as “and” or “but”, are often indicative of sentiment transitions. In this paper, we propose a major departure from the previous approaches by making two linked contributions. First, we assume that the sentiments are related to the topic in the document, and put forward a joint sentiment and topic model, i.e. Sentiment-LDA. Second, we observe that sentiments are dependent on local context. Thus, we further extend the Sentiment-LDA model to Dependency-Sentiment-LDA model by relaxing the sentiment independent assumption in Sentiment-LDA. The sentiments of words are viewed as a Markov chain in Dependency-Sentiment-LDA. Through experiments, we show that exploiting the sentiment dependency is clearly advantageous, and that the Dependency-Sentiment-LDA is an effective approach for sentiment analysis.


Learning to Predict Opinion Share in Social Networks

AAAI Conferences

Blogosphere and sites such as for social networking, There has been a variety of work on the voter model. Dynamical knowledge-sharing and media-sharing in the World Wide properties of the basic model, including how the degree Web have enabled to form various kinds of large social distribution and the network size affect the mean time networks, through which behaviors, ideas and opinions to reach consensus, have been extensively studied (Liggett can spread. Thus, substantial attention has been directed 1999; Sood and Redner 2005) from mathematical point to investigating the spread of influence in these networks of view. Several variants of the voter model are also investigated (Leskovec, Adamic, and Huberman 2007; Crandall et al.


Towards an Intelligent Code Search Engine

AAAI Conferences

Software developers increasingly rely on information from the Web, such as documents or code examples on Application Programming Interfaces (APIs), to facilitate their development processes. However, API documents often do not include enough information for developers to fully understand the API usages, while searching for good code examples requires non-trivial efforts. To address this problem, we propose a novel code search engine, combining the strength of browsing documents and searching for code examples, by returning documents embedded with high-quality code example summaries mined from the Web. Our evaluation results show that our approach provides code examples with high precision and boosts programmer productivity.


PR + RQ ≈ PQ: Transliteration Mining Using Bridge Language

AAAI Conferences

We address the problem of mining name transliterations from comparable corpora in languages P and Q in the following resource-poor scenario: Parallel names in PQ are not available for training. Parallel names in PR and RQ are available for training. We propose a novel solution for the problem by computing a common geometric feature space for P,Q and R where name transliterations are mapped to similar vectors. We employ Canonical Correlation Analysis (CCA) to compute the common geometric feature space using only parallel names in PR and RQ and without requiring parallel names in  PQ. We test our algorithm on data sets in several languages and show that it gives results comparable to the state-of-the-art transliteration mining algorithms that use parallel names in PQ for training.


Utilizing Context in Generative Bayesian Models for Linked Corpus

AAAI Conferences

In an interlinked corpus of documents, the context in which a citation appears provides extra information about the cited document. However, associating terms in the context to the cited document remains an open problem. We propose a novel document generation approach that statistically incorporates the context in which a document links to another document. We quantitatively show that the proposed generation scheme explains the linking phenomenon better than previous approaches. The context information along with the actual content of the document provides significant improvements over the previous approaches for various real world evaluation tasks such as link prediction and log-likelihood estimation on unseen content. The proposed method is more scalable to large collection of documents compared to the previous approaches.


Prioritization of Domain-Specific Web Information Extraction

AAAI Conferences

It is often desirable to extract structured information from raw web pages for better information browsing, query answering, and pattern mining. many such Information Extraction (IE) technologies are costly and applying them at the web-scale is impractical. In this paper, we propose a novel prioritization approach where candidate pages from the corpus are ordered according to their expected contribution to the extraction results and those with higher estimated potential are extracted earlier. Systems employing this approach can stop the extraction process at any time when the resource gets scarce (i.e., not all pages in the corpus can be processed), without worrying about wasting extraction effort on unimportant pages. More specifically, we define a novel notion to measure the value of extraction results and design various mechanisms for estimating a candidate page’s contribution to this value. We further design and build the Extraction Prioritization (EP) system with efficient scoring and scheduling algorithms, and experimentally demonstrate that EP significantly outperforms the naive approach and is more flexible than the classifier approach.


Optimal Strategies for Reviewing Search Results

AAAI Conferences

Web search engines respond to a query by returning more results than can be reasonably reviewed. These results typically include the title, link, and snippet of content from the target link. Each result has the potential to be useful or useless and thus reviewing it has a cost and potential benefit. This paper studies the behavior of a rational agent in this setting, whose objective is to maximize the probability of finding a satisfying result while minimizing cost. We propose two similar agents with different capabilities: one that only compares result snippets relatively and one that predicts from the result snippet whether the result will be satisfying. We prove that the optimal strategy for both agents is a stopping rule: the agent reviews a fixed number of results until the marginal cost is greater than the marginal expected benefit, maximizing the overall expected utility. Finally, we discuss the relationship between rational agents and search users and how our findings help us understand reviewing behaviors.


Visual Contextual Advertising: Bringing Textual Advertisements to Images

AAAI Conferences

Advertising in the case of textual Web pages has been studied extensively by many researchers. However, with the increasing amount of multimedia data such as image, audio and video on the Web, the need for recommending advertisement for the multimedia data is becoming a reality. In this paper, we address the novel problem of visual contextual advertising, which is to directly advertise when users are viewing images which do not have any surrounding text. A key challenging issue of visual contextual advertising is that images and advertisements are usually represented in image space and word space respectively, which are quite different with each other inherently. As a result, existing methods for Web page advertising are inapplicable since they represent both Web pages and advertisement in the same word space. In order to solve the problem, we propose to exploit the social Web to link these two feature spaces together. In particular, we present a unified generative model to integrate advertisements, words and images. Specifically, our solution combines two parts in a principled approach: First, we transform images from a image feature space to a word space utilizing the knowledge from images with annotations from social Web. Then, a language model based approach is applied to estimate the relevance between transformed images and advertisements. Moreover, in this model, the probability of recommending an advertisement can be inferred efficiently given an image, which enables potential applications to online advertising.


Toward an Architecture for Never-Ending Language Learning

AAAI Conferences

We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.