Learning to Identify Review Spam
Li, Fangtao (Tsinghua University) | Huang, Minlie (Tsinghua University) | Yang, Yi (Tsinghua University) | Zhu, Xiaoyan (Tsinghua University)
In the past few years, sentiment analysis and opinion mining have become popular and important tasks. These studies all assume that their opinion resources are real and trustworthy. However, they may encounter the problem of faked opinions, or opinion spam. In this paper, we study this issue in the context of our product review mining system. On product review sites, people may write faked reviews, called review spam, to promote their own products or defame their competitors' products. It is important to identify and filter out review spam. Previous work focuses only on heuristic rules, such as helpfulness voting or rating deviation, which limits the performance of this task. In this paper, we exploit machine learning methods to identify review spam. To this end, we manually build a spam collection from our crawled reviews. We first analyze the effect of various features in spam identification. We also observe that review spammers consistently write spam. This provides another view for identifying review spam: we can identify whether the author of a review is a spammer. Based on this observation, we provide a two-view semi-supervised method, co-training, to exploit the large amount of unlabeled data. The experimental results show that our proposed method is effective. Our machine learning methods achieve significant improvements over the heuristic baselines.
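The two-view co-training idea in this abstract can be illustrated with a minimal sketch. It assumes each review is reduced to one hypothetical numeric feature per view ("review" for review-level evidence, "reviewer" for author-level evidence) and uses a nearest-centroid scorer as a stand-in for whatever classifiers the authors actually trained; the abstract names neither the features nor the classifiers.

```python
from statistics import mean

def centroid_scorer(labeled, view):
    """Fit a nearest-centroid scorer on one view (needs both classes present).

    Returns a function whose value is positive when an example lies
    closer to the spam centroid than to the non-spam centroid.
    """
    c_spam = mean(x[view] for x, y in labeled if y == 1)
    c_ham = mean(x[view] for x, y in labeled if y == 0)
    return lambda x: abs(x[view] - c_ham) - abs(x[view] - c_spam)

def co_train(labeled, unlabeled, rounds=10):
    """Grow the labeled pool: in turn, each view labels its most confident example."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in ("review", "reviewer"):  # hypothetical view names
            if not unlabeled:
                return labeled
            score = centroid_scorer(labeled, view)
            # The example this view is most sure about, in either direction.
            best = max(unlabeled, key=lambda x: abs(score(x)))
            unlabeled.remove(best)
            labeled.append((best, 1 if score(best) > 0 else 0))
    return labeled

# Toy data (made up): 1 = spam, 0 = not spam.
seed = [({"review": 0.9, "reviewer": 0.8}, 1),
        ({"review": 0.1, "reviewer": 0.2}, 0)]
pool = [{"review": 0.8, "reviewer": 0.5},   # clear only in the review view
        {"review": 0.5, "reviewer": 0.1}]   # clear only in the reviewer view
result = co_train(seed, pool)
```

In this toy run the review view labels the first pooled example and the reviewer view then labels the second, which is the point of using two views: each one can supply labels where the other is uncertain.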
Resource-Bounded Crowd-Sourcing of Commonsense Knowledge
Kuo, Yen-Ling (National Taiwan University) | Hsu, Jane Yung-jen (National Taiwan University)
Knowledge acquisition is the essential process of extracting and encoding knowledge, both domain-specific and commonsense, to be used in intelligent systems. While many large knowledge bases have been constructed, none is close to complete. This paper presents an approach to improving a knowledge base efficiently under resource constraints. Using a guiding knowledge base, questions are generated from a weak form of similarity-based inference given the glossary mapping between two knowledge bases. The candidate questions are prioritized in terms of the concept coverage of the target knowledge. Experiments were conducted to find questions to grow the Chinese ConceptNet using the English ConceptNet as a guide. The results were evaluated by online users to verify that 94.17% of the questions and 85.77% of the answers are good. In addition, the answers collected in a six-week period showed consistent improvement to a 36.33% increase in concept coverage of the Chinese commonsense knowledge base against the English ConceptNet.
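One plausible reading of "prioritized in terms of the concept coverage" under a resource bound is a greedy selection that repeatedly picks the question adding the most uncovered concepts. The sketch below illustrates that reading only; the abstract does not give the authors' actual prioritization rule, and the question identifiers, concept sets, and `budget` parameter are hypothetical.

```python
def prioritize(questions, budget):
    """Greedily choose up to `budget` questions by marginal concept coverage.

    `questions` maps a question id to the set of target concepts it would cover.
    """
    remaining = dict(questions)
    covered, chosen = set(), []
    while remaining and len(chosen) < budget:
        # Question that adds the most concepts not yet covered.
        best = max(remaining, key=lambda q: len(remaining[q] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered

# Toy candidates (made up).
candidates = {"q1": {"tea", "cup", "drink"},
              "q2": {"cup", "drink"},
              "q3": {"bicycle"}}
chosen, covered = prioritize(candidates, budget=2)
```

With a budget of two, the sketch skips `q2` because everything it covers is already covered by `q1`, and spends the second question on `q3` instead.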
A New Search Engine Integrating Hierarchical Browsing and Keyword Search
Kuang, Da (The University of Western Ontario) | Li, Xiao (The University of Western Ontario) | Ling, Charles X. (The University of Western Ontario)
The original Yahoo! search engine consisted of a manually organized topic hierarchy of webpages for easy browsing. Modern search engines (such as Google and Bing), on the other hand, return a flat list of webpages based on keywords. It would be ideal if hierarchical browsing and keyword search could be seamlessly combined. The main difficulty in doing so is to automatically (i.e., not manually) classify and rank a massive number of webpages into various hierarchies (such as topics, media types, regions of the world). In this paper we report our attempt towards building this integrated search engine, called SEE (Search Engine with hiErarchy). We implement a hierarchical classification system based on Support Vector Machines, and embed it in SEE. We also design a novel user interface that allows users to dynamically adjust their preference for higher accuracy vs. more results in any (sub)category of the hierarchy. Though our current search engine is still small (indexing about 1.2 million webpages), the results, including a small user study, have shown great promise for integrating such techniques in the next-generation search engine.
Learning Compact Visual Descriptor for Low Bit Rate Mobile Landmark Search
Ji, Rongrong (Peking University and Harbin Institute of Technology) | Duan, Ling-Yu (Peking University) | Chen, Jie (Peking University) | Yao, Hongxun (Harbin Institute of Technology) | Huang, Tiejun (Peking University) | Gao, Wen (Peking University)
In this paper, we propose to extract a compact yet discriminative visual descriptor directly on the mobile device, which tackles the wireless query transmission latency in mobile landmark search. This descriptor is learnt offline from the location contexts of geo-tagged Web photos from both Flickr and Panoramio in two phases: First, we segment the landmark photo collections into discrete geographical regions using a Gaussian Mixture Model [Stauffer et al., 2000]. Second, a ranking sensitive vocabulary boosting is introduced to learn a compact codebook within each region. To tackle the locally optimal descriptor learning caused by imprecise geographical segmentation, we further iterate the above phases by feeding an “entropy”-based descriptor compactness measure back into a prior distribution to constrain the Gaussian mixture modeling. Consequently, when entering a specific geographical region, the codebook in the mobile device is downstream adapted, which ensures efficient extraction of the compact descriptor, its low bit rate transmission, as well as promising discrimination ability. We deploy our descriptor within both HTC and iPhone mobile phones, testing landmark search in typical areas including Beijing, New York, and Barcelona containing one million images. Our learned descriptor outperforms alternative compact descriptors [Chen et al., 2009][Chen et al., 2010][Chandrasekhar et al., 2009a][Chandrasekhar et al., 2009b] by a large margin.
Exploiting Probabilistic Knowledge under Uncertain Sensing for Efficient Robot Behaviour
Hanheide, Marc (University of Birmingham) | Gretton, Charles (University of Birmingham) | Dearden, Richard W (University of Birmingham) | Hawes, Nick A (University of Birmingham) | Wyatt, Jeremy L (University of Birmingham) | Pronobis, Andrzej (KTH Stockholm) | Aydemir, Alper (KTH Stockholm) | Göbelbecker, Moritz (University of Freiburg) | Zender, Hendrik (DFKI Saarbrücken GmbH)
Robots must perform tasks efficiently and reliably while acting under uncertainty. One way to achieve efficiency is to give the robot common-sense knowledge about the structure of the world. Reliable robot behaviour can be achieved by modelling the uncertainty in the world probabilistically. We present a robot system that combines these two approaches and demonstrate the improvements in efficiency and reliability that result. Our first contribution is a probabilistic relational model integrating common-sense knowledge about the world in general, with observations of a particular environment. Our second contribution is a continual planning system which is able to plan in the large problems posed by that model, by automatically switching between decision-theoretic and classical procedures. We evaluate our system on object search tasks in two different real-world indoor environments. By reasoning about the trade-offs between possible courses of action with different informational effects, and exploiting the cues and general structures of those environments, our robot is able to consistently demonstrate efficient and reliable goal-directed behaviour.
Simulation-Based Data Mining Solution to the Structure of Water Surrounding Proteins
Ho, Bao Tu (Japan Advanced Institute of Science and Technology) | Dam, Chi Hieu (Japan Advanced Institute of Science and Technology) | Sugiyama, Ayumu (Japan Science and Technology Agency)
[...] science. Methods in biophysics only provide qualitative description of the structure and thus clarifying the collective phenomena of a huge number of water molecules is still beyond intuition in biophysics. We introduce a simulation-based data mining approach that quantitatively model the structure of water surrounding a protein as clusters of water molecules having similar moving behavior. The paper presents and explains how the advances of AI technique can potentially solve this challenging data-intensive problem.
It is well known that the three water categories have different functions. Individually bound water has multiple contacts that stabilize the protein structure. Hydration water has heterogeneous dynamical behavior, contributing to protein folding, stability and dynamics, and interacting with the bulk water. Bulk water is free to move and continuously exchanges with hydration water, and indirectly influences on the protein [Bizzarri and Cannistraro, 2002], [Halle, 2004]. Much effort has been devoted to quantitatively model the relative motion (orientation, rotation and velocity) and dynamical properties of individual water molecules in protein [...]
Efficient Searching Top-k Semantic Similar Words
Yang, Zhenglu (The University of Tokyo) | Kitsuregawa, Masaru (The University of Tokyo)
Measuring the semantic similarity between words is an important issue because it is the basis for many applications, such as word sense disambiguation, document summarization, and so forth. Although it has been explored for several decades, most studies focus on improving effectiveness, i.e., precision and recall. In this paper, we propose to address the efficiency issue: given a collection of words, how to efficiently discover the top-k words most semantically similar to the query. This issue is very important for real applications, yet the existing state-of-the-art strategies cannot satisfy users with reasonable performance. Efficient strategies for searching top-k semantically similar words are proposed. We provide an extensive comparative experimental evaluation demonstrating the advantages of the introduced strategies over the state-of-the-art approaches.
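For context, the top-k task itself can be stated as a short baseline that scores every word in the collection against the query and keeps the k best. This is the exhaustive approach that efficient strategies aim to improve on, not the authors' method; the cosine similarity over toy vectors is an assumption made only to have a concrete similarity measure.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, vectors, k):
    """Return the k words most similar to `query`, scanning the whole collection."""
    q = vectors[query]
    scored = ((cosine(q, v), w) for w, v in vectors.items() if w != query)
    # nlargest keeps only k candidates in memory while scanning.
    return [w for _, w in heapq.nlargest(k, scored)]

# Toy vectors (made up for illustration).
vectors = {"cat": (1.0, 0.1), "dog": (0.9, 0.2),
           "car": (0.1, 1.0), "truck": (0.0, 1.0)}
nearest = top_k("cat", vectors, k=2)
```

The scan costs one similarity computation per word for every query, which is exactly the cost that becomes a problem on large collections.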
Mining User Dwell Time for Personalized Web Search Re-Ranking
Xu, Songhua (Oak Ridge National Laboratory) | Jiang, Hao (The University of Hong Kong) | Lau, Francis Chi-Moon (The University of Hong Kong)
We propose a personalized re-ranking algorithm through mining user dwell times derived from a user's previous online reading or browsing activities. We acquire document level user dwell times via a customized web browser, from which we then infer concept word level user dwell times in order to understand a user's personal interest. According to the estimated concept word level user dwell times, our algorithm can estimate a user's potential dwell time over a new document, based on which personalized webpage re-ranking can be carried out. We compare the rankings produced by our algorithm with rankings generated by popular commercial search engines and a recently proposed personalized ranking algorithm. The results clearly show the superiority of our method.
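The pipeline described here (document-level dwell times, then word-level dwell times, then a predicted dwell time used for re-ranking) can be sketched as follows. The averaging rules are assumptions chosen for simplicity: a word's dwell time is taken as the mean dwell time of the documents containing it, and a new document's predicted dwell time as the mean over its known words. The abstract does not specify how the authors perform either inference.

```python
from collections import defaultdict

def word_dwell_times(history):
    """Infer per-word dwell time from (words, seconds) reading records."""
    totals, counts = defaultdict(float), defaultdict(int)
    for words, seconds in history:
        for w in set(words):
            totals[w] += seconds
            counts[w] += 1
    return {w: totals[w] / counts[w] for w in totals}

def predict_dwell(words, word_dwell):
    """Estimate dwell time on an unseen document from its known words."""
    known = [word_dwell[w] for w in set(words) if w in word_dwell]
    return sum(known) / len(known) if known else 0.0

def rerank(results, word_dwell):
    """Sort (doc_id, words) results by predicted dwell time, longest first."""
    return sorted(results, key=lambda doc: predict_dwell(doc[1], word_dwell),
                  reverse=True)

# Toy browsing history (made up): words of each page and seconds spent on it.
history = [(["python", "tutorial"], 120.0),
           (["celebrity", "news"], 5.0),
           (["python", "news"], 60.0)]
word_dwell = word_dwell_times(history)
ranked = rerank([("a", ["celebrity", "gossip"]),
                 ("b", ["python", "guide"])], word_dwell)
```

Here document `b` moves ahead of `a` because the user has historically spent far longer on pages mentioning "python" than on pages mentioning "celebrity".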
Predicting Epidemic Tendency through Search Behavior Analysis
Xu, Danqing (Tsinghua University) | Liu, Yiqun (Tsinghua University) | Zhang, Min (Tsinghua University) | Ma, Shaoping (Tsinghua University) | Cui, Anqi (Tsinghua University) | Ru, Liyun (Tsinghua University)
The possibility that influenza activity can be generally detected through search log analysis has been explored in recent years. However, previous studies have mainly focused on influenza, and little attention has been paid to other epidemics. With an analysis of web user behavior data, we consider the problem of predicting the tendency of hand-foot-and-mouth disease (HFMD), whose outbreak in 2010 resulted in a great panic in China. In addition to search queries, we consider users’ interactions with search engines. Given the collected search logs, we cluster HFMD-related search queries, medical pages and news reports into the following sets: epidemic-related queries (ERQs), epidemic-related pages (ERPs) and epidemic-related news (ERNs). Furthermore, we count their own frequencies as different features, and we conduct a regression analysis with current HFMD occurrences. The experimental results show that these features exhibit good performances on both accuracy and timeliness.
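The regression step can be illustrated with an ordinary least-squares fit of case counts against a single frequency feature. This is a simplified sketch: the abstract describes several features (ERQ, ERP and ERN frequencies) and does not state the regression form, so the one-variable linear model and all numbers below are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up weekly data: epidemic-related query counts and reported cases.
query_counts = [10, 20, 30, 40]
case_counts = [25, 45, 65, 85]
slope, intercept = fit_line(query_counts, case_counts)

# Estimate cases for a week with 50 epidemic-related queries.
estimate = slope * 50 + intercept
```

Extending the sketch to the three feature sets would mean fitting one coefficient per feature, for example with a multiple linear regression.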
Source-Selection-Free Transfer Learning
Xiang, Evan Wei (The Hong Kong University of Science and Technology) | Pan, Sinno Jialin (Institute for Infocomm Research) | Pan, Weike (The Hong Kong University of Science and Technology) | Su, Jian (Institute for Infocomm Research) | Yang, Qiang (The Hong Kong University of Science and Technology)
Transfer learning addresses the problem that labeled training data are insufficient to produce a high-performance model. Typically, given a target learning task, most transfer learning approaches require the designers to select one or more auxiliary tasks as sources. However, how to automatically select the right source data to enable effective knowledge transfer is still an unsolved problem, which limits the applicability of transfer learning. In this paper, we take one step ahead and propose a novel transfer learning framework, known as source-selection-free transfer learning (SSFTL), to free users from the need to select source domains. Instead of asking the users for source and target data pairs, as traditional transfer learning does, SSFTL turns to online information sources such as the World Wide Web or Wikipedia for help. The source data for transfer learning can be hidden somewhere within this large online information source, but the users do not know where they are. Based on the online information sources, we train a large number of classifiers. Then, given a target task, SSFTL builds a bridge between the labels of the potential source candidates and the target domain data, using large online social media with tag clouds as a label translator. An added advantage of SSFTL is that, unlike many previous transfer learning approaches, which are difficult to scale up to the Web scale, SSFTL is highly scalable and can offload much of the training work to an offline stage. We demonstrate the effectiveness and efficiency of SSFTL through extensive experiments on several real-world datasets in text classification.