Discovering Spammers in Social Networks

AAAI Conferences

As social media such as Twitter, Facebook and China's Renren grow in popularity, spamming activities have also increased in number and variety. On social network sites, spammers often disguise themselves by creating fake accounts and hijacking normal users' accounts for personal gain. Unlike spammers in traditional systems such as SMS and email, spammers in social media behave like normal users and continually change their spamming strategies to fool anti-spam systems. However, due to privacy and resource concerns, many social media websites cannot fully monitor all user content, which makes many previous approaches, such as topology-based and content-classification-based methods, infeasible. In this paper, we propose a novel method for spammer detection in social networks that exploits both social activities and users' social relations in an innovative and highly scalable manner. The proposed method detects spammers following collective activities based on users' social actions and relations. We have empirically tested our method on data from Renren.com, which is the largest social network in China, and demonstrated that our new method can improve the detection performance significantly.


A Mouse-Trajectory Based Model for Predicting Query-URL Relevance

AAAI Conferences

For the learning-to-rank algorithms used in commercial search engines, a conventional way to generate training examples is to employ professional annotators to label the relevance of query-URL pairs. Since label quality depends to a large extent on the expertise of the annotators, this process is time-consuming and labor-intensive. Automatically generating labels from click-through data has been well studied and shown to achieve performance comparable to or better than that of human judges. Click-through data present users’ actions and imply their satisfaction with search results, but exclude the interactions between users and search results beyond the page-view level (e.g., eye and mouse movements). This paper proposes a novel approach that jointly considers the information underlying mouse trajectories and click-through data so as to describe user behaviors more objectively and achieve a better understanding of the user experience. By integrating multi-source data, the proposed approach reveals that the relevance labels of query-URL pairs are related to the positions of URLs and to users’ behavioral features. Based on these correlations, query-URL pairs can be labeled more accurately and search results become more satisfactory to users. Experiments conducted on the most popular Chinese commercial search engine (Baidu) validated the rationale of our research motivation and showed that the proposed approach outperformed the state-of-the-art methods.


Multinomial Relation Prediction in Social Data: A Dimension Reduction Approach

AAAI Conferences

The recent popularization of social web services has made them one of the primary uses of the World Wide Web. An important concept in social web services is that of social actions, such as making connections, communicating with others, and adding annotations to web resources. Predicting social actions would improve many fundamental web applications, such as recommendations and web searches. One remarkable characteristic of social actions is that they involve multiple and heterogeneous objects such as users, documents, keywords, and locations. However, the high-dimensional nature of such multinomial relations poses one fundamental challenge, namely, predicting multinomial relations with only a limited amount of data. In this paper, we propose a new multinomial relation prediction method that is robust to data sparsity. We transform each instance of a multinomial relation into a set of binomial relations between the objects and the multinomial relation of the involved objects. We then apply an extension of a low-dimensional embedding technique to these binomial relations, which results in a generalized eigenvalue problem guaranteeing globally optimal solutions. We also incorporate attribute information as side information to address the “cold start” problem in multinomial relation prediction. Experiments with various real-world social web service datasets demonstrate that the proposed method is more robust against data sparseness than several existing methods, which can only find sub-optimal solutions.


Dynamically Switching between Synergistic Workflows for Crowdsourcing

AAAI Conferences

To ensure quality results from unreliable crowdsourced workers, task designers often construct complex workflows and aggregate worker responses from redundant runs. Frequently, they experiment with several alternative workflows to accomplish the task, and eventually deploy the one that achieves the best performance during early trials. Surprisingly, this seemingly natural design paradigm does not achieve the full potential of crowdsourcing. In particular, using a single workflow (even the best) to accomplish a task is suboptimal. We show that alternative workflows can compose synergistically to yield much higher quality output. We formalize the insight with a novel probabilistic graphical model. Based on this model, we design and implement AGENTHUNT, a POMDP-based controller that dynamically switches between these workflows to achieve higher returns on investment. Additionally, we design offline and online methods for learning model parameters. Live experiments on Amazon Mechanical Turk demonstrate the superiority of AGENTHUNT for the task of generating NLP training data, yielding up to 50% error reduction and greater net utility compared to previous methods.


A Convex Formulation for Learning from Crowds

AAAI Conferences

Crowdsourcing services are now often used to collect large amounts of labeled data for machine learning, since they provide an easy way to obtain labels at very low cost and in a short period. The use of crowdsourcing has introduced a new challenge in machine learning, namely, coping with the variable quality of crowd-generated data. Although there have been many recent attempts to address the quality problem of multiple workers, only a few of the existing methods consider the problem of learning classifiers directly from such noisy data. All these methods modeled the true labels as latent variables, which resulted in non-convex optimization problems. In this paper, we propose a convex optimization formulation for learning from crowds without estimating the true labels by introducing personal models of the individual crowd workers. We also devise an efficient iterative method for solving the convex optimization problems by exploiting conditional independence structures in multiple classifiers. We evaluate the proposed method against three competing methods on synthetic data sets and a real crowdsourced data set and demonstrate that the proposed method outperforms the other three methods.
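
To make the idea of personal worker models concrete, here is a minimal sketch under stated assumptions. It assumes one possible convex objective: every worker j gets a logistic-regression weight vector w_j, and a quadratic penalty pulls each w_j toward a shared base vector w0. The objective, the names (`fit_personal_models`, `lam`, `eta`), the toy data, and the use of plain gradient descent are all illustrative assumptions, not the formulation or the iterative solver described in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_personal_models(X, labels, lam=1.0, eta=0.1, lr=0.1, steps=500):
    """Gradient descent on an ASSUMED convex objective:
         sum_{j,i} logloss(y_ij, w_j . x_i)
         + lam/2 * sum_j ||w_j - w0||^2 + eta/2 * ||w0||^2
    X: list of feature vectors.
    labels[j]: dict {instance index: label in {0, 1}} given by worker j.
    Returns (w0, [w_1, ..., w_J])."""
    d = len(X[0])
    w0 = [0.0] * d
    W = [[0.0] * d for _ in labels]
    for _ in range(steps):
        g0 = [eta * w0[k] for k in range(d)]          # gradient w.r.t. w0
        new_W = []
        for j, lab in enumerate(labels):
            # coupling term pulls w_j toward w0 (and w0 toward w_j)
            gj = [lam * (W[j][k] - w0[k]) for k in range(d)]
            for k in range(d):
                g0[k] -= lam * (W[j][k] - w0[k])
            # logistic-loss gradient on the labels this worker provided
            for i, y in lab.items():
                err = sigmoid(dot(W[j], X[i])) - y
                for k in range(d):
                    gj[k] += err * X[i][k]
            new_W.append([W[j][k] - lr * gj[k] for k in range(d)])
        W = new_W
        w0 = [w0[k] - lr * g0[k] for k in range(d)]
    return w0, W

def predict(w, x):
    """Predicted label (0 or 1) of a linear model w on features x."""
    return 1 if dot(w, x) >= 0 else 0

# Toy data (hypothetical): first feature is a constant bias term.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
labels = [
    {0: 1, 1: 1, 2: 0, 3: 0},   # reliable worker
    {0: 1, 1: 1, 2: 0, 3: 0},   # reliable worker
    {0: 1, 1: 0, 2: 0, 3: 0},   # worker who mislabels instance 1
]
w0, W = fit_personal_models(X, labels)
```

In this sketch no true label is ever estimated: the shared vector `w0` serves as the final classifier, and each `W[j]` absorbs the individual worker's deviations from it.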


ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback

AAAI Conferences

During broadcast events such as the Superbowl, the U.S. Presidential and Primary debates, etc., Twitter has become the de facto platform for crowds to share perspectives and commentaries about them. Given an event and an associated large-scale collection of tweets, there are two fundamental research problems that have been receiving increasing attention in recent years. One is to extract the topics covered by the event and the tweets; the other is to segment the event. So far these problems have been viewed separately and studied in isolation. In this work, we argue that these problems are in fact inter-dependent and should be addressed together. We develop a joint Bayesian model that performs topic modeling and event segmentation in one unified framework. We evaluate the proposed model both quantitatively and qualitatively on two large-scale tweet datasets associated with two events from different domains to show that it improves significantly over baseline models.


Building Contextual Anchor Text Representation using Graph Regularization

AAAI Conferences

Anchor texts are useful complementary descriptions of target pages and are widely applied to improve search relevance. The benefits come from the additional information introduced into the document representation and from intelligent ways of estimating the relative importance of anchor texts. Previous work on anchor importance estimation treated anchor text independently without considering its context. As a result, the lack of constraints from such context fails to guarantee a stable anchor text representation. We propose an anchor graph regularization approach to incorporate constraints from such context into the anchor text weighting process, casting the task as a convex quadratic optimization problem. The constraints are drawn from estimates of anchor-anchor, anchor-page, and page-page similarity. Given any estimator, our approach operates as a post-process that refines the estimated anchor weights, making it a plug-and-play component in search infrastructure. Comparative experiments on standard data sets (TREC 2009 and 2010) demonstrate the efficacy of our approach.
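
The post-processing idea in the abstract above can be sketched with a generic graph-regularized smoothing step. The objective below (a squared distance to the initial weights plus a similarity-weighted smoothness penalty), the function name `refine_weights`, and the parameter `mu` are assumptions chosen for illustration; the paper's actual objective and constraints may differ.

```python
def refine_weights(initial, similarity, mu=1.0, iterations=100):
    """Refine anchor weights by minimizing an ASSUMED convex quadratic:
         sum_i (w_i - initial_i)^2 + mu * sum_{i<j} s_ij * (w_i - w_j)^2
    using Jacobi iteration on the optimality conditions.
    initial: weights from any upstream estimator.
    similarity: symmetric matrix of non-negative anchor-anchor similarities."""
    n = len(initial)
    w = list(initial)
    for _ in range(iterations):
        new_w = []
        for i in range(n):
            neighbor_sum = sum(similarity[i][j] * w[j] for j in range(n) if j != i)
            degree = sum(similarity[i][j] for j in range(n) if j != i)
            # closed-form update for w_i with all other weights held fixed
            new_w.append((initial[i] + mu * neighbor_sum) / (1.0 + mu * degree))
        w = new_w
    return w

# Two similar anchors whose initial weights disagree strongly (toy values).
refined = refine_weights([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

For this toy input the refined weights move toward each other, settling at 2/3 and 1/3, which illustrates how context from similar anchors stabilizes an estimate without replacing the upstream estimator.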


Fused Matrix Factorization with Geographical and Social Influence in Location-Based Social Networks

AAAI Conferences

Recently, location-based social networks (LBSNs), such as Gowalla, Foursquare, Facebook, and Brightkite, have attracted millions of users to share their social friendships and their locations via check-ins. The available check-in information makes it possible to mine users’ preferences for locations and to provide favorite recommendations. Personalized point-of-interest (POI) recommendation is a significant task in LBSNs since it can help targeted users explore their surroundings as well as help third-party developers provide personalized services. To solve this task, matrix factorization is a promising tool due to its success in recommender systems. However, previously proposed matrix factorization (MF) methods do not explore geographical influence, e.g., the multi-center check-in property, which yields suboptimal solutions for the recommendation. In this paper, to the best of our knowledge, we are the first to fuse MF with geographical and social influence for POI recommendation in LBSNs. We first capture the geographical influence by modeling the probability of a user’s check-in at a location with a Multi-center Gaussian Model (MGM). Next, we include social information and fuse the geographical influence into a generalized matrix factorization framework. Our solution to POI recommendation is efficient and scales linearly with the number of observations. Finally, we conduct thorough experiments on a large-scale real-world LBSN dataset and demonstrate that the fused matrix factorization framework with MGM makes effective use of the distance information and outperforms other state-of-the-art methods significantly.
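
A rough sketch of the multi-center idea follows. It assumes that each of a user's activity centers is an isotropic 2-D Gaussian weighted by the share of check-ins it received, and that the geographical score is fused with a matrix factorization score by simple multiplication. The exact MGM formula and fusion rule are not given in the abstract, so these choices, along with the names `gaussian_2d`, `mgm_score`, and `fused_score`, are hypothetical.

```python
import math

def gaussian_2d(point, center, sigma):
    """Isotropic 2-D Gaussian density (an assumed simplification)."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / (
        2.0 * math.pi * sigma * sigma
    )

def mgm_score(point, centers):
    """Geographical score of a location for one user.
    centers: list of (center_xy, sigma, checkin_count) tuples.
    Each center contributes its Gaussian density, weighted by the
    fraction of the user's check-ins that fall in that center."""
    total = sum(count for _, _, count in centers)
    return sum(
        (count / total) * gaussian_2d(point, mu, sigma)
        for mu, sigma, count in centers
    )

def fused_score(mf_score, geo_score):
    """Assumed fusion rule: product of preference and geographical scores."""
    return mf_score * geo_score

# A user with a heavily used center and a lightly used one (toy coordinates).
centers = [((0.0, 0.0), 1.0, 8), ((10.0, 10.0), 1.0, 2)]
near_home = mgm_score((0.0, 0.0), centers)
near_second = mgm_score((10.0, 10.0), centers)
far_away = mgm_score((5.0, 5.0), centers)
```

With these toy centers, a location at the heavily used center scores highest, the lightly used center scores lower, and a point far from both scores close to zero, so distance information can down-weight candidates that a pure MF score would rank highly.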


Automated Inference System for End-To-End Diagnosis of Network Performance Issues in Client-Terminal Devices

arXiv.org Artificial Intelligence

Traditional network diagnosis methods for Client-Terminal Device (CTD) problems tend to be labor-intensive and time-consuming, and contribute to increased customer dissatisfaction. In this paper, we propose an automated solution for rapidly diagnosing the root causes of network performance issues in CTDs. Based on a new intelligent inference technique, we create the Intelligent Automated Client Diagnostic (IACD) system, which relies only on the collection of Transmission Control Protocol (TCP) packet traces. Using soft-margin Support Vector Machine (SVM) classifiers, the system (i) distinguishes link problems from client problems and (ii) identifies characteristics unique to the specific fault to report the root cause. The modular design of the system enables support for new access link and fault types. Experimental evaluation demonstrated the capability of the IACD system to distinguish between faulty and healthy links and to diagnose the client faults with 98% accuracy. The system can perform fault diagnosis independent of the user's specific TCP implementation, enabling diagnosis of a diverse range of client devices.
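
The two-stage use of soft-margin SVM classifiers described above can be illustrated with a small cascade. This is a sketch only: the two numeric features, the fault labels ("client fault A" and "client fault B"), the training points, and the subgradient-descent trainer are all hypothetical stand-ins, since the abstract does not specify the features extracted from TCP traces or the solver used.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Soft-margin linear SVM trained by full-batch subgradient descent on
         lam/2 * ||w||^2 + mean_i max(0, 1 - y_i * (w . x_i + b)),
    with labels y_i in {-1, +1}. Returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wk for wk in w]
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:                      # point violates the margin
                for k in range(d):
                    gw[k] -= yi * xi[k] / n
                gb -= yi / n
        w = [wk - lr * g for wk, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def svm_sign(model, x):
    w, b = model
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1

# Stage 1 (hypothetical features): +1 = link problem, -1 = client problem.
stage1 = train_linear_svm(
    [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]], [1, 1, -1, -1]
)
# Stage 2, trained on client-problem traces only: +1 = fault A, -1 = fault B.
stage2 = train_linear_svm(
    [[0.05, 0.3], [0.1, 0.35], [0.3, 0.05], [0.35, 0.1]], [1, 1, -1, -1]
)

def diagnose(x):
    """Cascade: first separate link from client problems, then name the fault."""
    if svm_sign(stage1, x) == 1:
        return "link problem"
    return "client fault A" if svm_sign(stage2, x) == 1 else "client fault B"
```

Keeping each stage as an independent classifier mirrors the modular design mentioned in the abstract: supporting a new fault type would mean training another second-stage model rather than retraining the whole cascade.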


Hybrid Grey Interval Relation Decision-Making in Artistic Talent Evaluation of Player

arXiv.org Artificial Intelligence

The multiple attribute decision-making (MADM) problems are among the most interesting problems for many decision-making experts. These problems arise in various fields of real life and constitute very important content in scientific research areas such as management science, decision-making theory, system theory, operational research and economics. Many effective methods to determine the attributive weights have been studied for MADM. These include subjective weight determining methods such as the feature vector method (Saaty T.L., 1977), the least square sum method (Chu A Tw, Kalaba R E, Spingarn K, 1979), and the Delphi and AHP methods (Hwang C.L., Lin M, 1987), and objective weight determining methods such as the entropy method (Hwang C.L., Yoon K, 1981), principal component analysis (Yan Jian-huo, 1989) and DEA (Data Envelopment Analysis) (Ye Chen, Kevin W. Li, Haiyan Xu and Sifeng Liu, 2009). The final ranking method greatly affects the decision-making process. Hwang and Yoon (1981) proposed a new approach, TOPSIS (Technique for Order Preference by Similarity to Ideal Solution), for solving MADM problems. Recently, TOPSIS methods with interval weights (Gao Feng-ji, et al, 2005) and multiple attribute interval number TOPSIS (Chu A Tw, Kalaba R E, Spingarn K, 1979) have been studied. Guo Kai-hong and Mu You-jing (2012) studied the relation between several possibility degree formulas and proposed a possibility degree matrices-based method that aimed to objectively determine the weights of criteria in MADM with intervals. A hybrid approach integrating OWA (Ordered Weighted Averaging) aggregation into TOPSIS is proposed to tackle

* This work was supported in part by Nanjing University of Aeronautics and Astronautics, China.