Not enough data to create a plot.
Try a different view from the menu above.
Information Technology
Sparse Matrix-Variate t Process Blockmodels
Xu, Zenglin (Purdue University) | Yan, Feng (Purdue University) | Qi, Yuan (Purdue University)
We consider the problem of modeling network interactions and identifying latent groups of network nodes. This problem is challenging due to the facts i) that the network nodes are interdependent instead of independent, ii) that the network data are very noisy (e.g., missing edges), and iii) that the network interactions are often sparse. To address these challenges, we propose a Sparse Matrix-variate t process Blockmodel (SMTB). In particular, we generalize a matrix-variate t distribution to a t process on matrices with nonlinear covariance functions. Due to this generalization, our model can estimate latent memberships for individual network nodes. This separates our model from previous t distribution based relational models. Also, we introduce sparse prior distributions on the latent membership parameters to select group assignments for individual nodes. To learn the model efficiently from data, we develop a variational method. When compared with several state-of-the-art models, including the predictive matrix-variate t models and mixed membership stochastic blockmodels, our model achieved improved prediction accuracy on real world network datasets.
User-Controllable Learning of Location Privacy Policies With Gaussian Mixture Models
Cranshaw, Justin (Carnegie Mellon University) | Mugan, Jonathan (Carnegie Mellon University) | Sadeh, Norman (Carnegie Mellon University)
With smart-phones becoming increasingly commonplace, there has been a subsequent surge in applications that continuously track the location of users. However, serious privacy concerns arise as people start to widely adopt these applications. Users will need to maintain policies to determine under which circumstances to share their location. Specifying these policies however, is a cumbersome task, suggesting that machine learning might be helpful. In this paper, we present a user-controllable method for learning location sharing policies. We use a classifier based on multivariate Gaussian mixtures that is suitably modified so as to restrict the evolution of the underlying policy to favor incremental and therefore human-understandable changes as new data arrives. We evaluate the model on real location-sharing policies collected from a live location-sharing social network, and we show that our method can learn policies in a user-controllable setting that are just as accurate as policies that do not evolve incrementally. Additionally, we highlight the strength of the generative modeling approach we take, by showing how our model easily extends to the semi-supervised setting.
Ensemble Classification for Relational Domains
Eldardiry, Hoda (Purdue University)
Ensemble classification methods have been shown to produce more accurate predictions than the base component models. Due to their effectiveness, ensemble approaches have been applied in a wide range of domains to improve classification. The expected prediction error of classification models can be decomposed into bias and variance. Ensemble methods that independently construct component models (e.g., bagging) can improve performance by reducing the error due to variance, while methods that dependently construct component models (e.g., boosting) can improve performance by reducing the error due to bias and variance. Although ensemble methods were initially developed for classification of independent and identically distributed (i.i.d.) data, they can be directly applied for relational data by using a relational classifier as the base component model. This straightforward approach can improve classification for network data, but suffers from a number of limitations. First, relational data characteristics will only be exploited by the base relational classifier, and not by the ensemble algorithm itself. We note that explicitly accounting for the structured nature of relational data by the ensemble mechanism can significantly improve ensemble classification. Second, ensemble learning methods that assume i.i.d. data can fail to preserve the relational structure of non-i.i.d. data, which will (1) prevent the relational base classifiers from exploiting these structures, and (2) fail to accurately capture properties of the dataset, which can lead to inaccurate models and classifications. Third, ensemble mechanisms that assume i.i.d. data are limited to reducing errors associated with i.i.d. models and fail to reduce additional sources of error associated with more powerful (e.g., collective classification models. Our key observation is that collective classification methods have error due to variance in inference. This has been overlooked by current ensemble methods that assume exact inference methods and only focus on the typical goal of reducing errors due to learning, even if the methods explicitly consider relational data. Here we study the problem of ensemble classification for relational domains by focusing on the reduction of error due to variance. We propose a relational ensemble framework that explicitly accounts for the structured nature of relational data during both learning and inference. Our proposed framework consists of two components. (1) A method for learning accurate ensembles from relational data, focusing on the reduction of error due to variance in learning, while preserving the relational characteristics in the data. (2) A method for applying ensembles in collective classification contexts, focusing on further reduction of the error due to variance in inference, which has not been considered in state of the art ensemble methods.
The Party Is Over Here: Structure and Content in the 2010 Election
Livne, Avishay (The University of Michigan) | Simmons, Matthew (The University of Michigan) | Adar, Eytan (The University of Michigan) | Adamic, Lada (The University of Michigan)
In this work, we study the use of Twitter by House, Senate and gubernatorial candidates during the midterm (2010) elections in the U.S. Our data includes almost 700 candidates and over 690k documents that they produced and cited in the 3.5 years leading to the elections. We utilize graph and text mining techniques to analyze differences between Democrats, Republicans and Tea Party candidates, and suggest a novel use of language modeling for estimating content cohesiveness. Our findings show significant differences in the usage patterns of social media, and suggest conservative candidates used this medium more effectively, conveying a coherent message and maintaining a dense graph of connections. Despite the lack of party leadership, we find Tea Party members display both structural and language-based cohesiveness. Finally, we investigate the relation between network structure, content and election results by creating a proof-of-concept model that predicts candidate victory with an accuracy of 88.0%.
Why do People Retweet? Anti-Homophily Wins the Day!
Macskassy, Sofus A. ( Fetch Technologies ) | Michelson, Matthew (Fetch Technologies)
Twitter and other microblogs have rapidly become a significant means by which people communicate with the world and each other in near realtime. There has been a large number of studies surrounding these social media, focusing on areas such as information spread, various centrality measures, topic detection and more. However, one area which has not received much attention is trying to better understand what information is being spread and why it is being spread. This work looks to get a better understanding of what makes people spread information in tweets or microblogs through the use of retweeting. Several retweet behavior models are presented and evaluated on a Twitter data set consisting of over 768,000 tweets gathered from monitoring over 30,000 users for a period of one month. We evaluate the proposed models against each user and show how people use different retweet behavior models. For example, we find that although users in the majority of cases do not retweet information on topics that they themselves Tweet about as or from people who are "like them" (hence anti-homophily), we do find that models which do take homophily, or similarity, into account fits the observed retweet behaviors much better than other more general models which do not take this into account. We further find that, not surprisingly, people's retweeting behavior is better explained through multiple different models rather than one model.
Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter
Lee, Kyumin (Texas A&M University) | Eoff, Brian David (Texas A&M University) | Caverlee, James (Texas A&M University)
The rise in popularity of social networking sites such as Twitter and Facebook has been paralleled by the rise of unwanted, disruptive entities on these networks- — including spammers, malware disseminators, and other content polluters. Inspired by sociologists working to ensure the success of commons and criminologists focused on deterring vandalism and preventing crime, we present the first long-term study of social honeypots for tempting, profiling, and filtering content polluters in social media. Concretely, we report on our experiences via a seven-month deployment of 60 honeypots on Twitter that resulted in the harvesting of 36,000 candidate content polluters. As part of our study, we (1) examine the harvested Twitter users, including an analysis of link payloads, user behavior over time, and followers/following network dynamics and (2) evaluate a wide range of features to investigate the effectiveness of automatic content polluter identification.
A Machine Learning Approach to Twitter User Classification
Pennacchiotti, Marco (Yahoo! Labs) | Popescu, Ana-Maria (Yahoo! Labs)
This paper addresses the task of user classification in social media, with an application to Twitter. We automatically infer the values of user attributes such as political orientation or ethnicity by leveraging observable information such as the user behavior, network structure and the linguistic content of the user’s Twitter feed. We employ a machine learning approach which relies on a comprehensive set of features derived from such user information. We report encouraging experimental results on 3 tasks with different characteristics: political affiliation detection, ethnicity identification and detecting affinity for a particular business. Finally, our analysis shows that rich linguistic features prove consistently valuable across the 3 tasks and show great promise for additional user classification needs.
Exploiting Semantic Annotations for Clustering Geographic Areas and Users in Location-based Social Networks
Noulas, Anastasios (University of Cambridge) | Scellato, Salvatore (University of Cambridge) | Mascolo, Cecilia (University of Cambridge) | Pontil, Massimiliano (University College London)
Location-Based Social Networks (LBSN) present so far the most vivid realization of the convergence of the physical and virtual social planes. In this work we propose a novel approach on modeling human activity and geographical areas by means of place categories. We apply a spectral clustering algorithm on areas and users of two metropolitan cities on a dataset sourced from the most vibrant LBSN, Foursquare. Our methodology allows the identification of user communities that visit similar categories of places and the comparison of urban neighborhoods within and across cities. We demonstrate how semantic information attached to places could be plausibly used as a modeling interface for applications such as recommender systems and digital tourist guides.
Improving Text Clustering with Social Tagging
Ares, M. Eduardo (University of A Coruña) | Parapar, Javier (University of A Coruña) | Barreiro, Álvaro (University of A Coruña)
Another important question is the absoluteness of the constraints. Lately several web-based tagging systems such as Technorati, Even if we use this approach to turn tags into constraints, Flickr or Delicious have become very popular. In this a fair amount of them are bound to be inaccurate paper we will exploit the information created by the community (i.e., linking documents which should not be in the same in Delicious: a social bookmarking service where cluster) until a high value of the parameter t, due to the polysemy the users can save the URLs of their favourite webpages of the terms used as tags or to differences in the criteria offering also the possibility of associating tags to them. of the taggers. Consequently, we have used soft positive On the other hand the clustering methods are a very important constraints, meaning that the documents affected by one of data mining tool in order to exploit the knowledge them are likely to be in the same cluster, without forcing the present in data collections. In the last years a new family of clustering algorithm to actually put them so.
Dimensions of Self-Expression in Facebook Status Updates
Kramer, Adam D. I. (Facebook, Inc.) | Chung, Cindy K. (The University of Texas at Austin)
We describe the dimensions along which Facebook users tend to express themselves via status updates using the semi-automated text analysis approach, the Meaning Extraction Method (MEM). First, we examined dimensions of self-expression in all status updates from a sample of four million Facebook users from four English-speaking countries (the United States, Canada, the United Kingdom, and Australia) in order to examine how these countries vary in their self-expressions. All four countries showed a basic three-component structure, indicating that the medium is a stronger influence than country characteristics or demographics on how people use Facebook status updates. In each country, people vary in terms of the extent to which they use Informal Speech, share Positive Events, and discuss School in their Facebook status updates. Together, these factors tell us how users differ in their self-expression, and thus illustrate meaningful use cases for the product: Talking about what’s going on tends to be positive, and people vary in terms of the extent to which their status updates are short, slangy emotional expressions and topics regarding school. The specific words that define these factors showed subtle differences across countries: The use of profanity indicates fewer school words (but only in Australia), whereas the UK shows greater use of slang terms (rather than profanity) when speaking informally. The MEM also identified English-language dialects as a meaningful dimension along which the countries varied. In sum, beyond simply indicating topicality of posts, this study provides insight into how status updates are used for self-expression. We discuss several theoretical frameworks that could produce these results, and more broadly discuss the generation of theoretical frameworks from wholly empirical data (such as naturalistic Internet speech) using the MEM.