Statistical Learning
What Catches Your Attention? An Empirical Study of Attention Patterns in Community Forums
Wagner, Claudia (Institute of Information and Communication Technologies Joanneum Research) | Rowe, Matthew (Knowledge Media Institute The Open University) | Strohmaier, Markus (Knowledge Management Institute Graz University of Technology) | Alani, Harith (Knowledge Media Institute The Open University)
Online community managers work towards building and managing communities around a given brand or topic. A risk such managers face is that their community may die out and its utility to users diminish. Understanding what drives attention to content and the dynamics of discussions in a given community informs the community manager and/or host of the factors that are associated with attention. In this paper we gain insights into the idiosyncrasies that individual community forums exhibit in their attention patterns and how the factors that impact activity differ. We glean such insights by using logistic regression models for identifying seed posts and explore the effectiveness of a range of features. Our findings show that the discussion behaviour of different communities is clearly impacted by different factors.
Trust Propagation with Mixed-Effects Models
Overgoor, Jan (Stanford University) | Wulczyn, Ellery (Stanford University) | Potts, Christopher (Stanford University)
Web-based social networks typically use public trust systems to facilitate interactions between strangers. These systems can be corrupted by misleading information spread under the cover of anonymity, or exhibit a strong bias towards positive feedback, originating from the fear of reciprocity. Trust propagation algorithms seek to overcome these shortcomings by inferring trust ratings between strangers from trust ratings between acquaintances and the structure of the network that connects them. We investigate a trust propagation algorithm that is based on user triads where the trust one user has in another is predicted based on an intermediary user. The propagation function can be applied iteratively to propagate trust along paths between a source user and a target user. We evaluate this approach using the trust network of the CouchSurfing community, which consists of 7.6M trust-valued edges between 1.1M users. We show that our model outperforms one that relies only on the trustworthiness of the target user (a kind of public trust system). In addition, we show that performance is significantly improved by bringing in user-level variability using mixed-effects regression models.
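As an illustration of the iterative propagation described above, here is a minimal Python sketch. It assumes a linear triad function with made-up coefficients; the names `propagate`, `propagate_along_path`, and `DEFAULT_COEFFS` are hypothetical, and the abstract does not specify the functional form the authors actually fit.

```python
from functools import reduce

# Hypothetical coefficients: (intercept, weight on A->B trust, weight on B->C trust).
# In a real setting these would be estimated by regression on observed triads.
DEFAULT_COEFFS = (0.0, 0.5, 0.5)


def propagate(trust_ab, trust_bc, coeffs=DEFAULT_COEFFS):
    """Predict A's trust in C from A's trust in intermediary B and B's trust in C."""
    intercept, w_ab, w_bc = coeffs
    return intercept + w_ab * trust_ab + w_bc * trust_bc


def propagate_along_path(edge_trusts, coeffs=DEFAULT_COEFFS):
    """Fold the triad function over the trust values on a source-to-target path."""
    if not edge_trusts:
        raise ValueError("path must contain at least one edge")
    # The running estimate plays the role of the source's trust in the
    # current intermediary; each step consumes the next edge on the path.
    return reduce(lambda acc, t: propagate(acc, t, coeffs), edge_trusts)
```

Calling `propagate_along_path` with the trust values on the edges of a path yields the inferred trust of the source in the target; a mixed-effects variant would additionally let the coefficients vary per user.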
Evolutionary Clustering and Analysis of User Behaviour in Online Forums
Morrison, Donn (Digital Enterprise Research Institute) | McLoughlin, Ian (Digital Enterprise Research Institute) | Hogan, Alice (Digital Enterprise Research Institute) | Hayes, Conor (Digital Enterprise Research Institute)
In this paper we cluster and analyse temporal user behaviour in online communities. We adapt a simple unsupervised clustering algorithm to an evolutionary setting where we cluster users into prototypical behavioural roles based on features derived from their ego-centric reply-graphs. We then analyse changes in the role membership of the users over time, the change in role composition of forums over time and examine the differences between forums in terms of role composition. We perform this analysis on 200 forums from a popular national bulletin board and 14 enterprise technical support forums.
More of a Receiver Than a Giver: Why Do People Unfollow in Twitter?
Kwak, Haewoon (Telefonica Research) | Moon, Sue (KAIST) | Lee, Wonjae (KAIST)
We propose a logistic regression model taking into account two analytically different sets of factors: structure and action. The factors include individual, dyadic, and triadic properties between ego and alter whose tie breakup is under consideration. From the model fitted to large-scale data, we find that 5 structural and 7 actional variables have significant explanatory power for unfollowing. One unique finding from our quantitative analysis is that people appreciate receiving acknowledgements from others even in virtually unilateral communication relationships and are less likely to unfollow them: people are more of a receiver than a giver.
Weblog Analysis for Predicting Correlations in Stock Price Evolutions
Kharratzadeh, Milad (McGill University) | Coates, Mark (McGill University)
We use data extracted from many weblogs to identify the underlying relations of a set of companies in the Standard & Poor's (S&P) 500 index. We define a pairwise similarity measure for the companies based on the weblog articles and then apply a graph clustering procedure. We show that it is possible to capture some interesting relations between companies using this method. As an application of this clustering procedure we propose a cluster-based portfolio selection method which combines information from the weblog data and historical stock prices. Through simulation experiments, we show that our method performs better (in terms of risk measures) than cluster-based portfolio strategies based on company sectors or historical stock prices. This suggests that the methodology has the potential to identify groups of companies whose stock prices are more likely to be correlated in the future.
Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors
Zamal, Faiyaz Al (McGill University) | Liu, Wendy (McGill University) | Ruths, Derek (McGill University)
In this paper, we extend existing work on latent attribute inference by leveraging the principle of homophily: we evaluate the inference accuracy gained by augmenting the user features with features derived from the Twitter profiles and postings of her friends. We consider three attributes which have varying degrees of assortativity: gender, age, and political affiliation. Our approach yields a significant and robust increase in accuracy for both age and political affiliation, indicating that our approach boosts performance for attributes with moderate to high assortativity. Furthermore, different neighborhood subsets yielded optimal performance for different attributes, suggesting that different subsamples of the user's neighborhood characterize different aspects of the user herself. Finally, inferences using only the features of a user's neighbors outperformed those based on the user's features alone. This suggests that the neighborhood context alone carries substantial information about the user.
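One simple way to picture the feature augmentation described above is to append an aggregate of the neighbors' feature vectors to the user's own vector. The Python sketch below assumes column-wise averaging as the aggregate; this choice and the function names are illustrative assumptions, not the authors' stated method.

```python
def mean_vector(vectors, dim):
    """Column-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def augment_with_neighbors(user_features, neighbor_features):
    """Concatenate a user's features with the mean of the neighbors' features."""
    dim = len(user_features)
    return list(user_features) + mean_vector(neighbor_features, dim)
```

Passing only a chosen subset of neighbors to `augment_with_neighbors` corresponds to the idea of comparing different neighborhood subsets per attribute.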
Automatic Versus Human Navigation in Information Networks
West, Robert (Stanford University) | Leskovec, Jure (Stanford University)
People regularly face tasks that can be understood as navigation in information networks, where the goal is to find a path between two given nodes. In many such situations, the navigator only gets local access to the node currently under inspection and its immediate neighbors. This lack of global information about the network notwithstanding, humans tend to be good at finding short paths, despite the fact that real-world networks are typically very large. One potential reason for this could be that humans possess vast amounts of background knowledge about the world, which they leverage to make good guesses about possible solutions. In this paper we ask the question: Are human-like high-level reasoning skills really necessary for finding short paths? To answer this question, we design a number of navigation agents without such skills, which use only simple numerical features. We evaluate the agents on the task of navigating Wikipedia, a domain for which we also possess large-scale human navigation data. We observe that the agents find shorter paths than humans on average and therefore conclude that, perhaps surprisingly, no sophisticated background knowledge or high-level reasoning is required for navigating the complex Wikipedia network.
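To make the notion of a navigation agent with only local access concrete, the following Python sketch implements a greedy walker. The adjacency-dict graph representation, the `score` callback (standing in for any simple numerical feature), and the step limit are all assumptions for illustration; the abstract does not describe the agents at this level of detail.

```python
def greedy_navigate(graph, start, target, score, max_steps=50):
    """Walk from start towards target, seeing only the current node's neighbors.

    graph: dict mapping each node to a list of its neighbors.
    score: function (candidate, target) -> number; lower means "looks closer".
    Returns the path as a list of nodes, or None if the agent gets stuck
    or exceeds max_steps.
    """
    path = [start]
    visited = {start}
    current = start
    while current != target and len(path) <= max_steps:
        # Only the immediate, not-yet-visited neighbors are visible.
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if not candidates:
            return None
        current = min(candidates, key=lambda n: score(n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None
```

Comparing the length of the returned path with the length of a human path between the same pair of nodes gives the kind of evaluation the abstract describes.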
Modeling Spread of Disease from Social Interactions
Sadilek, Adam (University of Rochester) | Kautz, Henry (University of Rochester) | Silenzio, Vincent (University of Rochester)
Research in computational epidemiology to date has concentrated on coarse-grained statistical analysis of populations, often synthetic ones. By contrast, this paper focuses on fine-grained modeling of the spread of infectious diseases throughout a large real-world social network. Specifically, we study the roles that social ties and interactions between specific individuals play in the progress of a contagion. We focus on public Twitter data, where we find that for every health-related message there are more than 1,000 unrelated ones. This class imbalance makes classification particularly challenging. Nonetheless, we present a framework that accurately identifies sick individuals from the content of online communication. Evaluation on a sample of 2.5 million geo-tagged Twitter messages shows that social ties to infected, symptomatic people, as well as the intensity of recent co-location, sharply increase one's likelihood of contracting the illness in the near future. To our knowledge, this work is the first to model the interplay of social activity, human mobility, and the spread of infectious disease in a large real-world population. Furthermore, we provide the first quantifiable estimates of the characteristics of disease transmission on a large scale without active user participation, a step towards our ability to model and predict the emergence of global epidemics from day-to-day interpersonal interactions.
Coping with the Document Frequency Bias in Sentiment Classification
Rafrafi, Abdelhalim (University Pierre et Marie Curie) | Guigue, Vincent (University Pierre et Marie Curie) | Gallinari, Patrick (University Pierre et Marie Curie)
In this article, we study the polarity detection problem using linear supervised classifiers. We show the benefit of penalizing document frequencies in the regularization process to increase accuracy. We propose a systematic comparison of different loss and regularization functions on this particular task using the Amazon dataset. Then, we evaluate our models according to three criteria: accuracy, sparsity and subjectivity. Subjectivity is measured by projecting our dictionary and optimized weight vector on the SentiWordNet lexicon. This original approach highlights a bias in the selection of the relevant terms during the regularization procedure: frequent terms are overweighted compared to their intrinsic subjectivities. We show that this bias appears whatever the chosen loss or regularization and on all datasets: it is closely linked to the gradient descent technique. Penalizing the document frequency during the learning step enables us to significantly improve performance. Many sentiment markers appear rarely and are thus undervalued by statistical learning algorithms. Explicitly boosting their influence increases accuracy in the sentiment classification task.
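The idea of penalizing document frequency in the regularizer can be sketched as follows in Python. This is only one plausible form, assumed for illustration: a logistic loss with an L2 penalty whose per-term strength is proportional to the term's document frequency. The function names and the exact penalty are hypothetical and are not taken from the paper.

```python
import math


def document_frequency(docs, vocab_size):
    """Fraction of documents containing each term; docs are lists of term indices."""
    counts = [0] * vocab_size
    for doc in docs:
        for j in set(doc):  # count each term at most once per document
            counts[j] += 1
    return [c / len(docs) for c in counts]


def penalized_logistic_loss(w, X, y, df, lam):
    """Logistic loss plus an L2 penalty scaled per term by document frequency.

    X: list of dense feature vectors; y: labels in {-1, +1};
    df: per-term document frequencies; lam: regularization strength.
    """
    loss = 0.0
    for x, label in zip(X, y):
        margin = label * sum(wj * xj for wj, xj in zip(w, x))
        loss += math.log1p(math.exp(-margin))
    # Frequent terms pay a larger price for the same weight magnitude.
    penalty = sum(d * wj * wj for d, wj in zip(df, w))
    return loss + lam * penalty
```

Under this form, a weight placed on a rare term costs less than the same weight on a frequent term, which is the direction of the correction the abstract argues for.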
Evolution of Experts in Question Answering Communities
Pal, Aditya (University of Minnesota) | Chang, Shuo (University of Minnesota) | Konstan, Joseph A. (University of Minnesota)
Community Question Answering (CQA) services thrive as a result of a small number of highly active users, typically called experts, who provide a large number of high-quality, useful answers. Understanding the temporal dynamics and interactions between experts can present key insights into how community members evolve over time. In this paper, we present a temporal study of experts in CQA and analyze the changes in their behavioral patterns over time. Further, using unsupervised machine learning methods, we show interesting evolution patterns that can help us distinguish experts from one another. Using supervised classification methods, we show that models based on the evolutionary data of users can be more effective at expert identification than models that ignore evolution. We run our experiments on two large online CQA services to show the generality of our proposed approach.