Catching the Long-Tail: Extracting Local News Events from Twitter
Agarwal, Puneet (TCS Innovation Labs, Delhi) | Vaithiyanathan, Rajgopal (TCS Innovation Labs, Delhi) | Sharma, Saurabh (TCS Innovation Labs, Delhi) | Shroff, Gautam (TCS Innovation Labs, Delhi)
Twitter, used in 200 countries with over 250 million tweets a day, is a rich source of local news from around the world. Many events of local importance are first reported on Twitter, including many that never reach news channels. Further, there are often only a few tweets reporting each such event, in contrast with the larger volumes that follow events of wider significance. Even though such events may be primarily of local importance, they can also be of critical interest to some specific but possibly far-flung entities: for example, a fire in a supplier's factory half-way around the world may be of interest even from afar. In this paper we describe how this "long tail" of events can be detected in spite of their sparsity. We then extract and correlate information from multiple tweets describing the same event. Our generic architecture for converting a tweet-stream into event-objects uses locality sensitive hashing, classification, boosting, information extraction and clustering. Our results, based on millions of tweets monitored over many months, appear to validate our approach and architecture: we achieved success-rates in the 80% range for event detection and 76% on event-correlation; we also reduced tweet-comparisons by 80% using LSH.
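The abstract above credits locality sensitive hashing (LSH) with cutting tweet comparisons but does not say which LSH scheme was used. The following is a minimal sketch of one common variant, MinHash with banding over word shingles, chosen here purely as an illustrative assumption. The function names (`shingles`, `minhash_signature`, `candidate_pairs`), the parameter values, and the example tweets are all hypothetical and do not come from the paper.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of lowercase word k-grams of a tweet (assumed tokenization)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=32):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def candidate_pairs(tweets, num_hashes=32, bands=16):
    """Return index pairs of tweets that share at least one LSH band bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, text in enumerate(tweets):
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Tweets whose signatures agree on a whole band land in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Hypothetical example tweets, not taken from the paper's data.
tweets = [
    "Fire at the supplier factory in Pune",
    "fire at the supplier factory in pune",
    "Heavy rain expected in London today",
]
pairs = candidate_pairs(tweets)
```

Only the pairs returned by `candidate_pairs` would then need a full similarity comparison, which is the general way LSH avoids comparing every tweet against every other tweet.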
On the Geo-Indicativeness of Non-Georeferenced Text
Adams, Benjamin (University of California, Santa Barbara) | Janowicz, Krzysztof (University of California, Santa Barbara)
Geographic location is a key component for information retrieval on the Web, recommendation systems in mobile computing and social networks, and place-based integration on the Linked Data cloud. Previous work has addressed how to estimate locations by named entity recognition, from images, and via structured data. In this paper, we estimate geographic regions from unstructured, non geo-referenced text by computing a probability distribution over the Earth's surface. Our methodology combines natural language processing, geostatistics, and a data-driven bottom-up semantics. We illustrate its potential for mapping geographic regions from non geo-referenced text.
The YouTube Social Network
Wattenhofer, Mirjam (Google Zurich) | Wattenhofer, Roger (ETH Zurich) | Zhu, Zack (ETH Zurich)
Today, YouTube is the largest user-driven video content provider in the world; it has become a major platform for disseminating multimedia information. A major contribution to its success comes from the user-to-user social experience that differentiates it from traditional content broadcasters. This work examines the social network aspect of YouTube by measuring the full-scale YouTube subscription graph, comment graph, and video content corpus. We find that YouTube deviates significantly from the network characteristics that mark traditional online social networks, such as homophily, reciprocative linking, and assortativity. However, compared with the reported characteristics of another content-driven online social network, Twitter, YouTube is remarkably similar. Examining the social and content facets of user popularity, we find a stronger correlation between a user's social popularity and his/her most popular content as opposed to typical content popularity. Finally, we demonstrate an application of our measurements for classifying YouTube Partners, who are selected users that share YouTube's advertisement revenue. Results are encouraging despite the highly imbalanced nature of the classification problem.
Coping with the Document Frequency Bias in Sentiment Classification
Rafrafi, Abdelhalim (University Pierre et Marie Curie) | Guigue, Vincent (University Pierre et Marie Curie) | Gallinari, Patrick (University Pierre et Marie Curie)
In this article, we study the polarity detection problem using linear supervised classifiers. We show the benefit of penalizing document frequencies in the regularization process to increase accuracy. We propose a systematic comparison of different loss and regularization functions on this particular task using the Amazon dataset. Then, we evaluate our models according to three criteria: accuracy, sparsity and subjectivity. Subjectivity is measured by projecting our dictionary and optimized weight vector on the SentiWordNet lexicon. This original approach highlights a bias in the selection of the relevant terms during the regularization procedure: frequent terms are overweighted compared to their intrinsic subjectivities. We show that this bias appears whatever the chosen loss or regularization and on all datasets: it is closely linked to the gradient descent technique. Penalizing the document frequency during the learning step enables us to significantly improve performance. Many sentiment markers appear rarely and are thus undervalued by statistical learning algorithms. Explicitly boosting their influence increases accuracy in the sentiment classification task.
On the Study of Social Interactions in Twitter
Macskassy, Sofus A. (University of Southern California)
Twitter and other social media platforms are increasingly used as the primary way in which people speak with each other. As opposed to other platforms, Twitter is interesting in that many of these dialogues are public, and so we can get a view into the dynamics of dialogues and how they differ from other tweet behaviors. We here analyze tweets gathered from 2400 Twitter streams over a one-month period. We study social interactions in three important dimensions: what are the salient user behaviors in terms of how often they have social interactions and how these interactions are spread among different people; what are the characteristics of the dialogues, or sets of tweets, that we can extract from these interactions; and what are the characteristics of the social network which emerges from considering these interactions? We find that roughly half of the users spend a fair amount of time interacting whereas 40% of users do not seem to have active interactions. We also find that the vast majority of active dialogues only involve two people despite the public nature of these tweets. We finally find that while the emerging social network does contain a giant component, the component clearly is a set of well-defined tight clusters which are loosely connected.
Modeling Diffusion in Social Networks Using Network Properties
Luu, Duc Minh (Singapore Management University) | Lim, Ee-Peng (Singapore Management University) | Hoang, Tuan-Anh (Singapore Management University) | Chua, Freddy Chong Tat (Singapore Management University)
Diffusion of items occurs in social networks due to spreading of items through word of mouth and exogenous factors. These items may be news, products, videos, advertisements or contagious viruses. Previous research has studied the diffusion process at both the macro and micro levels. The former models the number of item adopters in the diffusion process while the latter determines which individuals adopt the item. In this paper, we establish a general probabilistic framework, which can be used to derive macro-level diffusion models, including the well-known Bass Model (BM). Using this framework, we develop several other models considering the social network's degree distribution coupled with the assumption of linear influence by neighboring adopters in the diffusion process. Through evaluation on synthetic data, we show that the degree distribution actually changes during the diffusion process. We therefore introduce a multi-stage diffusion model to cope with variable degree distribution. By conducting experiments on both synthetic and real datasets, we show that our proposed diffusion models can recover the diffusion parameters from the observed diffusion data, which allows us to model diffusion with high accuracy.
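For readers unfamiliar with the Bass Model named in the abstract above, the sketch below simulates its standard discrete-time form, in which new adopters per step equal (p + q * N / m) * (m - N) for cumulative adopters N, market size m, innovation coefficient p and imitation coefficient q. This is the textbook macro-level model only, not the probabilistic framework or the degree-distribution models proposed in the paper. The function name `bass_adoption` and the parameter values are illustrative assumptions.

```python
def bass_adoption(p, q, m, steps):
    """Cumulative adopters over time under a discrete-time Bass model.

    p: coefficient of innovation (adoption driven by external influence)
    q: coefficient of imitation (adoption driven by existing adopters)
    m: total number of potential adopters
    """
    cumulative = [0.0]
    for _ in range(steps):
        n = cumulative[-1]
        # Hazard of adoption grows linearly with the adopted fraction n / m.
        new_adopters = (p + q * n / m) * (m - n)
        cumulative.append(n + new_adopters)
    return cumulative


# Hypothetical parameter values chosen only to produce an S-shaped curve.
curve = bass_adoption(p=0.03, q=0.38, m=1000, steps=50)
```

Plotting `curve` against the step index gives the familiar S-shaped cumulative adoption curve: slow uptake driven by p, then acceleration driven by q, then saturation as the pool of non-adopters empties.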
OMG, I Have to Tweet that! A Study of Factors that Influence Tweet Rates
Kıcıman, Emre (Microsoft Research)
Many studies have shown that social data such as tweets are a rich source of information about the real world, including, for example, insights into health trends. A key limitation when analyzing Twitter data, however, is that it depends on people self-reporting their own behaviors and observations. In this paper, we present a large-scale quantitative analysis of some of the factors that influence self-reporting bias. In our study, we compare a year of tweets about weather events to ground-truth knowledge about actual weather occurrences. For each weather event we calculate how extreme, how expected, and how big a change the event represents. We calculate the extent to which these factors can explain the daily variations in tweet rates about weather events. We find that we can build global models that take into account basic weather information, together with extremeness, expectation and change calculations, to account for over 40% of the variability in tweet rates. We also build location-specific models (i.e., one model per metropolitan area) that account for an average of 70% of the variability in tweet rates.
Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia
Jurgens, David (University of California, Los Angeles and HRL Laboratories, LLC) | Lu, Tsai-Ching (HRL Laboratories, LLC)
Wikipedia is a collaborative setting with both combative and cooperative editing. We propose a new method for investigating the types of editor interactions using a novel representation of Wikipedia's revision history as a temporal, bipartite network with multiple node and edge types for users and revisions. From this representation we identify significant author interactions as network motifs and show how the motif types capture important, diverse editing behaviors. Two experiments demonstrate the further benefit of motifs. First, we demonstrate significant performance improvement over a purely revision-based analysis in classifying pages as combative or cooperative by using motifs; second, we use motifs as a basis for analyzing trends in the dynamics of editor behavior to explain Wikipedia's content growth.
The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City
Cranshaw, Justin (Carnegie Mellon University) | Schwartz, Raz (Carnegie Mellon University) | Hong, Jason (Carnegie Mellon University) | Sadeh, Norman (Carnegie Mellon University)
Studying the social dynamics of a city on a large scale has traditionally been a challenging endeavor, requiring long hours of observation and interviews, usually resulting in only a partial depiction of reality. At the same time, the boundaries of municipal organizational units, such as neighborhoods and districts, are largely statically defined by the city government and do not always reflect the character of life in these areas. To address both difficulties, we introduce a clustering model and research methodology for studying the structure and composition of a city based on the social media its residents generate. We use data from approximately 18 million check-ins collected from users of a location-based online social network. The resulting clusters, which we call Livehoods, are representations of the dynamic urban areas that comprise the city. We take an interdisciplinary approach to validating these clusters, interviewing 27 residents of Pittsburgh, PA, to see how their perceptions of the city project onto our findings there. Our results provide strong support for the discovered clusters, showing how Livehoods reveal the distinctly characterized areas of the city and the forces that shape them.
Grief-Stricken in a Crowd: The Language of Bereavement and Distress in Social Media
Brubaker, Jed R. (University of California, Irvine) | Kivran-Swaine, Funda (Rutgers University) | Taber, Lee (University of California, Irvine) | Hayes, Gillian R. (University of California, Irvine)
People turn to social media to express their emotions surrounding major life events. Death of a loved one is one scenario in which people share their feelings in the semi-public space of social networking sites. In this paper, we present the results of a two-part investigation of grief and distress in the context of messages posted to the profiles of deceased MySpace users. We present a coding system for identifying emotionally distressed content, followed by a detailed analysis of language use that lays a foundation for natural language processing (NLP) tasks, such as automatic detection of bereavement-related distress. Our findings suggest that in addition to words bearing positive or negative sentiment, linguistic style can be an indicator of messages that demonstrate distress in the space of post-mortem social media content. These results contribute to research in computational linguistics by identifying linguistic features that can be used for automatic classification as well as to research on death and bereavement by enumerating attributes of distressed self-expression in a post-mortem context.