Improving Text Clustering with Social Tagging

AAAI Conferences

Lately several web-based tagging systems such as Technorati, Flickr or Delicious have become very popular. In this paper we will exploit the information created by the community in Delicious: a social bookmarking service where the users can save the URLs of their favourite webpages offering also the possibility of associating tags to them. On the other hand the clustering methods are a very important data mining tool in order to exploit the knowledge present in data collections. In the last years a new family of ...

Another important question is the absoluteness of the constraints. Even if we use this approach to turn tags into constraints, a fair amount of them are bound to be inaccurate (i.e., linking documents which should not be in the same cluster) until a high value of the parameter t, due to the polysemy of the terms used as tags or to differences in the criteria of the taggers. Consequently, we have used soft positive constraints, meaning that the documents affected by one of them are likely to be in the same cluster, without forcing the clustering algorithm to actually put them so.
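A minimal sketch of how tags might be turned into soft positive constraints. It assumes, which the excerpt does not state, that t is the minimum number of tags two documents must share to be linked, and that a violated soft constraint adds a penalty to the clustering cost rather than being forbidden. The names `soft_positive_constraints` and `violation_penalty`, the weight `w`, and the toy data are all hypothetical and are not taken from the paper.

```python
from itertools import combinations

def soft_positive_constraints(doc_tags, t):
    """Return pairs of documents that share at least t tags.

    Assumption: t is a shared-tag threshold; the source only calls it a parameter.
    """
    constraints = set()
    for a, b in combinations(sorted(doc_tags), 2):
        if len(doc_tags[a] & doc_tags[b]) >= t:
            constraints.add((a, b))
    return constraints

def violation_penalty(assignment, constraints, w=1.0):
    """Soft constraint: each linked pair placed in different clusters costs w."""
    return w * sum(1 for a, b in constraints if assignment[a] != assignment[b])

# Toy data: "snake" shows how a polysemous tag could link unrelated documents.
doc_tags = {
    "d1": {"python", "programming", "tutorial"},
    "d2": {"python", "programming", "snake"},
    "d3": {"snake", "zoo"},
}
constraints = soft_positive_constraints(doc_tags, t=2)
penalty = violation_penalty({"d1": 0, "d2": 1, "d3": 1}, constraints)
```

With `t=1` the pair (d2, d3) would also be linked through the ambiguous tag "snake", which illustrates why low thresholds can yield inaccurate constraints and why a penalty is gentler than a hard must-link.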


Using Hierarchical Community Structure to Improve Community-Based Message Routing

AAAI Conferences

Information about community structure can be useful in a variety of mobile web applications. For instance, it has been shown that community-based methods can be more effective than alternatives for routing messages in delay-tolerant networks. In this paper we present initial research that shows that information on hierarchical structures in communities can further improve the effectiveness of message routing. This is interesting because despite much previous work on the topic, there have been few concrete applications which exploit hierarchical community structure.


Sentiment Flow Through Hyperlink Networks

AAAI Conferences

How does sentiment flow through hyperlink networks? Earlier work on hyperlink networks has focused on the structure of the network, often modeling posts as nodes in a directed graph in which edges represent hyperlinks. At the same time, sentiment analysis has largely focused on classifying texts in isolation. Here we analyze a large hyperlinked network of mass media and weblog posts to determine how sentiment features of a post affect the sentiment of connected posts and the structure of the network itself. We explore the phenomenon of sentiment flow through experiments on a graph containing nearly 8 million nodes and 15 million edges. Our analysis indicates that (1) nodes are strongly influenced by their immediate neighbors, (2) deep cascades lead complex but predictable lives, (3) shallow cascades tend to be objective, and (4) sentiment becomes more polarized as depth increases.


Latent Set Models for Two-Mode Network Data

AAAI Conferences

Two-mode networks are a natural representation for many kinds of relational data. These networks are bipartite graphs consisting of two distinct sets ("modes") of entities. For example, one can model multiple recipient email data as a two-mode network of (a) individuals and (b) the emails that they send or receive. In this work we present a statistical model for two-mode network data which posits that individuals belong to latent sets and that the members of a particular set tend to co-appear. We show how to infer these latent sets from observed data using a Markov chain Monte Carlo inference algorithm. We apply the model to the Enron email corpus, using it to discover interpretable latent structure as well as evaluating its predictive accuracy on a missing data task. Extensions to the model are discussed that incorporate additional side information such as the email's sender or text content, further improving the accuracy of the model.
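As an illustration of the data representation only, a two-mode email network can be stored as a mapping from each email to the set of individuals on it. Counting how often pairs of individuals co-appear gives the kind of raw signal a latent set model would try to explain. This sketch does not implement the paper's model or its Markov chain Monte Carlo inference; the function name `co_appearance_counts` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_appearance_counts(email_to_people):
    """Count, for each pair of individuals, the emails on which both appear."""
    counts = Counter()
    for people in email_to_people.values():
        # Sort so each unordered pair is always stored under the same key.
        for pair in combinations(sorted(people), 2):
            counts[pair] += 1
    return counts

# Mode (b): emails; mode (a): the individuals who send or receive them.
emails = {
    "e1": {"alice", "bob", "carol"},
    "e2": {"alice", "bob"},
    "e3": {"carol", "dave"},
}
counts = co_appearance_counts(emails)
```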


Large-Scale Community Detection on YouTube for Topic Discovery and Exploration

AAAI Conferences

Detecting coherent, well-connected communities in large graphs provides insight into the graph structure and can serve as the basis for content discovery. Clustering is a popular technique for community detection but global algorithms that examine the entire graph do not scale. Local algorithms are highly parallelizable but perform sub-optimally, especially in applications where we need to optimize multiple metrics. We present a multi-stage algorithm based on local-clustering that is highly scalable, combining a pre-processing stage, a local clustering stage, and a post-processing stage. We apply it to the YouTube video graph to generate named clusters of videos with coherent content. We formalize coverage, coherence, and connectivity metrics and evaluate the quality of the algorithm for large YouTube graphs. Our use of local algorithms for global clustering, and its implementation and practical evaluation on such a large scale is a first of its kind.


Scalable Event-Based Clustering of Social Media Via Record Linkage Techniques

AAAI Conferences

We tackle the problem of grouping content available in social media applications such as Flickr, YouTube, Panoramio etc. into clusters of documents describing the same event. This task has previously been referred to as event identification. We present a new formalization of the event identification task as a record linkage problem and show that this formulation leads to a principled and highly efficient solution to the problem. We present results on two datasets derived from Flickr (last.fm and upcoming), comparing the results in terms of Normalized Mutual Information and F-Measure with respect to several baselines, showing that a record linkage approach outperforms all baselines as well as a state-of-the-art system. We demonstrate that our approach can scale to large amounts of data, reducing the processing time considerably compared to a state-of-the-art approach. The scalability is achieved by applying an appropriate blocking strategy and relying on a Single Linkage clustering algorithm which avoids the exhaustive computation of pairwise similarities.
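The combination of blocking and single linkage can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: the blocking key (upload day), the similarity measure (Jaccard overlap of tags), the threshold of 0.5, and the name `cluster_events` are all hypothetical. Single linkage cut at a fixed threshold is equivalent to taking the connected components of the graph whose edges join pairs at or above that threshold, so a union-find structure suffices, and only pairs inside the same block are ever compared.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two tag sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_events(records, threshold=0.5):
    """Single-linkage clustering restricted to pairs within the same block."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Blocking: group record ids by upload day (hypothetical blocking key).
    blocks = defaultdict(list)
    for rid, (day, _tags) in records.items():
        blocks[day].append(rid)

    # Compare pairs only inside each block, never across blocks.
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if jaccard(records[a][1], records[b][1]) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())

records = {
    "p1": ("2010-05-01", {"concert", "berlin", "rock"}),
    "p2": ("2010-05-01", {"concert", "berlin"}),
    "p3": ("2010-05-01", {"football", "munich"}),
    "p4": ("2010-05-02", {"concert", "berlin", "rock"}),
}
clusters = cluster_events(records)
```

In this toy data p4 has the same tags as p1 but falls in a different block, so the two are never compared. That is the trade-off blocking makes: fewer comparisons at the risk of missing links across block boundaries.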


Information Propagation on the Web: Data Extraction, Modeling and Simulation

AAAI Conferences

This paper proposes a model of information propagation mechanisms on the Web, describing all steps of its design and use in simulation. First the characteristics of a real network are studied, in particular in terms of citation policies: from a network extracted from the Web by a crawling tool, distinct publishing behaviours are identified and characterised. The Zero Crossing model for information diffusion is then extended to increase its expressive power and allow it to reproduce this variety of behaviours. Experimental results based on a simulation validate the proposed extension.


Reconstruction of Threaded Conversations in Online Discussion Forums

AAAI Conferences

Online discussion boards, or Internet forums, are a significant part of the Internet. People use Internet forums to post questions, provide advice and participate in discussions. These online conversations are represented as threads, and the conversation trees within these threads are important in understanding the behaviour of online users. Unfortunately, the reply structures of these threads are generally not publicly accessible or not maintained. Hence, in this paper, we introduce an efficient and simple approach to reconstruct the reply structure in threaded conversations. We contrast its accuracy against three baseline algorithms, and show that our algorithm can accurately recreate the in- and out-degree distributions of forum reply graphs built from the reconstructed reply structures.


Asked and Answered: On Qualities and Quantities of Answers in Online Q&A Sites

AAAI Conferences

This paper builds upon several recent research efforts that have explored the nature and qualities of questions asked on social Q&A sites by offering a focused examination of answers posted to three of the most popular Q&A sites. Specifically, this paper examines sets of answers responding to specific types of questions and explores the degree to which question types are predictive of answer quantity and answer quality. Blending qualitative and quantitative methods, the paper builds upon rich coding of representative sets of real questions (drawn from Answerbag, (Ask) MetaFilter, and Yahoo! Answers) in order to better understand whether the explicit and implicit theories and predictions drawn from coding of these questions were borne out in the corresponding answer sets found on these sites. Quantitative findings include data underscoring the general overall success of social Q&A sites in producing answers that can satisfy the needs of those who pose questions. Additionally, this paper presents a predictive model that can anticipate the archival value of answers based on the category and qualities of questions asked. Qualitative findings include an analysis of the variation in responses to questions that are primarily seeking objective, grounded information relative to those seeking subjective opinions.


Social Mechanics: An Empirically Grounded Science of Social Media

AAAI Conferences

What will social media sites of tomorrow look like? What behaviors will their interfaces enable? A major challenge for designing new sites that allow a broader range of user actions is the difficulty of extrapolating from experience with current sites without first distinguishing correlations from underlying causal mechanisms. The growing availability of data on user activities provides new opportunities to uncover correlations among user activity, contributed content and the structure of links among users. However, such correlations do not necessarily translate into predictive models. Instead, empirically grounded mechanistic models provide a stronger basis for establishing causal mechanisms and discovering the underlying statistical laws governing social behavior. We describe a statistical physics-based framework for modeling and analyzing social media and illustrate its application to the problems of prediction and inference. We hope these examples will inspire the research community to explore these methods to look for empirically valid causal mechanisms for the observed correlations.