Learning Hash Functions for Cross-View Similarity Search

AAAI Conferences

Many applications in Multilingual and Multimodal Information Access involve searching large databases of high-dimensional data objects with multiple (conditionally independent) views. In this work we consider the problem of learning hash functions for similarity search across the views for such applications. We propose a principled method for learning a hash function for each view given a set of multiview training data objects. The hash functions map similar objects to similar codes across the views, thus enabling cross-view similarity search. We present results from an extensive empirical study of the proposed approach, demonstrating its effectiveness on Japanese-language People Search and Multilingual People Search problems.
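
To make the setup concrete, the sketch below learns one linear projection per view with a CCA-style objective and sign-thresholds the projections into binary codes; the CCA choice, the dimensions, and the toy data are assumptions for illustration, not the authors' exact method:

```python
# A minimal sketch of cross-view hashing, assuming paired training data and a
# CCA-style objective; an illustration, not the paper's exact formulation.
import numpy as np

def cca_projections(X, Y, n_bits, reg=1e-3):
    """Learn one linear projection per view so that paired objects align."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both views, then take top singular directions of the cross-covariance.
    Lx_inv_T = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Ly_inv_T = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, _, Vt = np.linalg.svd(Lx_inv_T.T @ Cxy @ Ly_inv_T)
    return Lx_inv_T @ U[:, :n_bits], Ly_inv_T @ Vt.T[:, :n_bits]

def hash_codes(Z, W):
    """Sign-threshold the projections to get binary codes."""
    return (Z @ W > 0).astype(np.uint8)

# Toy usage: two conditionally independent views of the same 200 objects.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))
X = latent @ rng.normal(size=(8, 30)) + 0.1 * rng.normal(size=(200, 30))
Y = latent @ rng.normal(size=(8, 20)) + 0.1 * rng.normal(size=(200, 20))
Wx, Wy = cca_projections(X, Y, n_bits=16)
hx = hash_codes(X - X.mean(0), Wx)
hy = hash_codes(Y - Y.mean(0), Wy)
# Cross-view search: Hamming distance from a query code in view X
# to all database codes in view Y.
hamming = (hx[0] != hy).sum(axis=1)
print("nearest cross-view match for object 0:", hamming.argmin())
```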


Astroinformatics of galaxies and quasars: a new general method for photometric redshifts estimation

arXiv.org Machine Learning

With the availability of the huge amounts of data produced by current and future large multi-band photometric surveys, photometric redshifts have become a crucial tool for extragalactic astronomy and cosmology. In this paper we present a novel method, called Weak Gated Experts (WGE), which derives photometric redshifts through a combination of data mining techniques. The WGE, like many other machine learning techniques, relies on a spectroscopic knowledge base composed of sources for which a spectroscopic value of the redshift is available. The method achieves a variance \sigma^2(\Delta z) = 2.3 \times 10^{-4} for the reconstruction of the photometric redshifts of optical galaxies from the SDSS, and \sigma^2(\Delta z) = 0.08 for optical quasars, where \Delta z = z_{phot} - z_{spec}; the Root Mean Square (RMS) of the \Delta z distributions for the two experiments is 0.021 and 0.35, respectively. The WGE also provides a mechanism for estimating the accuracy of each photometric redshift. We also present and discuss the catalogs obtained for the optical SDSS galaxies, for the optical candidate quasars extracted from the DR7 SDSS photometric dataset (the sample of SDSS sources on which the accuracy of the reconstruction has been assessed is composed of bright sources, for a subset of which spectroscopic redshifts have been measured), and for optical SDSS candidate quasars observed by GALEX in the UV range. The WGE method exploits the new technological paradigm provided by the Virtual Observatory and the emerging field of Astroinformatics.
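
The quoted figures follow from the definition \Delta z = z_{phot} - z_{spec}; the short sketch below computes both statistics on simulated predictions (the z_phot values here are synthetic, standing in for the trained ensemble's output):

```python
# A minimal sketch of the accuracy metrics quoted above, computed on toy data.
import numpy as np

rng = np.random.default_rng(1)
z_spec = rng.uniform(0.0, 0.7, size=10_000)           # spectroscopic "truth"
z_phot = z_spec + rng.normal(0.0, 0.02, size=10_000)  # hypothetical predictions

delta_z = z_phot - z_spec
variance = np.var(delta_z)           # sigma^2(Delta z)
rms = np.sqrt(np.mean(delta_z**2))   # RMS of the Delta z distribution
print(f"sigma^2(Delta z) = {variance:.2e}, RMS = {rms:.3f}")
```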


The Party Is Over Here: Structure and Content in the 2010 Election

AAAI Conferences

In this work, we study the use of Twitter by House, Senate, and gubernatorial candidates during the 2010 midterm elections in the U.S. Our data includes almost 700 candidates and over 690k documents that they produced and cited in the 3.5 years leading up to the elections. We utilize graph and text mining techniques to analyze differences between Democrat, Republican, and Tea Party candidates, and suggest a novel use of language modeling for estimating content cohesiveness. Our findings show significant differences in the usage patterns of social media, and suggest that conservative candidates used this medium more effectively, conveying a coherent message and maintaining a dense graph of connections. Despite the lack of party leadership, we find that Tea Party members display both structural and language-based cohesiveness. Finally, we investigate the relation between network structure, content, and election results by creating a proof-of-concept model that predicts candidate victory with an accuracy of 88.0%.
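
One plausible reading of "language modeling for estimating content cohesiveness" is sketched below: score a group by the average cross-entropy of each member's text under the group's smoothed unigram model, lower meaning more cohesive. This is an illustration, not necessarily the paper's exact formulation:

```python
# Hedged sketch: cohesiveness as mean per-word cross-entropy under a
# smoothed unigram model of the whole group (lower = more cohesive).
from collections import Counter
import math

def unigram_model(texts, alpha=1.0):
    """Add-alpha smoothed unigram model over the group's combined text."""
    counts = Counter(w for t in texts for w in t.split())
    total, vocab = sum(counts.values()), len(counts)
    return lambda w: (counts[w] + alpha) / (total + alpha * (vocab + 1))

def cohesiveness(texts):
    group = unigram_model(texts)
    scores = []
    for t in texts:
        words = t.split()
        scores.append(-sum(math.log2(group(w)) for w in words) / len(words))
    return sum(scores) / len(scores)

# Toy usage with invented snippets: a group repeating one message scores
# lower (more cohesive) than a group with divergent messages.
on_message = ["cut spending lower taxes", "lower taxes cut spending now"]
mixed = ["cut spending lower taxes", "healthcare reform public option"]
print(cohesiveness(on_message), "<", cohesiveness(mixed))
```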


Improving Text Clustering with Social Tagging

AAAI Conferences

Lately several web-based tagging systems such as Technorati, Flickr or Delicious have become very popular. In this paper we will exploit the information created by the community in Delicious: a social bookmarking service where the users can save the URLs of their favourite webpages, offering also the possibility of associating tags to them. On the other hand, clustering methods are a very important data mining tool in order to exploit the knowledge present in data collections. Another important question is the absoluteness of the constraints. Even if we use this approach to turn tags into constraints, a fair amount of them are bound to be inaccurate (i.e., linking documents which should not be in the same cluster) until a high value of the parameter t, due to the polysemy of the terms used as tags or to differences in the criteria of the taggers. Consequently, we have used soft positive constraints, meaning that the documents affected by one of them are likely to be in the same cluster, without forcing the clustering algorithm to actually put them so.
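
A minimal sketch of soft positive constraints in a k-means-style clusterer is given below: violating a must-link adds a penalty to the assignment cost instead of being forbidden. The penalty weight and the toy data are invented for the example:

```python
# Hedged sketch of soft must-link constraints: a violated constraint costs
# `penalty` rather than being disallowed outright.
import numpy as np

def soft_constrained_kmeans(X, k, must_links, penalty=1.0, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i, x in enumerate(X):
            cost = ((centers - x) ** 2).sum(axis=1)
            # Soft constraint: penalize every cluster that separates i from
            # a must-linked partner (pairs listed in both directions).
            for j in (b for a, b in must_links if a == i):
                cost += penalty * (np.arange(k) != labels[j])
            labels[i] = cost.argmin()
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Toy usage: two documents share a tag, so they get a soft must-link that
# nudges (but does not force) them into the same cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.5, 0.5]])
print(soft_constrained_kmeans(X, k=2, must_links=[(4, 0), (0, 4)]))
```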


Using Hierarchical Community Structure to Improve Community-Based Message Routing

AAAI Conferences

Information about community structure can be useful in a variety of mobile web applications. For instance, it has been shown that community-based methods can be more effective than alternatives for routing messages in delay-tolerant networks. In this paper we present initial research showing that information about hierarchical community structure can further improve the effectiveness of message routing. This is interesting because, despite much previous work on the topic, there have been few concrete applications that exploit hierarchical community structure.
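
As a toy illustration of the idea, the sketch below represents each node's community membership as a path in a hierarchy and picks as relay the encountered node that shares the deepest community with the destination; the hierarchy and the relay rule are assumptions for the example, not the paper's algorithm:

```python
# Hedged sketch: prefer relays that share the deepest community with the
# destination. The community hierarchy here is hand-made for illustration.
communities = {
    "a": ["campus", "cs-dept", "lab-1"],
    "b": ["campus", "cs-dept", "lab-2"],
    "c": ["campus", "math-dept"],
    "d": ["campus", "cs-dept", "lab-2"],
}

def shared_depth(u, v):
    """Depth of the deepest community both nodes belong to."""
    depth = 0
    for cu, cv in zip(communities[u], communities[v]):
        if cu != cv:
            break
        depth += 1
    return depth

def choose_relay(encountered, dest):
    """Among currently encountered nodes, pick the one closest to dest in the hierarchy."""
    return max(encountered, key=lambda n: shared_depth(n, dest))

# Node "a" carries a message for "d" and meets "b" and "c": "b" shares the
# lab-2 community with "d", so it is the better relay.
print(choose_relay(["b", "c"], dest="d"))  # -> "b"
```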


Sentiment Flow Through Hyperlink Networks

AAAI Conferences

How does sentiment flow through hyperlink networks? Earlier work on hyperlink networks has focused on the structure of the network, often modeling posts as nodes in a directed graph in which edges represent hyperlinks. At the same time, sentiment analysis has largely focused on classifying texts in isolation. Here we analyze a large hyperlinked network of mass media and weblog posts to determine how sentiment features of a post affect the sentiment of connected posts and the structure of the network itself. We explore the phenomena of sentiment flow through experiments on a graph containing nearly 8 million nodes and 15 million edges. Our analysis indicates that (1) nodes are strongly influenced by their immediate neighbors, (2) deep cascades lead complex but predictable lives, (3) shallow cascades tend to be objective, and (4) sentiment becomes more polarized as depth increases.
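
As a toy illustration of finding (1), the sketch below simulates posts that partially copy the sentiment of a post they link to, then measures the correlation between a post and its linked neighbor; the 0.7/0.3 mixing weights and the link-to-earlier-post structure are invented for the example:

```python
# Hedged sketch: measure how strongly a post's sentiment tracks the
# sentiment of the post it links to, on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
sentiment = np.empty(n)
target = np.empty(n, dtype=int)
sentiment[0], target[0] = rng.normal(), 0
for i in range(1, n):
    target[i] = rng.integers(0, i)  # each post links to an earlier post
    # Simulated influence: partially copy the linked post's sentiment.
    sentiment[i] = 0.7 * sentiment[target[i]] + 0.3 * rng.normal()

r = np.corrcoef(sentiment[1:], sentiment[target[1:]])[0, 1]
print(f"post vs. linked-neighbor sentiment correlation: {r:.2f}")
```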


Latent Set Models for Two-Mode Network Data

AAAI Conferences

Two-mode networks are a natural representation for many kinds of relational data. These networks are bipartite graphs consisting of two distinct sets ("modes") of entities. For example, one can model multiple-recipient email data as a two-mode network of (a) individuals and (b) the emails that they send or receive. In this work we present a statistical model for two-mode network data which posits that individuals belong to latent sets and that the members of a particular set tend to co-appear. We show how to infer these latent sets from observed data using a Markov chain Monte Carlo inference algorithm. We apply the model to the Enron email corpus, using it to discover interpretable latent structure and evaluating its predictive accuracy on a missing-data task. We discuss extensions to the model that incorporate additional side information, such as the email's sender or text content, further improving the accuracy of the model.
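
To make the generative story concrete, here is a minimal simulation sketch with hand-picked latent sets and inclusion probabilities; the paper infers such sets from data with MCMC, which is not shown here:

```python
# Hedged sketch of the generative story: an email picks a latent set, whose
# members co-appear with high probability; everyone else appears rarely.
# The set assignments and probabilities are illustrative choices.
import random

random.seed(0)
people = list(range(12))
latent_sets = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]

def sample_email(p_member=0.8, p_noise=0.05):
    s = random.choice(latent_sets)
    return {p for p in people
            if random.random() < (p_member if p in s else p_noise)}

emails = [sample_email() for _ in range(200)]
# Members of the same latent set should co-appear far more often than others.
within = sum(1 for e in emails if {0, 1} <= e)
across = sum(1 for e in emails if {0, 4} <= e)
print(f"co-appearances within a set: {within}, across sets: {across}")
```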


Large-Scale Community Detection on YouTube for Topic Discovery and Exploration

AAAI Conferences

Detecting coherent, well-connected communities in large graphs provides insight into the graph structure and can serve as the basis for content discovery. Clustering is a popular technique for community detection, but global algorithms that examine the entire graph do not scale. Local algorithms are highly parallelizable but perform sub-optimally, especially in applications where we need to optimize multiple metrics. We present a multi-stage algorithm based on local clustering that is highly scalable, combining a pre-processing stage, a local clustering stage, and a post-processing stage. We apply it to the YouTube video graph to generate named clusters of videos with coherent content. We formalize coverage, coherence, and connectivity metrics and evaluate the quality of the algorithm for large YouTube graphs. Our use of local algorithms for global clustering, and their implementation and practical evaluation at such a large scale, is a first of its kind.
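
A hedged sketch of a local-clustering stage follows, assuming a simple greedy growth that minimizes conductance around a seed node; the actual multi-stage pipeline, its metrics, and its parallel execution are richer than this:

```python
# Hedged sketch: grow a cluster around a seed, greedily adding the neighbor
# that most improves conductance, and stop when no neighbor helps.
def conductance(adj, cluster):
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    vol = sum(len(adj[u]) for u in cluster)
    total = sum(len(vs) for vs in adj.values())
    denom = min(vol, total - vol)
    return cut / denom if denom > 0 else 1.0

def grow_cluster(adj, seed, max_size=10):
    cluster = {seed}
    while len(cluster) < max_size:
        frontier = {v for u in cluster for v in adj[u]} - cluster
        if not frontier:
            break
        best = min(frontier, key=lambda v: conductance(adj, cluster | {v}))
        if conductance(adj, cluster | {best}) >= conductance(adj, cluster):
            break  # purely local stopping rule
        cluster.add(best)
    return cluster

# Toy video graph: two dense triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(grow_cluster(adj, seed=0))  # -> {0, 1, 2}
```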


Scalable Event-Based Clustering of Social Media Via Record Linkage Techniques

AAAI Conferences

We tackle the problem of grouping content available in social media applications such as Flickr, YouTube, Panoramio, etc. into clusters of documents describing the same event. This task has been referred to as event identification before. We present a new formalization of the event identification task as a record linkage problem and show that this formulation leads to a principled and highly efficient solution. We present results on two datasets derived from Flickr (last.fm and upcoming), comparing the results in terms of Normalized Mutual Information and F-Measure against several baselines, and showing that a record linkage approach outperforms all baselines as well as a state-of-the-art system. We demonstrate that our approach can scale to large amounts of data, reducing the processing time considerably compared to a state-of-the-art approach. The scalability is achieved by applying an appropriate blocking strategy and relying on a Single Linkage clustering algorithm, which avoids the exhaustive computation of pairwise similarities.
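
As an illustration of the record-linkage formulation, the sketch below blocks documents by a cheap key, compares pairs only within blocks, and merges matches with union-find, which realizes single-linkage clustering without computing all pairwise similarities; the field names, the day-based blocking key, and the Jaccard threshold are all assumptions for the example:

```python
# Hedged sketch: blocking + within-block matching + union-find merging.
from itertools import combinations
from collections import defaultdict

docs = [
    {"id": 0, "day": "2010-05-01", "title": "jazz festival main stage"},
    {"id": 1, "day": "2010-05-01", "title": "jazz festival crowd"},
    {"id": 2, "day": "2010-05-01", "title": "my cat sleeping"},
    {"id": 3, "day": "2010-07-12", "title": "jazz festival main stage"},
]

def similar(a, b, threshold=0.4):
    """Jaccard similarity on title words, as a stand-in matcher."""
    wa, wb = set(a["title"].split()), set(b["title"].split())
    return len(wa & wb) / len(wa | wb) >= threshold

parent = list(range(len(docs)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

blocks = defaultdict(list)
for d in docs:
    blocks[d["day"]].append(d)  # blocking: compare same-day documents only
for block in blocks.values():
    for a, b in combinations(block, 2):
        if similar(a, b):
            parent[find(a["id"])] = find(b["id"])  # single-linkage merge

print([find(d["id"]) for d in docs])  # docs 0 and 1 share a cluster; 3 does not
```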


Information Propagation on the Web: Data Extraction, Modeling and Simulation

AAAI Conferences

This paper proposes a model of information propagation mechanisms on the Web, describing all steps of its design and use in simulation. First, the characteristics of a real network are studied, in particular its citation policies: from a network extracted from the Web by a crawling tool, distinct publishing behaviours are identified and characterised. The Zero Crossing model of information diffusion is then extended to increase its expressive power and allow it to reproduce this variety of behaviours. Experimental results based on a simulation validate the proposed extension.
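
The following sketch shows the general shape of such a simulation, with invented behaviour classes and relay probabilities; it is not the Zero Crossing model itself, whose definition and extension are given in the paper:

```python
# Hedged sketch: sites with distinct publishing behaviours relay a piece of
# information over citation links. Behaviour classes and probabilities are
# illustrative assumptions.
import random

random.seed(3)
# behaviour class -> probability of republishing information seen on a cited site
relay_prob = {"hub": 0.9, "blog": 0.4, "archive": 0.05}
sites = {i: random.choice(list(relay_prob)) for i in range(100)}
links = {i: random.sample(range(100), 5) for i in range(100)}  # who each site reads

def simulate(seed_site, steps=10):
    informed = {seed_site}
    for _ in range(steps):
        newly = {s for s, reads in links.items() if s not in informed
                 and any(r in informed for r in reads)
                 and random.random() < relay_prob[sites[s]]}
        if not newly:
            break
        informed |= newly
    return informed

print(f"{len(simulate(seed_site=0))} of 100 sites informed")
```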