Statistical Learning
mlpy: Machine Learning Python
Albanese, Davide, Visintainer, Roberto, Merler, Stefano, Riccadonna, Samantha, Jurman, Giuseppe, Furlanello, Cesare
We introduce here mlpy, a library providing access to a wide spectrum of machine learning methods implemented in Python, which has proven to be an effective environment for building scientific oriented tools (Pรฉrez et al., 2011). Although planned for general purpose applications, mlpy has the computational biology in general, and the functional genomics modeling in particular, as the elective application fields. As a major applications example, we use mlpy methods to implement molecular profiling experiments that need to warrant study reproducibility(Ioannidis et al., 2009) and flawless results(Ambroise and McLachlan, 2002). This task requires the availability of highly modular tools allowing the practioners to build an adequate workflow for the task at hand following authoritative guidelines (The MicroArray Quality Control (MAQC) Consortium, 2010). Such workflow involves a complex sequence of steps, both in the development and in the validation phases, starting from the upstream preprocessing algorithms to the downstream predictive analysis, repeated several times to accommodate the resampling schema. The dimension of highthroughtput data involved (thousands of samples described by millions of features) and the large number of replicates needed to control bias effects make also efficiency an essential requirement.
Risk Bounds for CART Classifiers under a Margin Condition
Risk bounds for Classification and Regression Trees (CART, Breiman et. al. 1984) classifiers are obtained under a margin condition in the binary supervised classification framework. These risk bounds are obtained conditionally on the construction of the maximal deep binary tree and permit to prove that the linear penalty used in the CART pruning algorithm is valid under a margin condition. It is also shown that, conditionally on the construction of the maximal tree, the final selection by test sample does not alter dramatically the estimation accuracy of the Bayes classifier. In the two-class classification framework, the risk bounds that are proved, obtained by using penalized model selection, validate the CART algorithm which is used in many data mining applications such as Biology, Medicine or Image Coding.
Protocols for Learning Classifiers on Distributed Data
Daume, Hal III, Phillips, Jeff M., Saha, Avishek, Venkatasubramanian, Suresh
We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.
Empirical estimation of entropy functionals with confidence
Sricharan, Kumar, Raich, Raviv, Hero, Alfred O. III
This paper introduces a class of k-nearest neighbor ($k$-NN) estimators called bipartite plug-in (BPI) estimators for estimating integrals of non-linear functions of a probability density, such as Shannon entropy and R\'enyi entropy. The density is assumed to be smooth, have bounded support, and be uniformly bounded from below on this set. Unlike previous $k$-NN estimators of non-linear density functionals, the proposed estimator uses data-splitting and boundary correction to achieve lower mean square error. Specifically, we assume that $T$ i.i.d. samples ${X}_i \in \mathbb{R}^d$ from the density are split into two pieces of cardinality $M$ and $N$ respectively, with $M$ samples used for computing a k-nearest-neighbor density estimate and the remaining $N$ samples used for empirical estimation of the integral of the density functional. By studying the statistical properties of k-NN balls, explicit rates for the bias and variance of the BPI estimator are derived in terms of the sample size, the dimension of the samples and the underlying probability distribution. Based on these results, it is possible to specify optimal choice of tuning parameters $M/T$, $k$ for maximizing the rate of decrease of the mean square error (MSE). The resultant optimized BPI estimator converges faster and achieves lower mean squared error than previous $k$-NN entropy estimators. In addition, a central limit theorem is established for the BPI estimator that allows us to specify tight asymptotic confidence intervals.
Active Learning of Halfspaces under a Margin Assumption
Gonen, Alon, Sabato, Sivan, Shalev-Shwartz, Shai
We derive and analyze a new, efficient, pool-based active learning algorithm for halfspaces, called ALuMA. Most previous algorithms show exponential improvement in the label complexity assuming that the distribution over the instance space is close to uniform. This assumption rarely holds in practical applications. Instead, we study the label complexity under a large-margin assumption -- a much more realistic condition, as evident by the success of margin-based algorithms such as SVM. Our algorithm is computationally efficient and comes with formal guarantees on its label complexity. It also naturally extends to the non-separable case and to non-linear kernels. Experiments illustrate the clear advantage of ALuMA over other active learning algorithms.
Cultural Analytics of Large Datasets from Flickr
Ushizima, Daniela (Lawrence Berkeley National Laboratory) | Manovich, Lev (University of California, San Diego) | Margolis, Todd (University of California, San Diego) | Douglas, Jeremy (Ashford University)
Deluge became a metaphor to describe the amount of information to which we are subjected, and very often we feel we are drowning while our access to information is rising. Devising mechanisms for exploring massive image sets according to perceptual attributes is still a challenge, even more when dealing with user-generated social media content. Such images tend to be heterogenous, and using metadata-only can be misleading. This paper describes a set of tools designed to analyze large sets of user-created art related images using image features describing color, texture, composition and orientation. The proposed pipeline permits to discriminate Flickr groups in terms of feature vectors and clustering parameters. The algorithms are general enough to be applied to other domains in which the main question is about the variability of the images.
The Pulse of News in Social Media: Forecasting Popularity
Bandari, Roja (University of California Los Angeles) | Asur, Sitaram (HP Labs) | Huberman, Bernardo A (HP Labs)
News articles are extremely time sensitive by nature. There is also intense competition among news items to propagate as widely as possible. Hence, the task of predicting the popularity of news items on the social web is both interesting and challenging. Prior research has dealt with predicting eventual online popularity based on early popularity. It is most desirable, however, to predict the popularity of items prior to their release, fostering the possibility of appropriate decision making to modify an article and the manner of its publication. In this paper, we construct a multi-dimensional feature space derived from properties of an article and evaluate the efficacy of these features to serve as predictors of online popularity. We examine both regression and classification algorithms and demonstrate that despite randomness in human behavior, it is possible to predict ranges of popularity on twitter with an overall 84% accuracy. Our study also serves to illustrate the differences between traditionally prominent sources and those immensely popular on the social web.
Social Media and Citizen Engagement in a City-State: A Study of Singapore
Skoric, Marko M. (Nanyang Technological University) | Pan, Ji (Nanyang Technological University) | Poor, Nathaniel D (Independent Scholar)
Social media plays an important role in the process of political engagement, especially in societies where significant constraints over traditional media and participation still exist. Little is known about how social media use is related to these constraints. This study examines how citizensโ perceptions of government control predict social media use and how this use is related to offline participation in the context of a city-state, Singapore. Based on a national survey of 2000 respondents, we found that perceptions of control over traditional media and political activity increase content production on social media and that perceived control of the mass media motivates citizens to consume political content on social media. Interestingly, perceptions of government control over the Internet reduced rather than increased social media production. More importantly, we find that social media use is related to a greater likelihood of offline citizen participation, namely attendance of political rallies. The findings suggest that social media alters the balance of power in the dependency relationships that exist between the government, media organizations and citizens, creating new venues for online political discourse which in turn help promote real-world political participation.
A Systematic Investigation of Blocking Strategies for Real-Time Classification of Social Media Content into Events
Reuter, Timo (CITEC, Universität Bielefeld) | Cimiano, Philipp (CITEC, Universität Bielefeld)
Events play a prominent role in our lives, such that many social media documents describe or are related to some event. Organizing social media documents with respect to events thus seems a promising approach to better manage and organize the ever-increasing amount of user-generated content in social media applications. It would support the navigation of data by events or allow one to get notified about new postings related to the events one is interested in, just to name two applications. A challenge is to automatize this process so that incoming documents can be assigned to their corresponding event without any user intervention. We present a system that is able to classify a stream of social media data into a growing and evolving set of events. In order to scale up to the data sizes and data rates in social media applications, the use of a candidate retrieval or blocking step is crucial to reduce the number of events that are considered as potential candidates to which the incoming data point could belong to.In this paper we present and experimentally compare different blocking strategies along their cost vs. effectiveness tradeoff.We show that using a blocking strategy that selects the 60 closest events with respect to upload time, we reach F-Measures of about 85.1% while being able to process the incoming documents within 32ms on average. We thus provide a principled approach supporting to scale up classification of social media documents into events and to process the incoming stream of documents in real time.
Transductive Learning for Real-Time Twitter Search
Zhang, Xin (Graduate University of Chinese Academy of Sciences) | He, Ben (Graduate University of Chinese Academy of Sciences) | Luo, Tiejian (Graduate University of Chinese Academy of Sciences)
Recency is an important dimension of relevance for real-time Twitter search as users tend to be interested in fresh news and events. By incorporating various sources of evidence, the application of learning to rank (LTR) algorithms to real-time Twitter search has shown beneficial in finding not only relevant, but also recent tweets in response to given queries. However, the potential effectiveness brought by LTR may not have been fully exploited due to the lack of labeled data available for properly learning a ranking model, since human labels are expensive in real-world applications. To this end, this paper proposes a transductive algorithm that incrementally aggregate the labeled tweets through an iterative process. Experimental results on the standard Tweets11 dataset show that our approach is able to outperform strong baselines without the use of human labels.