AITopics

1202.6548

Country:

Europe > Italy (0.17)
North America > United States (0.15)

Genre:

Research Report (0.84)
Workflow (0.75)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

arXiv.org Machine Learning Mar-1-2012

Risk Bounds for CART Classifiers under a Margin Condition

Gey, Servane

Risk bounds for Classification and Regression Trees (CART, Breiman et al. 1984) classifiers are obtained under a margin condition in the binary supervised classification framework. These risk bounds are obtained conditionally on the construction of the maximal deep binary tree and make it possible to prove that the linear penalty used in the CART pruning algorithm is valid under a margin condition. It is also shown that, conditionally on the construction of the maximal tree, the final selection by test sample does not dramatically alter the estimation accuracy of the Bayes classifier. In the two-class classification framework, the risk bounds that are proved, obtained by using penalized model selection, validate the CART algorithm, which is used in many data mining applications in fields such as biology, medicine, and image coding.

algorithm, artificial intelligence, machine learning, (20 more...)

doi: 10.1016/j.patcog.2012.02.021

0902.3130

Country: Europe (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

Daume, Hal III, Phillips, Jeff M., Saha, Avishek, Venkatasubramanian, Suresh

Protocols for Learning Classifiers on Distributed Data

arXiv.org Machine Learning Feb-27-2012

We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

artificial intelligence, classifier, machine learning, (16 more...)

1202.6078

Country: North America > United States > California (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Sricharan, Kumar, Raich, Raviv, Hero, Alfred O. III

Empirical estimation of entropy functionals with confidence

arXiv.org Machine Learning Feb-25-2012

This paper introduces a class of $k$-nearest neighbor ($k$-NN) estimators called bipartite plug-in (BPI) estimators for estimating integrals of non-linear functions of a probability density, such as Shannon entropy and Rényi entropy. The density is assumed to be smooth, have bounded support, and be uniformly bounded from below on this set. Unlike previous $k$-NN estimators of non-linear density functionals, the proposed estimator uses data-splitting and boundary correction to achieve lower mean squared error. Specifically, we assume that $T$ i.i.d. samples $X_i \in \mathbb{R}^d$ from the density are split into two pieces of cardinality $M$ and $N$ respectively, with $M$ samples used for computing a $k$-NN density estimate and the remaining $N$ samples used for empirical estimation of the integral of the density functional. By studying the statistical properties of $k$-NN balls, explicit rates for the bias and variance of the BPI estimator are derived in terms of the sample size, the dimension of the samples and the underlying probability distribution. Based on these results, it is possible to specify the optimal choice of the tuning parameters $M/T$ and $k$ for maximizing the rate of decrease of the mean squared error (MSE). The resultant optimized BPI estimator converges faster and achieves lower mean squared error than previous $k$-NN entropy estimators. In addition, a central limit theorem is established for the BPI estimator that allows us to specify tight asymptotic confidence intervals.
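The data-splitting idea in this abstract can be illustrated with a minimal sketch of a plain $k$-NN plug-in estimate of Shannon entropy. This is not the authors' BPI estimator: it omits the boundary correction and the optimized choice of $M/T$ and $k$, and the function name `knn_plugin_entropy` and its default parameters are assumptions made for illustration only.

```python
import heapq
import math
import random


def knn_plugin_entropy(samples, k=5, split=0.5, seed=0):
    """Plain k-NN plug-in Shannon entropy estimate with data splitting.

    Hypothetical helper for illustration: no boundary correction is applied.
    `samples` is a sequence of equal-length tuples (points in R^d).
    """
    pts = [tuple(p) for p in samples]
    d = len(pts[0])
    random.Random(seed).shuffle(pts)

    # Split the T samples: M points for the density estimate, N for averaging.
    m = int(split * len(pts))
    density_pts, eval_pts = pts[:m], pts[m:]

    # Volume of the unit ball in R^d.
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)

    total_log_density = 0.0
    for x in eval_pts:
        # Distance from x to its k-th nearest neighbor among the M points.
        r_k = heapq.nsmallest(k, (math.dist(x, y) for y in density_pts))[-1]
        # k-NN density estimate: k points fall in a ball of radius r_k.
        density = k / (m * unit_ball * r_k ** d)
        total_log_density += math.log(density)

    # Entropy is the negative average log-density over the N held-out points.
    return -total_log_density / len(eval_pts)


if __name__ == "__main__":
    rng = random.Random(1)
    uniform = [(rng.random(),) for _ in range(2000)]
    # The true entropy of the uniform density on [0, 1] is 0.
    print(knn_plugin_entropy(uniform, k=10))
```

Because the evaluation points are held out from the density estimate, no point is ever its own nearest neighbor, which is the practical effect of the split described in the abstract.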

artificial intelligence, estimator, machine learning, (19 more...)

1012.4188

Country: North America > United States (0.27)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.94)

Gonen, Alon, Sabato, Sivan, Shalev-Shwartz, Shai

Active Learning of Halfspaces under a Margin Assumption

arXiv.org Machine Learning Feb-24-2012

We derive and analyze a new, efficient, pool-based active learning algorithm for halfspaces, called ALuMA. Most previous algorithms show exponential improvement in the label complexity assuming that the distribution over the instance space is close to uniform. This assumption rarely holds in practical applications. Instead, we study the label complexity under a large-margin assumption -- a much more realistic condition, as evidenced by the success of margin-based algorithms such as SVM. Our algorithm is computationally efficient and comes with formal guarantees on its label complexity. It also naturally extends to the non-separable case and to non-linear kernels. Experiments illustrate the clear advantage of ALuMA over other active learning algorithms.

algorithm, artificial intelligence, machine learning, (14 more...)

1112.1556

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Cultural Analytics of Large Datasets from Flickr

Ushizima, Daniela (Lawrence Berkeley National Laboratory) | Manovich, Lev (University of California, San Diego) | Margolis, Todd (University of California, San Diego) | Douglas, Jeremy (Ashford University)

Deluge has become a metaphor for the amount of information to which we are subjected, and very often we feel we are drowning even as our access to information rises. Devising mechanisms for exploring massive image sets according to perceptual attributes is still a challenge, even more so when dealing with user-generated social media content. Such images tend to be heterogeneous, and relying on metadata alone can be misleading. This paper describes a set of tools designed to analyze large sets of user-created, art-related images using image features describing color, texture, composition and orientation. The proposed pipeline makes it possible to discriminate Flickr groups in terms of feature vectors and clustering parameters. The algorithms are general enough to be applied to other domains in which the main question is about the variability of the images.

dataset, midstream oil & gas, social media, (17 more...)

Sixth International AAAI Conference on Weblogs and Social Media

Country: North America > United States > California (0.29)

Industry:

Information Technology > Services (0.67)
Energy > Oil & Gas > Midstream (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

The Pulse of News in Social Media: Forecasting Popularity

Bandari, Roja (University of California Los Angeles) | Asur, Sitaram (HP Labs) | Huberman, Bernardo A (HP Labs)

News articles are extremely time sensitive by nature. There is also intense competition among news items to propagate as widely as possible. Hence, the task of predicting the popularity of news items on the social web is both interesting and challenging. Prior research has dealt with predicting eventual online popularity based on early popularity. It is most desirable, however, to predict the popularity of items prior to their release, fostering the possibility of appropriate decision making to modify an article and the manner of its publication. In this paper, we construct a multi-dimensional feature space derived from properties of an article and evaluate the efficacy of these features to serve as predictors of online popularity. We examine both regression and classification algorithms and demonstrate that despite randomness in human behavior, it is possible to predict ranges of popularity on Twitter with an overall 84% accuracy. Our study also serves to illustrate the differences between traditionally prominent sources and those immensely popular on the social web.

Sixth International AAAI Conference on Weblogs and Social Media

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Iran (0.04)

Genre: Research Report (1.00)

Industry:

Media > News (1.00)
Information Technology > Services (0.69)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Social Media and Citizen Engagement in a City-State: A Study of Singapore

Skoric, Marko M. (Nanyang Technological University) | Pan, Ji (Nanyang Technological University) | Poor, Nathaniel D (Independent Scholar)

Social media plays an important role in the process of political engagement, especially in societies where significant constraints over traditional media and participation still exist. Little is known about how social media use is related to these constraints. This study examines how citizens’ perceptions of government control predict social media use and how this use is related to offline participation in the context of a city-state, Singapore. Based on a national survey of 2000 respondents, we found that perceptions of control over traditional media and political activity increase content production on social media and that perceived control of the mass media motivates citizens to consume political content on social media. Interestingly, perceptions of government control over the Internet reduced rather than increased social media production. More importantly, we find that social media use is related to a greater likelihood of offline citizen participation, namely attendance of political rallies. The findings suggest that social media alters the balance of power in the dependency relationships that exist between the government, media organizations and citizens, creating new venues for online political discourse which in turn help promote real-world political participation.

artificial intelligence, machine learning, social media, (15 more...)

Sixth International AAAI Conference on Weblogs and Social Media

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Iran (0.04)
(8 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.66)
Research Report > Experimental Study (0.48)

Industry:

Media > News (1.00)
Government > Voting & Elections (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

Reuter, Timo (CITEC, Universität Bielefeld) | Cimiano, Philipp (CITEC, Universität Bielefeld)

A Systematic Investigation of Blocking Strategies for Real-Time Classification of Social Media Content into Events

Events play a prominent role in our lives, such that many social media documents describe or are related to some event. Organizing social media documents with respect to events thus seems a promising approach to better manage and organize the ever-increasing amount of user-generated content in social media applications. It would support the navigation of data by events or allow one to get notified about new postings related to the events one is interested in, just to name two applications. A challenge is to automate this process so that incoming documents can be assigned to their corresponding event without any user intervention. We present a system that is able to classify a stream of social media data into a growing and evolving set of events. In order to scale up to the data sizes and data rates in social media applications, the use of a candidate retrieval or blocking step is crucial to reduce the number of events that are considered as potential candidates to which the incoming data point could belong. In this paper we present and experimentally compare different blocking strategies along their cost vs. effectiveness tradeoff. We show that using a blocking strategy that selects the 60 closest events with respect to upload time, we reach F-Measures of about 85.1% while being able to process the incoming documents within 32ms on average. We thus provide a principled approach for scaling up the classification of social media documents into events and for processing the incoming stream of documents in real time.
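The time-based blocking step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' system: it assumes each event is summarized by a single representative timestamp, and the `Event` class and the function `block_by_upload_time` are hypothetical names introduced here.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: an id plus one representative timestamp."""
    event_id: str
    timestamp: float  # e.g. seconds since epoch (assumed representation)


def block_by_upload_time(events: List[Event], upload_time: float, k: int = 60) -> List[Event]:
    """Return the k events whose timestamps are closest to the upload time.

    Only these candidates would then be passed to the (more expensive)
    classifier that decides which event the document belongs to.
    """
    return heapq.nsmallest(k, events, key=lambda e: abs(e.timestamp - upload_time))


if __name__ == "__main__":
    events = [Event("a", 0.0), Event("b", 10.0), Event("c", 20.0), Event("d", 30.0)]
    candidates = block_by_upload_time(events, upload_time=12.0, k=2)
    print([e.event_id for e in candidates])
```

The default `k=60` mirrors the candidate count quoted in the abstract. A linear scan is used for simplicity; a sorted index over timestamps would be a natural choice when the event set grows large.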

artificial intelligence, machine learning, social media, (18 more...)

Sixth International AAAI Conference on Weblogs and Social Media

Country:

Europe > Germany (0.04)
North America > United States > Florida > Hillsborough County > Tampa (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Zhang, Xin (Graduate University of Chinese Academy of Sciences) | He, Ben (Graduate University of Chinese Academy of Sciences) | Luo, Tiejian (Graduate University of Chinese Academy of Sciences)

Transductive Learning for Real-Time Twitter Search

Recency is an important dimension of relevance for real-time Twitter search as users tend to be interested in fresh news and events. By incorporating various sources of evidence, the application of learning to rank (LTR) algorithms to real-time Twitter search has proven beneficial in finding not only relevant, but also recent tweets in response to given queries. However, the potential effectiveness brought by LTR may not have been fully exploited due to the lack of labeled data available for properly learning a ranking model, since human labels are expensive in real-world applications. To this end, this paper proposes a transductive algorithm that incrementally aggregates the labeled tweets through an iterative process. Experimental results on the standard Tweets11 dataset show that our approach is able to outperform strong baselines without the use of human labels.

information retrieval, machine learning, natural language, (15 more...)

Sixth International AAAI Conference on Weblogs and Social Media

Country:

North America > United States > New York (0.04)
Asia > China > Beijing > Beijing (0.04)

Industry: Information Technology (0.47)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.31)