A Convex Formulation for Learning from Crowds

AAAI Conferences

Recently crowdsourcing services are often used to collect a large amount of labeled data for machine learning, since they provide us an easy way to get labels at very low cost and in a short period. The use of crowdsourcing has introduced a new challenge in machine learning, that is, coping with the variable quality of crowd-generated data. Although there have been many recent attempts to address the quality problem of multiple workers, only a few of the existing methods consider the problem of learning classifiers directly from such noisy data. All these methods modeled the true labels as latent variables, which resulted in non-convex optimization problems. In this paper, we propose a convex optimization formulation for learning from crowds without estimating the true labels by introducing personal models of the individual crowd workers. We also devise an efficient iterative method for solving the convex optimization problems by exploiting conditional independence structures in multiple classifiers. We evaluate the proposed method against three competing methods on synthetic data sets and a real crowdsourced data set and demonstrate that the proposed method outperforms the other three methods.


Predictive Mining of Comparable Entities from the Web

AAAI Conferences

Comparing entities is an important part of decision making. Several approaches have been reported for mining comparable entities from Web sources to improve user experience in comparing entities online.However, these efforts extract only entities explicitly compared in the corpora, and may exclude entities that occur less-frequently but potentially comparable. To build a more complete comparison machine that can infer such missing relations, here we develop a solutionto predict transitivity of known comparable relations. Named CliqueGrow, our approach predicts missing links given a comparable entity graph obtained from versus query logs. Our approach achieved the highest F1-score among five link prediction approaches and a commercial comparison engine provided by Yahoo!.


ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback

AAAI Conferences

During broadcast events such as the Superbowl, the U.S. Presidential and Primary debates, etc., Twitter has become the de facto platform for crowds to share perspectives and commentaries about them. Given an event and an associated large-scale collection of tweets, there are two fundamental research problems that have been receiving increasing attention in recent years. One is to extract the topics covered by the event and the tweets; the other is to segment the event. So far these problems have been viewed separately and studied in isolation. In this work, we argue that these problems are in fact inter-dependent and should be addressed together. We develop a joint Bayesian model that performs topic modeling and event segmentation in one unified framework. We evaluate the proposed model both quantitatively and qualitatively on two large-scale tweet datasets associated with two events from different domains to show that it improves significantly over baseline models.


Music-Inspired Texture Representation

AAAI Conferences

Techniques for music recommendation are increasingly relying on hybrid representations to retrieve new and exciting music. A key component of these representations is musical content, with texture being the most widely used feature. Current techniques for representing texture however are inspired by speech, not music, therefore music representations are not capturing the correct nature of musical texture. In this paper we investigate two parts of the well-established mel-frequency cepstral coefficients (MFCC) representation: the resolution of mel-frequencies related to the resolution of musical notes; and how best to describe the shape of texture. Through contextualizing these parts, and their relationship to music, a novel music-inspired texture representation is developed. We evaluate this new texture representation by applying it to the task of music recommendation. We use the representation to build three recommendation models, based on current state-of-the-art methods. Our results show that by understanding two key parts of texture representation, it is possible to achieve a significant recommendation improvement. This contribution of a music-inspired texture representation will not only improve content-based representation, but will allow hybrid systems to take advantage of a stronger content component.


Online Task Assignment in Crowdsourcing Markets

AAAI Conferences

We explore the problem of assigning heterogeneous tasks to workers with different, unknown skill sets in crowdsourcing markets such as Amazon Mechanical Turk. We first formalize the online task assignment problem, in which a requester has a fixed set of tasks and a budget that specifies how many times he would like each task completed. Workers arrive one at a time (with the same worker potentially arriving multiple times), and must be assigned to a task upon arrival. The goal is to allocate workers to tasks in a way that maximizes the total benefit that the requester obtains from the completed work. Inspired by recent research on the online adwords problem, we present a two-phase exploration-exploitation assignment algorithm and prove that it is competitive with respect to the optimal offline algorithm which has access to the unknown skill levels of each worker. We empirically evaluate this algorithm using data collected on Mechanical Turk and show that it performs better than random assignment or greedy algorithms. To our knowledge, this is the first work to extend the online primal-dual technique used in the online adwords problem to a scenario with unknown parameters, and the first to offer an empirical validation of an online primal-dual algorithm.


Quality Expectation-Variance Tradeoffs in Crowdsourcing Contests

AAAI Conferences

We examine designs for crowdsourcing contests, where participants compete for rewards given to superior solutions of a task. We theoretically analyze tradeoffs between the expectation and variance of the principal's utility (i.e. the best solution's quality), and empirically test our theoretical predictions using a controlled experiment on Amazon Mechanical Turk. Our evaluation method is also crowdsourcing based and relies on the peer prediction mechanism. Our theoretical analysis shows an expectation-variance tradeoff of the principal's utility in such contests through a Pareto efficient frontier. In particular, we show that the simple contest with 2 authors and the 2-pair contest have good theoretical properties. In contrast, our empirical results show that the 2-pair contest is the superior design among all designs tested, achieving the highest expectation and lowest variance of the principal's utility.


Querying Linked Ontological Data through Distributed Summarization

AAAI Conferences

As the semantic web expands, ontological data becomes distributed over a large network of data sources on the Web. Consequently, evaluating queries that aim to tap into this distributed semantic database necessitates the ability to consult multiple data sources efficiently. In this paper, we propose methods and heuristics to efficiently query distributed ontological data based on a series of properties of summarized data. In our approach, each source summarizes its data as another RDF graph, and relevant section of these summaries are merged and analyzed at query evaluation time. We show how the analysis of these summaries enables more efficient source selection, query pruning and transformation of expensive distributed joins into local joins.


CCE: A Coupled Framework of Clustering Ensembles

AAAI Conferences

Clustering ensemble mainly relies on the pairwise similarity to capture the consensus function. However, it usually considers each base clustering independently, and treats the similarity measure roughly with either 0 or 1. To address these two issues, we propose a coupled framework of clustering ensembles CCE, and exemplify it with the coupled version CCSPA for CSPA. Experiments demonstrate the superiority of CCSPA over baseline approaches in terms of the clustering accuracy.



Frugal Coordinate Descent for Large-Scale NNLS

AAAI Conferences

The Nonnegative Least Squares (NNLS) formulation arises in many important regression problems. We present a novel coordinate descent method which differs from previous approaches in that we do not explicitly maintain complete gradient information. Empirical evidence shows that our approach outperforms a state-of-the-art NNLS solver in computation time for calculating radiation dosage for cancer treatment problems.