Asia
Crowdsourced Data Analytics: A Case Study of a Predictive Modeling Competition
Baba, Yukino (National Institute of Informatics) | Nori, Nozomi (Kyoto University) | Saito, Shigeru (OPT, Inc.) | Kashima, Hisashi (Kyoto University)
Predictive modeling competitions provide a new data mining approach that leverages crowds of data scientists to examine a wide variety of predictive models and build the best performance model. In this paper, we report the results of a study conducted on CrowdSolving, a platform for predictive modeling competitions in Japan. We hosted a competition on a link prediction task and observed that (i) the prediction performance of the winner significantly outperformed that of a state-of-the-art method, (ii) the aggregated model constructed from all submitted models further improved the final performance, and (iii) the performance of the aggregated model built only from early submissions nevertheless overtook the final performance of the winner.
Predicting Own Action: Self-Fulfilling Prophecy Induced by Proper Scoring Rules
Oka, Masaaki (Kyushu University) | Todo, Taiki (Kyushu University) | Sakurai, Yuko (Kyushu University) | Yokoo, Makoto (Kyushu University)
This paper studies a mechanism to incentivize agents who predict their own future actions and truthfully declare their predictions. In a crowdsouring setting (e.g., participatory sensing), obtaining an accurate prediction of the actions of workers/agents is valuable for a requester who is collecting real-world information from the crowd. If an agent predicts an external event that she cannot control herself (e.g., tomorrow's weather), any proper scoring rule can give an accurate incentive. In our problem setting, an agent needs to predict her own action (e.g., what time tomorrow she will take a photo of a specific place) that she can control to maximize her utility. Also, her (gross) utility can vary based on an eternal event. We first prove that a mechanism can satisfy our goal if and only if it utilizes a strictly proper scoring rule, assuming that an agent can find an optimal declaration that maximizes her expected utility. This declaration is self-fulfilling; if she acts to maximize her utility, the probabilistic distribution of her action matches her declaration, assuming her prediction about the external event is correct. Furthermore, we develop a heuristic algorithm that efficiently finds a semi-optimal declaration, and show that this declaration is still self-fulfilling. We also examine our heuristic algorithm's performance and describe how an agent acts when she faces an unexpected scenario.
To Re(label), or Not To Re(label)
Lin, Christopher H. (University of Washington) | Mausam, . (Indian Institute of Technology, Delhi) | Weld, Daniel S (University of Washington)
One of the most popular uses of crowdsourcing is to provide training data for supervised machine learning algorithms. Since human annotators often make errors, requesters commonly ask multiple workers to label each example. ย But is this strategy always the most cost effective use of crowdsourced workers? We argue "No" --- often classifiers can achieve higher accuracies when trained with noisy "unilabeled" data. However, in some cases relabeling is extremely important. ย We discuss three factors that may make relabeling an effective strategy: classifier expressiveness, worker accuracy, and budget.
Crowdsourced Explanations for Humorous Internet Memes Based on Linguistic Theories
Lin, Chi-Chin (National Taiwan University) | Huang, Yi-Ching (National Taiwan University) | Hsu, Jane Yung-jen (National Taiwan University)
Humorous images can be seen in many social media websites. However, newcomers to these websites often have trouble fitting in because the community subculture is usually implicit. Among all the types of humorous images, Internet memes are relatively hard for newcomers to understand. In this work, we develop a system that leverages crowdsourcing techniques to generate explanations for memes. We claim that people who are not familiar with Internet meme subculture can still quickly pick up the gist of the memes by reading the explanations. Our template-based explanations illustrate the incongruity between normal situations and the punchlines in jokes. The explanations can be produced by completing the two proposed human task processes. Experimental results suggest that the explanations produced by our system greatly help newcomers to understand unfamiliar memes. For further research, it is possible to employ our explanation generation system to improve computational humanities.
Instance-Privacy Preserving Crowdsourcing
Kajino, Hiroshi (The University of Tokyo) | Baba, Yukino (National Institute of Informatics) | Kashima, Hisashi (Kyoto University)
Crowdsourcing is a technique to outsource tasks to a number of workers. Although crowdsourcing has many advantages, it gives rise to the risk that sensitive information may be leaked, which has limited the spread of its popularity. Task instances (data workers receive to process tasks) often contain sensitive information, which can be extracted by workers. For example, in an audio transcription task, an audio file corresponds to an instance, and the content of the audio (e.g., the abstract of a meeting) can be sensitive information. In this paper, we propose a quantitative analysis framework for the instance privacy problem. The proposed framework supplies us performance measures of instance privacy preserving protocols. As a case study, we apply the proposed framework to an instance clipping protocol and analyze the properties of the protocol. The protocol preserves privacy by clipping instances to limit the amount of information workers obtain. The results show that the protocol can balance task performance and instance privacy preservation. They also show that the proposed measure is consistent with standard measures, which validates the proposed measure.
TRACCS: A Framework for Trajectory-Aware Coordinated Urban Crowd-Sourcing
Chen, Cen (Singapore Management University) | Cheng, Shih-Fen (Singapore Management University) | Gunawan, Aldy (Singapore Management University) | Misra, Archan (Singapore Management University) | Dasgupta, Koustuv (Xerox Research Centre India) | Chander, Deepthi (Xerox Research Centre India)
We investigate the problem of large-scale mobile crowd-tasking, where a large pool of citizen crowd-workers are used to perform a variety of location-specific urban logistics tasks. Current approaches to such mobile crowd-tasking are very decentralized: a crowd-tasking platform usually provides each worker a set of available tasks close to the worker's current location; each worker then independently chooses which tasks she wants to accept and perform. In contrast, we propose TRACCS, a more coordinated task assignment approach, where the crowd-tasking platform assigns a sequence of tasks to each worker, taking into account their expected location trajectory over a wider time horizon, as opposed to just instantaneous location. We formulate such task assignment as an optimization problem, that seeks to maximize the total payoff from all assigned tasks, subject to a maximum bound on the detour (from the expected path) that a worker will experience to complete her assigned tasks. We develop credible computationally-efficient heuristics to address this optimization problem (whose exact solution requires solving a complex integer linear program), and show, via simulations with realistic topologies and commuting patterns, that a specific heuristic (called Greedy-ILS) increases the fraction of assigned tasks by more than 20%, and reduces the average detour overhead by more than 60%, compared to the current decentralized approach.
Groupsourcing: Distributed Problem Solving Using Social Networks
Chamberlain, Jon (University of Essex)
Crowdsourcing and citizen science have established themselves in the mainstream of research methodology in recent years, employing a variety of methods to solve problems using human computation. An approach described here, termed "groupsourcing", uses social networks to present problems and collect solutions. This paper details a method for archiving social network messages and investigates messages containing an image classification task in the domain of marine biology. In comparison to other methods, groupsourcing offers a high accuracy, data-driven and low cost approach.
Parallel Task Routing for Crowdsourcing
Bragg, Jonathan (University of Washington) | Kolobov, Andrey (Microsoft Research) | Mausam, Mausam (Indian Institute of Technology, Delhi) | Weld, Daniel S. (University of Washington)
An ideal crowdsourcing or citizen-science system would route tasks to the most appropriate workers, but the best assignment is unclear because workers have varying skill, tasks have varying difficulty, and assigning several workers to a single task may significantly improve output quality. This paper defines a space of task routing problems, proves that even the simplest is NP-hard, and develops several approximation algorithms for parallel routing problems. We show that an intuitive class of requesters' utility functions is submodular, which lets us provide iterative methods for dynamically allocating batches of tasks that make near-optimal use of available workers in each round. Experiments with live oDesk workers show that our task routing algorithm uses only 48% of the human labor compared to the commonly used round-robin strategy. Further, we provide versions of our task routing algorithm which enable it to scale to large numbers of workers and questions and to handle workers with variable response times while still providing significant benefit over common baselines.
Partition-wise Linear Models
Oiwa, Hidekazu, Fujimaki, Ryohei
Region-specific linear models are widely used in practical applications because of their non-linear but highly interpretable model representations. One of the key challenges in their use is non-convexity in simultaneous optimization of regions and region-specific models. This paper proposes novel convex region-specific linear models, which we refer to as partition-wise linear models. Our key ideas are 1) assigning linear models not to regions but to partitions (region-specifiers) and representing region-specific linear models by linear combinations of partition-specific models, and 2) optimizing regions via partition selection from a large number of given partition candidates by means of convex structured regularizations. In addition to providing initialization-free globally-optimal solutions, our convex formulation makes it possible to derive a generalization bound and to use such advanced optimization techniques as proximal methods and decomposition of the proximal maps for sparsity-inducing regularizations. Experimental results demonstrate that our partition-wise linear models perform better than or are at least competitive with state-of-the-art region-specific or locally linear models.
Sparse principal component regression with adaptive loading
Kawano, Shuichi, Fujisawa, Hironori, Takada, Toyoyuki, Shiroishi, Toshihiko
Principal component analysis (PCA) (Jolliffe, 2002) is a fundamental statistical tool for dimensionality reduction, data processing, and visualization of multiv ariate data, with various applications in biology, engineering, and social science. In re gression analysis, it can be useful to replace many original explanatory variables with a f ew principal components, which is called the principal component regression (PCR) (Ma ssy, 1965; Jolliffe, 1982). PCR is widely used in various fields of research and many exten sions of PCR have been proposed (see, e.g., Hartnett et al., 1998; Rosital et al., 2001; Reiss and Ogden, 2007; Wang and Abbott, 2008). Whereas PCR is a useful tool for analyzin g multivariate data, this method may not have enough prediction accuracy if the respon se variable depends on the principal components with small eigenvalues. The problem arises from the two-stage procedure for PCR; a few principal components are selected with la rge eigenvalues, but without any relation to response variable, and then the regression model is constructed using them as new explanatory variables. In this paper, we deal with PCA and regression analysis simultaneous ly, and propose a one-stage procedure for PCR to address this problem. The proc edure combines two loss functions; one is the ordinary regression analysis loss and the othe r is PCA loss with some devices proposed by Zou et al. (2006).