Exact Exponent in Optimal Rates for Crowdsourcing

arXiv.org Machine Learning

In many machine learning applications, crowdsourcing has become the primary means for label collection. In this paper, we study the optimal error rate for aggregating labels provided by a set of non-expert workers. Under the classic Dawid-Skene model, we establish matching upper and lower bounds with an exact exponent $mI(\pi)$ in which $m$ is the number of workers and $I(\pi)$ the average Chernoff information that characterizes the workers' collective ability. Such an exact characterization of the error exponent allows us to state a precise sample size requirement $m>\frac{1}{I(\pi)}\log\frac{1}{\epsilon}$ in order to achieve an $\epsilon$ misclassification error. In addition, our results imply the optimality of various EM algorithms for crowdsourcing initialized by consistent estimators.
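As a rough illustration of the sample size requirement $m>\frac{1}{I(\pi)}\log\frac{1}{\epsilon}$, the following is a minimal sketch, not the paper's method. It assumes the one-coin special case, in which worker $i$ labels a binary item correctly with probability $p_i$, so that the per-worker Chernoff information between Bernoulli($p_i$) and Bernoulli($1-p_i$) is $-\log\left(2\sqrt{p_i(1-p_i)}\right)$. The function names are hypothetical, and the bound is applied literally, ignoring any lower-order terms.

```python
import math


def chernoff_information_one_coin(p):
    """Chernoff information between Bernoulli(p) and Bernoulli(1 - p).

    Assumption: one-coin model, where p is a worker's probability of
    giving the correct binary label.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return -math.log(2.0 * math.sqrt(p * (1.0 - p)))


def average_chernoff_information(accuracies):
    """Average the per-worker Chernoff information over all workers."""
    return sum(chernoff_information_one_coin(p) for p in accuracies) / len(accuracies)


def min_workers(info, eps):
    """Smallest integer m satisfying m > (1 / info) * log(1 / eps).

    Assumption: `info` is treated as a fixed value that does not change
    as workers are added (for example, identically skilled workers).
    """
    if info <= 0.0:
        raise ValueError("info must be positive")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.floor(math.log(1.0 / eps) / info) + 1


if __name__ == "__main__":
    info = average_chernoff_information([0.8, 0.8, 0.8])
    print(info, min_workers(info, 0.01))
```

Under these assumptions, workers who are each correct 80% of the time contribute about 0.223 nats apiece, so a target error of $\epsilon = 0.01$ calls for 21 workers.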


Predicting Appropriate Semantic Web Terms from Words

AAAI Conferences

The Semantic Web language RDF was designed to unambiguously define and use ontologies to encode data and knowledge on the Web. Many people find it difficult, however, to write complex RDF statements and queries because doing so requires familiarity with the appropriate ontologies and the terms they define. We describe a system that suggests appropriate RDF terms given semantically related English words and general domain and context information. We use the Swoogle Semantic Web search engine to provide RDF term and namespace statistics, the WordNet lexical ontology to find semantically related words, and a naïve Bayes classifier to suggest terms. A customized graph data structure of related namespaces is constructed from Swoogle's database to speed up the classifier model learning and prediction time.


Eliciting Categorical Data for Optimal Aggregation

Neural Information Processing Systems

Models for collecting and aggregating categorical data on crowdsourcing platforms typically fall into two broad categories: those that assume agents are honest and consistent but have heterogeneous error rates, and those that assume agents are strategic and seek to maximize their expected reward. The former often leads to tractable aggregation of elicited data, while the latter usually focuses on optimal elicitation and does not consider aggregation. In this paper, we develop a Bayesian model, wherein agents have differing quality of information, but also respond to incentives. Our model generalizes both categories and enables the joint exploration of optimal elicitation and aggregation. This model enables our exploration, both analytically and experimentally, of optimal aggregation of categorical data and optimal multiple-choice interface design.


Framework and Schema for Semantic Web Knowledge Bases

AAAI Conferences

There is a growing need for scalable semantic web repositories which support inference and provide efficient queries. There is also a growing interest in representing uncertain knowledge in semantic web datasets and ontologies. In this paper, I present a bit vector schema specifically designed for RDF (Resource Description Framework) datasets. I propose a system for materializing and storing inferred knowledge using this schema. I show experimental results that demonstrate that this solution simplifies inference queries and drastically improves results. I also propose and describe a solution for materializing and persisting uncertain information and probabilities. Thresholds and bit vectors are used to provide efficient query access to this uncertain knowledge. My goal is to provide a semantic web repository that supports knowledge inference, uncertainty reasoning, and Bayesian networks, without sacrificing performance or scalability.


Novel Sensor Scheduling Scheme for Intruder Tracking in Energy Efficient Sensor Networks

arXiv.org Artificial Intelligence

We consider the problem of tracking an intruder using a network of wireless sensors. For tracking the intruder at each instant, the optimal number and the right configuration of sensors have to be powered. As powering the sensors consumes energy, there is a trade-off between accurately tracking the position of the intruder at each instant and the energy consumption of sensors. This problem has been formulated in the framework of Partially Observable Markov Decision Process (POMDP). Even for the simplest model considered in [1], the curse of dimensionality renders the problem intractable. We formulate this problem with a suitable state-action space in the framework of POMDP and develop a reinforcement learning algorithm utilising the Upper Confidence Tree Search (UCT) method to mitigate the state-action space explosion. Through simulations, we illustrate that our algorithm scales well with the increasing state and action space.
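For readers unfamiliar with UCT, its core step is the UCB1 rule used to pick an action at each tree node. The following is a minimal sketch of that standard rule only, not the authors' scheduling algorithm. The function name, the list-based statistics, and the exploration constant `c` are assumptions for illustration; the abstract does not specify the rewards or constants used.

```python
import math


def ucb1_select(visit_counts, total_rewards, c=math.sqrt(2)):
    """Return the index of the action with the highest UCB1 score.

    visit_counts[a]  -- how many times action a was tried at this node
    total_rewards[a] -- sum of rewards observed after taking action a
    c                -- exploration constant (hypothetical default)
    """
    # Try every action once before the formula is applied.
    for action, count in enumerate(visit_counts):
        if count == 0:
            return action

    parent_visits = sum(visit_counts)

    def score(action):
        mean = total_rewards[action] / visit_counts[action]
        bonus = c * math.sqrt(math.log(parent_visits) / visit_counts[action])
        return mean + bonus

    return max(range(len(visit_counts)), key=score)
```

The bonus term shrinks as an action is visited more often, so rarely tried actions (here, sensor configurations) are still explored while well-performing ones are favoured.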