Latent truth discovery, LTD for short, refers to the problem of aggregating ltiple claims from various sources in order to estimate the plausibility of atements about entities. In the absence of a ground truth, this problem is highly challenging, when some sources provide conflicting claims and others no claims at all. In this work we provide an unsupervised stochastic inference procedure on top of a model that combines restricted Boltzmann machines with feed-forward neural networks to accurately infer the reliability of sources as well as the plausibility of statements about entities. In comparison to prior work our approach stands out (1) by allowing the incorporation of arbitrary features about sources and claims, (2) by generalizing from reliability per source towards a reliability function, and thus (3) enabling the estimation of source reliability even for sources that have provided no or very few claims, (4) by building on efficient and scalable stochastic inference algorithms, and (5) by outperforming the state-of-the-art by a considerable margin.
Twitter has emerged as a new application paradigm of sensing the physical environment by using human as sensors. These human sensed observations are often viewed as binary claims (either true or false). A fundamental challenge on Twitter is how to ascertain the credibility of claims and the reliability of sources without the prior knowledge on either of them beforehand. This challenge is referred to as truth discovery. An important limitation exists in the current Twitter-based truth discovery solutions: they did not explore the theme relevance aspect of claims and the correct claims identified by their solutions can be completely irrelevant to the theme of interests. In this paper, we present a new analytical model that explicitly considers the theme relevance feature of claims in the solutions of truth discovery problem on Twitter. The new model solves a bi-dimensional estimation problem to jointly estimate the correctness and theme relevance of claims as well as the reliability and theme awareness of sources. The new model is compared with the discovery solutions in current literature using three real world datasets collected from Twitter during recent disastrous and emergent events: Paris attack, Oregon shooting, and Baltimore riots, all in 2015. The new model was shown to be effective in terms of finding both correct and relevant claims.
This paper investigates differentially private analysis of distance-based outliers. The problem of outlier detection is to find a small number of instances that are apparently distant from the remaining instances. On the other hand, the objective of differential privacy is to conceal presence (or absence) of any particular instance. Outlier detection and privacy protection are thus intrinsically conflicting tasks. In this paper, instead of reporting outliers detected, we present two types of differentially private queries that help to understand behavior of outliers. One is the query to count outliers, which reports the number of outliers that appear in a given subspace. Our formal analysis on the exact global sensitivity of outlier counts reveals that regular global sensitivity based method can make the outputs too noisy, particularly when the dimensionality of the given subspace is high. Noting that the counts of outliers are typically expected to be relatively small compared to the number of data, we introduce a mechanism based on the smooth upper bound of the local sensitivity. The other is the query to discovery top-$h$ subspaces containing a large number of outliers. This task can be naively achieved by issuing count queries to each subspace in turn. However, the variation of subspaces can grow exponentially in the data dimensionality. This can cause serious consumption of the privacy budget. For this task, we propose an exponential mechanism with a customized score function for subspace discovery. To the best of our knowledge, this study is the first trial to ensure differential privacy for distance-based outlier analysis. We demonstrated our methods with synthesized datasets and real datasets. The experimental results show that out method achieve better utility compared to the global sensitivity based methods.
In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples --- for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges.
Differential privacy, a notion of algorithmic stability, is a gold standard for measuring the additional risk an algorithm's output poses to the privacy of a single record in the dataset. Differential privacy is defined as the distance between the output distribution of an algorithm on neighboring datasets that differ in one entry. In this work, we present a novel relaxation of differential privacy, capacity bounded differential privacy, where the adversary that distinguishes output distributions is assumed to be capacity-bounded -- i.e. bounded not in computational power, but in terms of the function class from which their attack algorithm is drawn. We model adversaries in terms of restricted f-divergences between probability distributions, and study properties of the definition and algorithms that satisfy them.