
On Distributed Differential Privacy and Counting Distinct Elements Machine Learning

We study the setup where each of $n$ users holds an element from a discrete set, and the goal is to count the number of distinct elements across all users, under the constraint of $(\epsilon, \delta)$-differential privacy:

- In the non-interactive local setting, we prove that the additive error of any protocol is $\Omega(n)$ for any constant $\epsilon$ and for any $\delta$ inverse polynomial in $n$.
- In the single-message shuffle setting, we prove a lower bound of $\Omega(n)$ on the error for any constant $\epsilon$ and for some $\delta$ inverse quasi-polynomial in $n$. We do so by building on the moment-matching method from the literature on distribution estimation.
- In the multi-message shuffle setting, we give a protocol with at most one message per user in expectation and with an error of $\tilde{O}(\sqrt{n})$ for any constant $\epsilon$ and for any $\delta$ inverse polynomial in $n$. Our protocol is also robustly shuffle private, and our error of $\sqrt{n}$ matches a known lower bound for such protocols.

Our proof technique relies on a new notion, which we call dominated protocols, and which can also be used to obtain the first non-trivial lower bounds against multi-message shuffle protocols for the well-studied problems of selection and learning parity. Our first lower bound for estimating the number of distinct elements provides the first $\omega(\sqrt{n})$ separation between global sensitivity and error in local differential privacy, thus answering an open question of Vadhan (2017). We also provide a simple construction that gives an $\tilde{\Omega}(n)$ separation between global sensitivity and error in two-party differential privacy, thereby answering an open question of McGregor et al. (2011).
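
As a point of reference for the sensitivity-versus-error comparison in this abstract, the following is a minimal sketch of the standard central-model baseline. It is not one of the paper's local or shuffle protocols. It assumes a trusted curator who sees every element: changing one user's element changes the distinct count by at most 1, so Laplace noise of scale $1/\epsilon$ gives $\epsilon$-differential privacy. The function names are hypothetical, and the floating-point sampler is illustrative rather than hardened for production use.

```python
import random
from typing import Hashable, Iterable, Optional


def distinct_count(elements: Iterable[Hashable]) -> int:
    """Exact number of distinct elements held across all users."""
    return len(set(elements))


def noisy_distinct_count(
    elements: Iterable[Hashable],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Central-model estimate: exact count plus Laplace(1/epsilon) noise.

    Assumption: a trusted curator computes this. The global sensitivity
    of the distinct count is 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    # The difference of two i.i.d. Exponential(rate=epsilon) variables
    # is Laplace-distributed with scale 1 / epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return distinct_count(elements) + noise
```

In this trusted-curator sketch the expected error is on the order of $1/\epsilon$ and does not grow with $n$, which is the contrast with the $\Omega(n)$ local-model lower bound stated above.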

Do Embedding Models Perform Well for Knowledge Base Completion? Machine Learning

In this work, we call into question the effectiveness of the evaluation methods currently used to measure the performance of latent factor models for the task of knowledge base completion. We argue that by focusing on a small subset of possible facts in the knowledge base, current evaluation practices are better suited to question answering tasks than to knowledge base completion, where it is also important to avoid adding incorrect facts to the knowledge base. We illustrate our point by showing how models with limited expressiveness achieve state-of-the-art performance, even while adding many incorrect (even nonsensical) facts to a knowledge base. Finally, we show that under a simple evaluation procedure designed to also penalize the addition of incorrect facts, the general and relative performance of all models looks very different from what has previously been seen. This indicates the need for more powerful latent factor models for the task of knowledge base completion.

Privacy-preserving Traffic Flow Prediction: A Federated Learning Approach Machine Learning

Existing deep learning approaches to traffic flow forecasting achieve excellent results by relying on large volumes of data gathered by governments and organizations. However, these datasets may contain a great deal of users' private data, which challenges current prediction approaches now that user privacy has become a matter of public concern. Developing accurate traffic prediction while preserving privacy is therefore a significant open problem, and there is a trade-off between these two objectives. To address this challenge, we introduce a privacy-preserving machine learning technique named federated learning and propose a Federated Learning-based Gated Recurrent Unit neural network algorithm (FedGRU) for traffic flow prediction. FedGRU differs from current centralized learning methods in that it updates a universal learning model through a secure parameter aggregation mechanism rather than by directly sharing raw data among organizations. Within the secure parameter aggregation mechanism, we adopt the Federated Averaging algorithm to reduce the communication overhead of transmitting model parameters. Furthermore, we design a Joint Announcement Protocol to improve the scalability of FedGRU. We also propose an ensemble clustering-based scheme for traffic flow prediction that groups the organizations into clusters before applying the FedGRU algorithm. Extensive case studies on a real-world dataset show that FedGRU's prediction accuracy is 90.96% higher than the advanced deep learning models, which confirms that FedGRU can achieve accurate and timely traffic prediction without compromising the privacy and security of raw data.
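
The aggregation step of Federated Averaging mentioned in this abstract can be sketched as follows. This is a minimal illustration of the generic weighted-averaging rule only, assuming each organization reports a flat parameter vector and its local sample count. It does not model the secure aggregation mechanism, the GRU itself, or the Joint Announcement Protocol, and the function name is hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_params: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts.

    Assumption: every client holds the same model architecture, so all
    parameter vectors have the same length.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_params[0])
    averaged = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        if len(params) != dim:
            raise ValueError("parameter vectors differ in length")
        weight = size / total  # clients with more data count for more
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged
```

Only the parameter vectors and sample counts leave each organization in this sketch; the raw traffic records stay local, which is the property the abstract relies on.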

A Protocol for a Market of Machine Learning Models


This paper describes a protocol for a market of machine learning models. The economic interaction involves two types of agents: data providers, agents that have some data and want to use it to obtain a predictive model, and model providers, agents able to use the data to generate predictive models. First, we show that the process is informationally asymmetric, so a standard direct market cannot function. The protocol is analysed theoretically to establish its correctness and computational complexity. We also propose a simple reference implementation based on an HTTP API.

A Novel Training Protocol for Performance Predictors of Evolutionary Neural Architecture Search Algorithms Artificial Intelligence

Evolutionary Neural Architecture Search (ENAS) can automatically design the architectures of Deep Neural Networks (DNNs) using evolutionary computation algorithms. However, most ENAS algorithms require intensive computational resources, which are not necessarily available to interested users. Performance predictors are a type of regression model that can assist the search without consuming much computational resource. Although various performance predictors have been designed, they all employ the same training protocol to build the regression models: 1) sampling a set of DNNs with their performance as the training dataset, 2) training the model with the mean square error criterion, and 3) predicting the performance of DNNs newly generated during the ENAS. In this paper, we show through intuitive and illustrative examples that the three steps constituting this training protocol are not well thought out. Furthermore, we propose a new training protocol to address these issues, consisting of designing a pairwise ranking indicator to construct the training target, using logistic regression to fit the training samples, and developing a differential method to build the training instances. To verify the effectiveness of the proposed training protocol, four regression models widely used in machine learning were chosen for comparison on two benchmark datasets. The experimental results of all the comparisons demonstrate that the proposed training protocol can significantly improve performance prediction accuracy over the traditional training protocol.
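
One way to read the three ingredients of the proposed protocol is sketched below. This is an interpretation under stated assumptions, not the authors' implementation: the "differential method" is assumed to mean using the difference of two architecture encodings as the feature vector, the pairwise ranking indicator is assumed to be 1 when the first architecture outperforms the second, and the logistic regression is fitted by plain gradient descent. All function names and hyperparameters are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_pairwise_instances(
    encodings: Sequence[Sequence[float]], scores: Sequence[float]
) -> Tuple[List[List[float]], List[int]]:
    """Turn (encoding, score) samples into (difference, indicator) pairs."""
    features, labels = [], []
    for i, j in permutations(range(len(encodings)), 2):
        if scores[i] == scores[j]:
            continue  # ties carry no ranking information
        features.append([a - b for a, b in zip(encodings[i], encodings[j])])
        labels.append(1 if scores[i] > scores[j] else 0)
    return features, labels


def fit_logistic(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 200,
) -> List[float]:
    """Fit logistic regression weights (no bias term) by gradient descent."""
    weights = [0.0] * len(features[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            for k, v in enumerate(x):
                grad[k] += (p - y) * v
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad)]
    return weights


def prob_first_better(
    weights: Sequence[float], enc_a: Sequence[float], enc_b: Sequence[float]
) -> float:
    """Predicted probability that architecture a outperforms architecture b."""
    z = sum(w * (a - b) for w, a, b in zip(weights, enc_a, enc_b))
    return sigmoid(z)
```

During the search, a predictor of this kind would be used to compare candidate architectures against each other rather than to estimate their absolute accuracy, which is what an evolutionary selection step needs.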