Generating datasets that "look like" given real ones is an interesting tasks for healthcare applications of ML and many other fields of science and engineering. In this paper we propose a new method of general application to binary datasets based on a method for learning the parameters of a latent variable moment that we have previously used for clustering patient datasets. We compare our method with a recent proposal (MedGan) based on generative adversarial methods and find that the synthetic datasets we generate are globally more realistic in at least two senses: real and synthetic instances are harder to tell apart by Random Forests, and the MMD statistic. The most likely explanation is that our method does not suffer from the "mode collapse" which is an admitted problem of GANs. Additionally, the generative models we generate are easy to interpret, unlike the rather obscure GANs. Our experiments are performed on two patient datasets containing ICD-9 diagnostic codes: the publicly available MIMIC-III dataset and a dataset containing admissions for congestive heart failure during 7 years at Hospital de Sant Pau in Barcelona.
Releasing full data records is one of the most challenging problems in data privacy. On the one hand, many of the popular techniques such as data de-identification are problematic because of their dependence on the background knowledge of adversaries. On the other hand, rigorous methods such as the exponential mechanism for differential privacy are often computationally impractical to use for releasing high dimensional data or cannot preserve high utility of original data due to their extensive data perturbation. This paper presents a criterion called plausible deniability that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain amount of input records are indistinguishable, up to a privacy parameter. This notion does not depend on the background knowledge of an adversary. Also, it can efficiently be checked by privacy tests. We present mechanisms to generate synthetic datasets with similar statistical properties to the input data and the same format. We study this technique both theoretically and experimentally. A key theoretical result shows that, with proper randomization, the plausible deniability mechanism generates differentially private synthetic data. We demonstrate the efficiency of this generative technique on a large dataset; it is shown to preserve the utility of original data with respect to various statistical analysis and machine learning measures.
Data is the new oil and truth be told only a few big players have the strongest hold on that currency. Googles and Facebooks of this world are so generous with their latest machine learning algorithms and packages (they give those away freely) because the entry barrier to the world of algorithms is pretty low right now. Open source has come a long way from being christened evil by the likes of Steve Ballmer to being an integral part of Microsoft. And plenty of open source initiatives are propelling the vehicles of data science, digital analytics, and machine learning. Standing in 2018 we can safely say that, algorithm, programming frameworks, and machine learning packages (or even tutorials and courses how to learn these techniques) are not the scarce resource but high-quality data is.
Much has been said about how big data will help solve many of the world's thorniest problems, including pandemics, hunger, cancer treatments, and conservation. However, because of the seriousness of the problems, and complexity of big data and its analysis, a great deal of testing is required before any results can be considered trustworthy. Unfortunately, most businesses and organizations do not have the in-house capability to achieve any semblance of trust. Thus, the normal procedure has been to outsource the work to third-party vendors. The operative phrase is "has been."
Corporate machine learning research may be getting a new vanguard in Apple. Six researchers from the company's recently formed machine learning group published a paper that describes a novel method for simulated unsupervised learning. The aim is to improve the quality of synthetic training images. The work is a sign of the company's aspirations to become a more visible leader in the ever growing field of AI. Google, Facebook, Microsoft and the rest of the techstablishment have been steadily growing their machine learning research groups.