Synthetic data, privacy, and the law


Machine learning can synthesize "almost-but-not-quite replica data" based on real data, facilitating research and data sharing while protecting the privacy of the real data, but inconsistent data protection laws can stymie the use of this approach. Removing key information from data can enhance privacy, but doing so limits data utility and fuels an arms race between deidentification and reidentification. Instead, a generative adversarial network can synthesize data that mimic a protected dataset for analytical purposes but are less likely to reveal any actual private information. Bellovin et al. recommend amendments to privacy statutes that are often too absolute and fail to recognize the protections and analytical potential of this approach.