Collaborating Authors: Scott, Mary


Towards Robust Federated Analytics via Differentially Private Measurements of Statistical Heterogeneity

arXiv.org Artificial Intelligence

Statistical heterogeneity measures how skewed the samples of a dataset are. A common problem in the study of differential privacy is that using a statistically heterogeneous dataset results in a significant loss of accuracy. In federated scenarios, statistical heterogeneity is more likely to arise, which makes this problem even more pressing. We explore the three most promising ways to measure statistical heterogeneity and give formulae for their accuracy, while simultaneously incorporating differential privacy. We find the optimal privacy parameters via an analytic mechanism that incorporates root-finding methods. We validate the main theorems and related hypotheses experimentally, and test the robustness of the analytic mechanism under different levels of heterogeneity. The analytic mechanism in a distributed setting delivers superior accuracy to all combinations involving the classic mechanism and/or the centralized setting. None of the measures of statistical heterogeneity loses significant accuracy when a heterogeneous sample is used.
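The abstract contrasts a "classic" mechanism with an "analytic" mechanism whose noise level is found by root finding. A minimal sketch of that idea, assuming the analytic mechanism is in the style of analytic Gaussian calibration (Balle and Wang, 2018), where bisection finds the smallest noise scale sigma whose exact privacy loss meets a (epsilon, delta) target; all function names here are illustrative, not the paper's API:

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_delta(sigma, eps, sens):
    """Exact delta achieved by the Gaussian mechanism at scale sigma
    (the analytic expression, rather than the classic tail bound)."""
    a = sens / (2.0 * sigma)
    b = eps * sigma / sens
    return phi(a - b) - math.exp(eps) * phi(-a - b)

def calibrate_sigma(eps, delta, sens, lo=1e-6, hi=1e6, iters=100):
    """Bisection (a simple root-finding method) for the smallest sigma
    with gaussian_delta(sigma) <= delta; delta is decreasing in sigma."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, eps, sens) > delta:
            lo = mid  # too little noise: privacy loss still too large
        else:
            hi = mid
    return hi

def analytic_gaussian(value, eps, delta, sens, rng=random):
    """Release a statistic with analytically calibrated Gaussian noise."""
    return value + rng.gauss(0.0, calibrate_sigma(eps, delta, sens))
```

For epsilon = 1 and delta = 1e-5 the calibrated sigma comes out noticeably below the classic bound sens * sqrt(2 ln(1.25/delta)) / epsilon, which is one reason an analytic mechanism can deliver superior accuracy.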


Distributed, communication-efficient, and differentially private estimation of KL divergence

arXiv.org Artificial Intelligence

Modern applications in data analysis and machine learning work with high-dimensional data to support inferences and provide recommendations [1, 2]. Increasingly, the data to support these tasks comes from individuals who hold their data on personal devices such as smartphones and wearables. In the federated model of computation [3, 4], this data remains on the users' devices, which collaborate and cooperate to build accurate models by performing computations and aggregations on their locally held information (e.g., training and fine-tuning small-scale models). A key primitive needed is the ability to compare the distribution of data held by these clients with a reference distribution. For instance, a platform or a service provider would like to know whether the overall behavior of the data is consistent over time, in order to deploy the best-fitting and most relevant model. In cases where the data distribution has changed, it may be necessary to trigger model rebuilding or fine-tuning, whereas if there is no change the current model can continue to be used.
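The primitive described here, comparing client data against a reference distribution under differential privacy, can be sketched with a plug-in KL estimator over a Laplace-noised histogram. This is a toy single-client illustration under assumed design choices (discrete domain, L1 sensitivity 2 for one changed sample), not the paper's actual distributed protocol:

```python
import math
import random
from collections import Counter

def laplace(scale, rng=random):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_kl_estimate(samples, reference, eps, rng=random):
    """Plug-in estimate of KL(empirical || reference) from a noised histogram.

    `reference` maps each domain item to its probability. Changing one
    sample moves two counts by 1 each, so the L1 sensitivity is 2 and
    the Laplace scale is 2/eps. Noisy counts are floored at a small
    positive value so the log stays defined, then renormalized.
    """
    counts = Counter(samples)
    noisy = {x: max(counts.get(x, 0) + laplace(2.0 / eps, rng), 1e-6)
             for x in reference}
    total = sum(noisy.values())
    p = {x: noisy[x] / total for x in reference}
    return sum(p[x] * math.log(p[x] / reference[x]) for x in reference)
```

With a large privacy budget the noise is negligible and the estimate approaches the true divergence; shrinking eps trades accuracy for privacy, which is the tension a communication-efficient distributed estimator has to manage.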


A robust synthetic data generation framework for machine learning in High-Resolution Transmission Electron Microscopy (HRTEM)

arXiv.org Artificial Intelligence

Machine learning techniques are attractive options for developing highly accurate automated analysis tools for nanomaterials characterization, including high-resolution transmission electron microscopy (HRTEM). However, successfully implementing such machine learning tools can be difficult due to the challenges of procuring sufficiently large, high-quality training datasets from experiments. In this work, we introduce Construction Zone, a Python package for rapidly generating complex nanoscale atomic structures, and develop an end-to-end workflow for creating large simulated databases for training neural networks. Construction Zone enables fast, systematic sampling of realistic nanomaterial structures and can be used as a random structure generator for simulated databases, which is important for generating large, diverse synthetic datasets. Using HRTEM imaging as an example, we train a series of neural networks on various subsets of our simulated databases to segment nanoparticles, and we holistically study the data curation process to understand how various aspects of the curated simulated data -- including simulation fidelity, the distribution of atomic structures, and the distribution of imaging conditions -- affect model performance across several experimental benchmarks. Using these results, we achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles, and we discuss robust strategies for consistently achieving high performance with machine learning in experimental settings using purely synthetic data.