privacy-preserving machine learning
Conformal Prediction for Privacy-Preserving Machine Learning
Balinsky, Alexander David, Krzeminski, Dominik, Balinsky, Alexander
We investigate the integration of Conformal Prediction (CP) with supervised learning on deterministically encrypted data, aiming to bridge the gap between rigorous uncertainty quantification and privacy-preserving machine learning. Using AES-encrypted variants of the MNIST dataset, we demonstrate that CP methods remain effective even when applied directly in the encrypted domain, owing to the preservation of data exchangeability under fixed-key encryption. We test traditional $p$-value-based against $e$-value-based conformal predictors. Our empirical evaluation reveals that models trained on deterministically encrypted data retain the ability to extract meaningful structure, achieving 36.88\% test accuracy -- significantly above random guessing (9.56\%) observed with per-instance encryption. Moreover, $e$-value-based CP achieves predictive set coverage of over 60\% with 4.3 loss-threshold calibration, correctly capturing the true label in 4888 out of 5000 test cases. In contrast, the $p$-value-based CP yields smaller predictive sets but with reduced coverage accuracy. These findings highlight both the promise and limitations of CP in encrypted data settings and underscore critical trade-offs between prediction set compactness and reliability. %Our work sets a foundation for principled uncertainty quantification in secure, privacy-aware learning systems.
Benchmarking Federated Machine Unlearning methods for Tabular Data
Xiao, Chenguang, Ghosh, Abhirup, Wu, Han, Wang, Shuo, van Thiel, Diederick
Machine unlearning, which enables a model to forget specific data upon request, is increasingly relevant in the era of privacy-centric machine learning, particularly within federated learning (FL) environments. This paper presents a pioneering study on benchmarking machine unlearning methods within a federated setting for tabular data, addressing the unique challenges posed by cross-silo FL where data privacy and communication efficiency are paramount. We explore unlearning at the feature and instance levels, employing both machine learning, random forest and logistic regression models. Our methodology benchmarks various unlearning algorithms, including fine-tuning and gradient-based approaches, across multiple datasets, with metrics focused on fidelity, certifiability, and computational efficiency. Experiments demonstrate that while fidelity remains high across methods, tree-based models excel in certifiability, ensuring exact unlearning, whereas gradient-based methods show improved computational efficiency. This study provides critical insights into the design and selection of unlearning algorithms tailored to the FL environment, offering a foundation for further research in privacy-preserving machine learning.
Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine Learning
Lange, Lucas, Heykeroth, Maurice-Maximilian, Rahm, Erhard
Machine Learning (ML) is crucial in many sectors, including computer vision. However, ML models trained on sensitive data face security challenges, as they can be attacked and leak information. Privacy-Preserving Machine Learning (PPML) addresses this by using Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. Through analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase vulnerability in minority classes, but DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while high entropy or low Fisher Discriminant Ratio (FDR) datasets deteriorate the utility-privacy trade-off. These insights offer valuable guidance for practitioners and researchers in estimating and optimizing the utility-privacy trade-off in image datasets, helping to inform data and privacy modifications for better outcomes based on dataset characteristics.
Robust Representation Learning for Privacy-Preserving Machine Learning: A Multi-Objective Autoencoder Approach
Ouaari, Sofiane, รnal, Ali Burak, Akgรผn, Mete, Pfeifer, Nico
Several domains increasingly rely on machine learning in their applications. The resulting heavy dependence on data has led to the emergence of various laws and regulations around data ethics and privacy and growing awareness of the need for privacy-preserving machine learning (ppML). Current ppML techniques utilize methods that are either purely based on cryptography, such as homomorphic encryption, or that introduce noise into the input, such as differential privacy. The main criticism given to those techniques is the fact that they either are too slow or they trade off a model s performance for improved confidentiality. To address this performance reduction, we aim to leverage robust representation learning as a way of encoding our data while optimizing the privacy-utility trade-off. Our method centers on training autoencoders in a multi-objective manner and then concatenating the latent and learned features from the encoding part as the encoded form of our data. Such a deep learning-powered encoding can then safely be sent to a third party for intensive training and hyperparameter tuning. With our proposed framework, we can share our data and use third party tools without being under the threat of revealing its original form. We empirically validate our results on unimodal and multimodal settings, the latter following a vertical splitting system and show improved performance over state-of-the-art.
Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings
Quintero-Ossa, Ana Marรญa, Solano, Jesรบs, Jarcรญa, Hernรกn, Zarruk, David, Bahnsen, Alejandro Correa, Valencia, Carlos
Privacy-preserving machine learning in data-sharing processes is an ever-critical task that enables collaborative training of Machine Learning (ML) models without the need to share the original data sources. It is especially relevant when an organization must assure that sensitive data remains private throughout the whole ML pipeline, i.e., training and inference phases. This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data. Thus, organizations can share the data representation to increase machine learning models' performance in scenarios with more than one data source for a shared predictive downstream task.
Federated Learning and Privacy
Machine learning and data science are key tools in science, public policy, and the design of products and services thanks to the increasing affordability of collecting, storing, and processing large quantities of data. But centralized collection can expose individuals to privacy risks and organizations to legal risks if data is not properly managed. Starting with early work in 2016,13,15 an expanding community of researchers has explored how data ownership and provenance can be made first-class concepts in systems for learning and analytics in areas now known as federated learning (FL) and federated analytics (FA). With this expanding community, interest has broadened from the initial work on federations of mobile devices to include FL across organizational silos, Internet of Things (IoT) devices, and more. In light of this, Kairouz et al.10 proposed a broader definition: Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client's raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective. An approach very similar in both philosophy and implementation, federated analytics17 can be taken to allow data scientists to generate analytical insight from the combined information in decentralized datasets. While the focus here is on FL, much of the discussion on technology and privacy applies equally well to FA use cases.
Council Post: Where Insights Meet Privacy: Privacy-Preserving Machine Learning
Artificial intelligence (AI) and machine learning (ML) have the power to deliver business value and impact across a wide range of use cases, which has led to their rapidly increasing deployment across verticals. For example, the financial services industry is investing significantly in leveraging machine learning to monetize data assets, improve customer experience and enhance operational efficiencies. According to the World Economic Forum's 2020 "Global AI in Financial Services Survey," AI and ML are expected to "reach ubiquitous importance within two years." However, as the rise and adoption of AI/ML parallels that of global privacy demand and regulation, businesses must be mindful of the security and privacy considerations associated with leveraging machine learning. The implications of these regulations affect the collaborative use of AI/ML not only between entities but also internally, as they limit an organization's ability to use and share data between business segments and jurisdictions.
Challenges of Privacy-Preserving Machine Learning in IoT
Zheng, Mengyao, Xu, Dixing, Jiang, Linshan, Gu, Chaojie, Tan, Rui, Cheng, Peng
The Internet of Things (IoT) will be a main data generation infrastructure for achieving better system intelligence. However, the extensive data collection and processing in IoT also engender various privacy concerns. This paper provides a taxonomy of the existing privacy-preserving machine learning approaches developed in the context of cloud computing and discusses the challenges of applying them in the context of IoT. Moreover, we present a privacy-preserving inference approach that runs a lightweight neural network at IoT objects to obfuscate the data before transmission and a deep neural network in the cloud to classify the obfuscated data. Evaluation based on the MNIST dataset shows satisfactory performance.