Transformers for Tabular Data: A Training Perspective of Self-Attention via Optimal Transport
Candelieri, Antonio, Quadrio, Alessandro
This thesis examines self-attention training through the lens of Optimal Transport (OT) and develops an OT-based alternative for tabular classification. The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution using discrete OT metrics, including Wasserstein distance, Monge gap, optimality, and efficiency. Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset. Results indicate that the final self-attention mapping often approximates the OT optimal coupling, yet the training trajectory remains inefficient. Pretraining the MLP section on synthetic data partially improves convergence but is sensitive to its initialization. To address these limitations, an OT-based algorithm is introduced: it generates class-specific dummy Gaussian distributions, computes an OT alignment with the data, and trains an MLP to generalize this mapping. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs, though its performance depends on careful dummy-geometry design. All experiments and implementations are conducted in R.
Differentially Private Learned Indexes
Du, Jianzhang, Mudgal, Tilak, Gadre, Rutvi Rahul, Luo, Yukui, Wang, Chenghong
In this paper, we address the problem of efficiently answering predicate queries on encrypted databases, specifically those secured by Trusted Execution Environments (TEEs), which enable untrusted providers to process encrypted user data without revealing its contents. A common strategy in modern databases to accelerate predicate queries is the use of indexes, which map attribute values (keys) to their corresponding positions in a sorted data array. This allows for fast lookup and retrieval of data subsets that satisfy specific predicates. Unfortunately, indexes cannot be directly applied to encrypted databases due to strong data-dependent leakage. Recent approaches apply differential privacy (DP) to construct noisy indexes that enable faster access to encrypted data while maintaining provable privacy guarantees. However, these methods often suffer from large storage costs, with index sizes typically scaling linearly with the key space. To address this challenge, we propose leveraging learned indexes, a trending technique that repurposes machine learning models as indexing structures, to build more compact DP indexes.
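The core idea sketched in the abstract, i.e. a learned index that replaces a lookup table with a model mapping keys to positions, plus noise for differential privacy, can be illustrated with a toy NumPy example. This is a minimal sketch under assumed simplifications (a linear model, Laplace noise with scale 1/epsilon), not the paper's actual construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sorted key column standing in for an indexed attribute.
keys = np.sort(rng.integers(0, 10_000, size=1_000))
positions = np.arange(len(keys))

# A learned index in its simplest form: a linear model mapping key -> position,
# replacing a per-key lookup table (compact: two parameters instead of O(keyspace)).
slope, intercept = np.polyfit(keys, positions, deg=1)

def predict_position(key, epsilon=1.0):
    """Predict a position, perturbed with Laplace noise (scale 1/epsilon)."""
    est = slope * key + intercept
    noisy = est + rng.laplace(scale=1.0 / epsilon)
    return int(np.clip(round(noisy), 0, len(keys) - 1))

pos = predict_position(keys[500])
```

A real DP learned index would bound the model's prediction error and calibrate the noise to a formal sensitivity analysis; the sketch only shows why the representation is compact.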
Federated Learning under Attack: Improving Gradient Inversion for Batch of Images
Leite, Luiz, Santo, Yuri, Dalmazo, Bruno L., Riker, André
Federated Learning (FL) has emerged as a machine learning approach able to preserve the privacy of users' data. In FL, clients train machine learning models on local datasets, and a central server aggregates the learned parameters coming from the clients, training a global machine learning model without sharing users' data. However, the state of the art shows several approaches for attacking FL systems. For instance, gradient inversion or leakage attacks can recover, with high precision, the local dataset used during the FL training phase. This paper presents an approach, called Deep Leakage from Gradients with Feedback Blending (DLG-FB), which improves gradient inversion attacks by exploiting the spatial correlation that typically exists in batches of images. The evaluation shows improvements of 19.18% and 48.82% in terms of attack success rate and the number of iterations per attacked image, respectively.
Gradient Inversion of Federated Diffusion Models
Huang, Jiyue, Hong, Chi, Chen, Lydia Y., Roos, Stefanie
Diffusion models are becoming the de facto generative models, generating exceptionally high-resolution image data. Training effective diffusion models requires massive amounts of real data, which is privately owned by distributed parties. The data parties can collaboratively train diffusion models in a federated learning manner by sharing gradients instead of the raw data. In this paper, we study the privacy leakage risk of gradient inversion attacks. First, we design a two-phase fusion optimization, GIDM, which leverages the well-trained generative model itself as prior knowledge to constrain the inversion search (latent) space, followed by pixel-wise fine-tuning. GIDM is shown to reconstruct images almost identical to the original ones. Considering a more privacy-preserving training scenario, we then argue that locally initialized private training noise $\epsilon$ and sampling step $t$ may raise additional challenges for the inversion attack. To solve this, we propose a triple optimization, GIDM+, that coordinates the optimization of the unknown data, $\epsilon$, and $t$. Our extensive evaluation results demonstrate the vulnerability of sharing gradients for data protection of diffusion models: even high-resolution images can be reconstructed with high quality.
On the Efficiency of Privacy Attacks in Federated Learning
Tabassum, Nawrin, Chow, Ka-Ho, Wang, Xuyu, Zhang, Wenbin, Wu, Yanzhao
Recent studies have revealed severe privacy risks in federated learning, represented by Gradient Leakage Attacks. However, existing studies mainly aim at increasing the privacy attack success rate and overlook the high computation costs for recovering private data, making the privacy attack impractical in real applications. In this study, we examine privacy attacks from the perspective of efficiency and propose a framework for improving the Efficiency of Privacy Attacks in Federated Learning (EPAFL). We make three novel contributions. First, we systematically evaluate the computational costs of representative privacy attacks in federated learning, which reveals high potential for efficiency optimization. Second, we propose three early-stopping techniques to effectively reduce the computational costs of these privacy attacks. Third, we perform experiments on benchmark datasets and show that our proposed method can significantly reduce computational costs while maintaining comparable attack success rates for state-of-the-art privacy attacks in federated learning. We provide the code on GitHub at https://github.com/mlsysx/EPAFL.
Introduction to Probabilistic Classification: A Machine Learning Perspective
You can already train and evaluate classification models, both linear and non-linear. Now you want class probabilities instead of class labels. This is the article you are looking for. It walks you through the different evaluation metrics, their pros and cons, and optimal model training for multiple ML models. Imagine creating a model with the sole purpose of classifying cats and dogs.
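The switch from class labels to class probabilities is a one-line change in most libraries. A minimal sketch, assuming scikit-learn and a made-up 1-D "cat vs. dog" feature (the data and class names are illustrative, not from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D feature: small values -> class 0 ("cat"), large values -> class 1 ("dog").
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

labels = clf.predict([[2.5], [10.5]])       # hard class labels
probs = clf.predict_proba([[2.5], [10.5]])  # per-class probabilities, rows sum to 1
```

`predict` collapses the model's score to a single label, while `predict_proba` exposes the underlying probability of each class, which is what the evaluation metrics discussed in the article operate on.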
How to Create Dummy Data in Python
Dummy data is randomly generated data that can be substituted for live data. Whether you are a developer, software engineer, or data scientist, you sometimes need dummy data to test what you have built, be it a web app, mobile app, or machine learning model. If you are working in Python, you can use the Faker package to create dummy data of many types, for example dates, transactions, names, texts, and times. Faker is a simple Python package that generates fake data with different data types, and it is heavily inspired by PHP Faker, Perl Faker, and Ruby Faker.
Machine Learning Algorithms. Here's the End-to-End.
While there are several documents and articles on machine learning algorithms, I wanted to provide a summary of the most common ones I use as a professional data scientist. Additionally, I will include some sample code with dummy data so that you can start executing various models! Whereas unsupervised learning, like the commonly used K-means algorithm, aims to group similar data points together without labels, supervised learning, or classification -- well, classifies data into various categories. A simple example of classification is described below. The classification model learns from features of the fruits to assign a fruit label to an input food item.
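The fruit example above can be sketched in a few lines. This is a toy illustration with made-up features and labels (weight and diameter are assumptions, not the article's data), assuming scikit-learn:

```python
from sklearn.tree import DecisionTreeClassifier

# Dummy fruit data: [weight_g, diameter_cm] -> fruit label.
X = [[150, 7.0], [160, 7.5], [120, 6.0], [30, 3.0], [25, 2.8], [35, 3.2]]
y = ["apple", "apple", "apple", "plum", "plum", "plum"]

# The classifier learns from the fruit features to assign labels.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# An unseen fruit with apple-like features gets classified.
pred = clf.predict([[140, 6.8]])[0]
```

Swapping `DecisionTreeClassifier` for any other scikit-learn classifier keeps the same `fit`/`predict` interface, which is why dummy data like this is enough to start executing various models.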
Understanding K-Means Clustering using Python the easy way
In the previous article, we studied k-NN. One thing I believe is that if we can relate a concept to ourselves or our lives, we have a much better chance of understanding it. So I will try to explain everything by relating it to humans. K-means tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different, or as far apart, as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and each cluster's centroid is at a minimum.
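The two steps implied by that objective, i.e. assign each point to its nearest centroid, then move each centroid to the mean of its points, are the whole algorithm. A minimal NumPy sketch on made-up blob data (the data and the fixed iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 2-D blobs of 50 points each.
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

def kmeans(points, k, iters=10):
    # Initialize centroids as k randomly chosen data points.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (minimizing squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(data, k=2)
```

Each iteration can only decrease the sum of squared distances, which is why the loop converges; production code would also handle empty clusters and use a convergence test instead of a fixed iteration count.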