
What Can We Learn from Unlearnable Datasets?

Neural Information Processing Systems

In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But beyond the practical limitations that make their use unlikely, we present several findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets learn only shortcuts: simpler rules that are not useful for generalization. In contrast, we find that networks can in fact learn useful features that can be reweighted for high test performance, suggesting that image protection is not assured. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of the added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition.
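The linear-separability claim discussed above can be checked directly: if the class-wise perturbations alone suffice to predict labels with a linear model, they constitute a shortcut. Below is a minimal, self-contained sketch of such a linear probe on synthetic class-wise perturbations; the toy data, dimensions, and the plain gradient-descent logistic regression are all illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "class-wise" perturbations: each class gets a fixed additive pattern
# plus small noise, mimicking how many unlearnable-dataset methods operate
# (a hypothetical stand-in for real perturbations).
n_classes, dim, per_class = 3, 16, 50
patterns = rng.normal(size=(n_classes, dim))
X = np.vstack([patterns[c] + 0.1 * rng.normal(size=(per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

# Fit multinomial logistic regression by gradient descent: if the
# perturbations are (near-)linearly separable, training accuracy nears 1.
W = np.zeros((dim, n_classes))
onehot = np.eye(n_classes)[y]
for _ in range(500):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * X.T @ (p - onehot) / len(X)

acc = (np.argmax(X @ W, axis=1) == y).mean()
print(f"linear probe accuracy on perturbations: {acc:.2f}")
```

A probe accuracy near 1.0 indicates a linearly separable shortcut; the paper's counterexample shows that unlearnability can also arise when this probe fails.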


Towards Operationalizing Right to Data Protection

Java, Abhinav, Shahid, Simra, Agarwal, Chirag

arXiv.org Artificial Intelligence

The recent success of large language models (LLMs) has exposed the vulnerability of public data, as these models are trained on data scraped at scale from public forums and news articles [Touvron et al., 2023] without consent, and the collection of this data remains largely unregulated. As a result, governments worldwide have passed several regulatory frameworks, such as the GDPR [Voigt and Von dem Bussche, 2017] in the EU, the Personal Information Protection and Electronic Documents Act in Canada [PIPEDA], the Data Protection Act in the UK [DPA], the Personal Data Protection Commission (PDPC) [Commission et al., 2022] in Singapore, and the EU AI Act [Neuwirth, 2022], to safeguard algorithmic decisions and data usage practices. These legislative frameworks emphasize individuals' rights over how their data is used, even in public contexts. The laws are not limited to private or sensitive data but also encompass the ethical use of publicly accessible information, especially in contexts where such data is used for profiling, decision-making, or large-scale commercial gain. Despite these regulatory efforts, state-of-the-art LLMs are increasingly used in real-world applications to exploit personal data: predicting political affiliations [Rozado, 2024, Hernandes, 2024], reproducing societal biases [Liang et al., 2021, Dong et al., 2024], and inferring sensitive information about individuals [Wan et al., 2023b, Salewski et al., 2024, Suman et al., 2021], highlighting significant gaps between research and regulatory frameworks. In this work, we make a first attempt to operationalize one principle of the right to data protection in algorithmic practice, namely that people have control over their online data, and propose R


Learning from Convolution-based Unlearnable Datasets

Kim, Dohyun, Sandoval-Segura, Pedro

arXiv.org Artificial Intelligence

The construction of large datasets for deep learning has raised concerns regarding unauthorized use of online data, leading to increased interest in protecting data from third parties who want to use it for training. The Convolution-based Unlearnable DAtaset (CUDA) method aims to make data unlearnable by applying class-wise blurs to every image in the dataset so that neural networks learn relations between blur kernels and labels, as opposed to informative features for classifying clean data. In this work, we evaluate whether CUDA data remains unlearnable after image sharpening and frequency filtering, finding that this combination of simple transforms improves the utility of CUDA data for training. In particular, we observe a substantial increase in test accuracy over adversarial training for models trained with CUDA unlearnable data from CIFAR-10, CIFAR-100, and ImageNet-100. By training models to high accuracy using unlearnable data, we underscore the need for ongoing refinement of data poisoning techniques to ensure data privacy. Our method opens new avenues for enhancing the robustness of unlearnable datasets by highlighting that simple methods such as sharpening and frequency filtering are capable of breaking convolution-based unlearnable datasets.
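The two transforms named in the abstract, sharpening and frequency filtering, can be sketched on a single-channel image with plain numpy. The 3x3 sharpening kernel and the centered low-pass mask below are generic textbook choices, not the paper's exact recipe or parameters.

```python
import numpy as np

def sharpen(img):
    """Convolve with a common 3x3 sharpening kernel (edge-padded)."""
    k = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
    pad = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (pad[i:i + 3, j:j + 3] * k).sum()
    return out

def lowpass(img, keep=0.5):
    """Zero out high frequencies in the 2D DFT, keeping a centered band."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    mask = np.zeros_like(F)
    ch, cw = int(h * keep / 2), int(w * keep / 2)
    mask[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# Applying both transforms to a toy 16x16 image, as a stand-in for
# preprocessing a class-wise-blurred CUDA sample before training.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
recovered = sharpen(img)
recovered = lowpass(recovered, keep=0.75)
print(recovered.shape)
```

In the paper's setting, such preprocessing is applied to every training image before fitting the model, weakening the blur-label shortcut.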


Nonlinear Transformations Against Unlearnable Datasets

Hapuarachchi, Thushari, Lin, Jing, Xiong, Kaiqi, Rahouti, Mohamed, Ost, Gitte

arXiv.org Artificial Intelligence

Automated scraping stands out as a common method for collecting data for deep learning models without the authorization of data owners. Recent studies have begun to tackle the privacy concerns associated with this data collection method. Notable approaches include Deepconfuse, error-minimizing, error-maximizing (also known as adversarial poisoning), Neural Tangent Generalization Attack, synthetic, autoregressive, One-Pixel Shortcut, Self-Ensemble Protection, Entangled Features, Robust Error-Minimizing, Hypocritical, and TensorClog. The data generated by these approaches, called "unlearnable" examples, are intended to prevent deep learning models from learning. In this research, we investigate and devise an effective nonlinear transformation framework, and we conduct extensive experiments demonstrating that a deep neural network can effectively learn from examples traditionally considered unlearnable, as produced by the above twelve approaches. The resulting approach breaks unlearnable data more effectively than the linear-separability technique recently proposed by researchers. Specifically, our extensive experiments show improvements ranging from 0.34% to 249.59% on the unlearnable CIFAR10 datasets generated by those twelve data protection approaches, except for One-Pixel Shortcut. Moreover, the proposed framework achieves over 100% improvement in test accuracy for the autoregressive and REM approaches compared to the linear-separability technique. Our findings suggest that these approaches are inadequate for preventing unauthorized use of data in machine learning models. There is an urgent need to develop more robust protection mechanisms that effectively thwart an attacker from accessing data without proper authorization from the owners.
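The core idea, applying nonlinear transformations to unlearnable images before training, can be illustrated with simple pixel-wise maps. The specific transforms below (gamma remapping and a logistic contrast squash) are illustrative assumptions; the paper's actual framework and its transform choices may differ.

```python
import numpy as np

def gamma_transform(x, gamma=0.5):
    """Nonlinear brightness remapping; inputs assumed to lie in [0, 1]."""
    return np.clip(x, 0.0, 1.0) ** gamma

def logistic_squash(x, k=10.0):
    """Smooth S-shaped contrast transform centered at 0.5."""
    return 1.0 / (1.0 + np.exp(-k * (x - 0.5)))

def nonlinear_pipeline(x):
    """Compose the two nonlinear maps, as a toy preprocessing stage
    applied to perturbed images before they are used for training."""
    return logistic_squash(gamma_transform(x))

rng = np.random.default_rng(0)
batch = rng.random((4, 8, 8))        # toy batch of "unlearnable" images
out = nonlinear_pipeline(batch)
print(out.shape)
```

Because the maps are nonlinear, additive perturbations crafted to be linearly separable in pixel space are distorted, which is the intuition behind why such preprocessing can defeat shortcut-based protection.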


PosCUDA: Position based Convolution for Unlearnable Audio Datasets

Gokul, Vignesh, Dubnov, Shlomo

arXiv.org Artificial Intelligence

Deep learning models require large amounts of clean data to achieve good performance. To avoid the cost of expensive data acquisition, researchers use the abundant data available on the internet. This raises significant privacy concerns about the potential misuse of personal data for model training without authorisation. Recent works such as CUDA propose solutions to this problem by adding class-wise blurs to make datasets unlearnable, i.e., a model can never use the acquired dataset for learning. However, these methods often reduce the quality of the data, making it useless for practical applications. We introduce PosCUDA, a position-based convolution for creating unlearnable audio datasets. PosCUDA uses class-wise convolutions on small patches of audio. The locations of the patches are based on a private key for each class, so the model learns relations between positional blurs and labels while failing to generalize. We empirically show that PosCUDA can achieve unlearnability while maintaining the quality of the original audio datasets. Our proposed method is also robust to different audio feature representations, such as MFCC and raw audio, and to different architectures, such as transformers and convolutional networks.
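The key-dependent positional blur described above can be sketched for a 1-D signal: a per-class private key deterministically selects a patch, and only that patch is blurred. The keying scheme (hashing key and label) and the moving-average blur are illustrative assumptions, not PosCUDA's exact construction.

```python
import hashlib
import numpy as np

def class_patch_start(label, key, signal_len, patch_len):
    """Derive a deterministic patch position from a per-class private key
    (an illustrative scheme; PosCUDA's actual keying may differ)."""
    digest = hashlib.sha256(f"{key}:{label}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % (signal_len - patch_len)

def apply_positional_blur(audio, label, key, patch_len=64):
    """Apply a moving-average blur to one key-determined patch only."""
    start = class_patch_start(label, key, len(audio), patch_len)
    out = audio.copy()
    kernel = np.ones(5) / 5.0
    out[start:start + patch_len] = np.convolve(
        audio[start:start + patch_len], kernel, mode="same")
    return out

rng = np.random.default_rng(0)
wave = rng.normal(size=16000)        # one second of toy 16 kHz audio
protected = apply_positional_blur(wave, label=3, key="secret")
changed = np.flatnonzero(wave != protected)
print(len(changed))                  # only samples inside the keyed patch differ
```

Because the blur location is a fixed function of the class, a network can latch onto the position-label correlation as a shortcut, while the rest of the signal is left untouched, preserving audio quality.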


What Can We Learn from Unlearnable Datasets?

Sandoval-Segura, Pedro, Singla, Vasu, Geiping, Jonas, Goldblum, Micah, Goldstein, Tom

arXiv.org Artificial Intelligence

In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But beyond the practical limitations that make their use unlikely, we present several findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets learn only shortcuts: simpler rules that are not useful for generalization. In contrast, we find that networks can in fact learn useful features that can be reweighted for high test performance, suggesting that image protection is not assured. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of the added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition. To emphasize why linearly separable perturbations should not be relied upon, we propose an orthogonal projection attack which allows learning from unlearnable datasets published at ICML 2021 and ICLR 2023. Our proposed attack is significantly less complex than recently proposed techniques.
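The orthogonal projection idea can be sketched on synthetic data: fit a linear map from poisoned inputs to labels to estimate the perturbation directions, then project the data onto the orthogonal complement of those directions before training. The toy data and the least-squares fit below are illustrative assumptions, not the paper's exact attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy poisoned data: useful signal plus a dominant class-wise perturbation
# along fixed orthonormal directions (a synthetic stand-in for real images).
n, dim, n_classes = 300, 32, 3
labels = rng.integers(0, n_classes, size=n)
dirs = np.linalg.qr(rng.normal(size=(dim, n_classes)))[0]   # dim x n_classes
signal = rng.normal(size=(n, dim))
X = signal + 5.0 * dirs[:, labels].T

# Step 1: estimate perturbation directions via a linear fit onto one-hot
# labels (least squares as a simplification of logistic regression).
onehot = np.eye(n_classes)[labels]
W, *_ = np.linalg.lstsq(X, onehot, rcond=None)

# Step 2: project data onto the orthogonal complement of span(W),
# removing the linearly separable component before training.
Q, _ = np.linalg.qr(W)
X_clean = X - (X @ Q) @ Q.T

# After projection, the linear shortcut direction carries ~no signal.
residual = np.linalg.norm(X_clean @ W) / np.linalg.norm(X @ W)
print(f"residual shortcut energy: {residual:.2e}")
```

A model is then trained on `X_clean`; with the shortcut component projected out, learning must rely on the remaining (useful) features.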