model trainer
A Gradient analysis
To better understand why our generated confounder noise makes the data unlearnable, we can also gain insight from the optimization gradients. Empirically, if an image produces a large gradient during backpropagation, it contains a lot of learnable knowledge, and vice versa. Figure 9 shows the accuracy curves of our method over the training epochs. Then we give a detailed discussion about this setting. To better understand this adaptive setting, we first illustrate the assumption on the data owner's ... The model trainer wishes to train a denoiser against the noise generated by the ConfounderGAN.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada > Ontario > Toronto (0.04)
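The gradient argument in the excerpt above can be illustrated with a toy model. The sketch below is not the ConfounderGAN code: it assumes a logistic-regression classifier as a stand-in for a deep network, and the function names and example values are hypothetical. It shows that an example the model already fits yields a small per-sample gradient norm, while an example it gets wrong yields a large one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_sample_grad_norm(w, x, y):
    """L2 norm of the log-loss gradient w.r.t. w for one example (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # For logistic regression, the per-sample gradient is (p - y) * x.
    grad = [(p - y) * xi for xi in x]
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical weights and two examples with identical features.
w = [2.0, -1.0]
x = [3.0, -2.0]
norm_fit = per_sample_grad_norm(w, x, 1)   # label agrees with the prediction
norm_poor = per_sample_grad_norm(w, x, 0)  # label contradicts the prediction
print(norm_fit < norm_poor)
```

Under this reading, noise that lets the model fit every training image through a spurious shortcut would drive per-sample gradients toward zero, leaving little signal from the true image content.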
Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models
Wu, Weimin, Song, Xuefeng, Wen, Yibo, Lin, Qinjie, Zhou, Zhihan, Hu, Jerry Yao-Chieh, Wang, Zhong, Liu, Han
We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California (0.04)
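As a rough illustration of the GC-content quality control mentioned in the abstract, the sketch below computes the GC fraction of a DNA sequence and filters sequences by it. This is not Genome-Factory's API: the function names and the 0.3 to 0.7 thresholds are assumptions made for the example, and a simple filter stands in for whatever normalization the library actually performs.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    total = gc + sum(seq.count(b) for b in "AT")
    # Ambiguous bases such as N are ignored; an empty sequence scores 0.0.
    return gc / total if total else 0.0

def filter_by_gc(seqs, low=0.3, high=0.7):
    """Keep sequences whose GC fraction lies in [low, high] (hypothetical bounds)."""
    return [s for s in seqs if low <= gc_content(s) <= high]

kept = filter_by_gc(["ATGC", "GGGGCC", "ATATAT"])
print(kept)
```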
zkUnlearner: A Zero-Knowledge Framework for Verifiable Unlearning with Multi-Granularity and Forgery-Resistance
Wang, Nan, Wu, Nan, Hui, Xiangyu, Wang, Jiafan, Yuan, Xin
As the demand for exercising the "right to be forgotten" grows, the need for verifiable machine unlearning has become increasingly evident to ensure both transparency and accountability. We present zkUnlearner, the first zero-knowledge framework for verifiable machine unlearning, specifically designed to support multi-granularity and forgery-resistance. First, we propose a general computational model that employs a bit-masking technique to enable the selectivity of existing zero-knowledge proofs of training for gradient descent algorithms. This innovation enables not only traditional sample-level unlearning but also more advanced feature-level and class-level unlearning. Our model can be translated to arithmetic circuits, ensuring compatibility with a broad range of zero-knowledge proof systems. Furthermore, our approach overcomes key limitations of existing methods in both efficiency and privacy. Second, forging attacks present a serious threat to the reliability of unlearning. Specifically, in Stochastic Gradient Descent optimization, gradients from unlearned data, or from minibatches containing it, can be forged using alternative data samples or minibatches that exclude it. We propose the first effective strategies to resist state-of-the-art forging attacks. Finally, we benchmark a zkSNARK-based instantiation of our framework and perform comprehensive performance evaluations to validate its practicality.
- North America > United States (0.04)
- Oceania > Australia (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
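The bit-masking idea in the abstract can be sketched outside of any proof system. The example below is an assumption-laden toy, not zkUnlearner's construction: it uses plain least-squares linear regression, involves no zero-knowledge proof, and all names are hypothetical. A 0/1 sample mask removes individual examples from a gradient-descent step, and a 0/1 feature mask removes individual features.

```python
def masked_gd_step(w, X, y, sample_mask, feature_mask, lr=0.1):
    """One gradient-descent step for least-squares linear regression.

    sample_mask[i] == 0 excludes example i; feature_mask[j] == 0 excludes feature j.
    """
    n_active = sum(sample_mask)
    if n_active == 0:
        return list(w)  # nothing left to learn from
    grad = [0.0] * len(w)
    for xi, yi, si in zip(X, y, sample_mask):
        xm = [xj * fj for xj, fj in zip(xi, feature_mask)]  # feature-level mask
        err = sum(wj * xj for wj, xj in zip(w, xm)) - yi
        for j, xj in enumerate(xm):
            grad[j] += si * err * xj  # sample-level mask
    return [wj - lr * gj / n_active for wj, gj in zip(w, grad)]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 10.0]
w0 = [0.0, 0.0]

# Masking out the third example matches training without it.
masked = masked_gd_step(w0, X, y, [1, 1, 0], [1, 1])
dropped = masked_gd_step(w0, X[:2], y[:2], [1, 1], [1, 1])
print(masked == dropped)
```

In this toy setting, class-level removal would amount to zeroing the sample mask for every example of one class, which is one plausible reading of the granularity the abstract describes.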
Machine Learning Models Have a Supply Chain Problem
Meiklejohn, Sarah, Blauzvern, Hayden, Maruseac, Mihai, Schrock, Spencer, Simon, Laurent, Shumailov, Ilia
Powerful machine learning (ML) models are now readily available online, which creates exciting possibilities for users who lack the deep technical expertise or substantial computing resources needed to develop them. On the other hand, this type of open ecosystem comes with many risks. In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited already in real attacks. These include an attacker replacing a model with something malicious (e.g., malware), or a model being trained using a vulnerable version of a framework or on restricted or poisoned data. We then explore how Sigstore, a solution designed to bring transparency to open-source software supply chains, can be used to bring transparency to open ML models, in terms of enabling model publishers to sign their models and prove properties about the datasets they use.
- Europe (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging
Feng, Jinghao, Zheng, Qiaoyu, Wu, Chaoyi, Zhao, Ziheng, Zhang, Ya, Wang, Yanfeng, Xie, Weidi
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well-prepared tools limits their applicability in the medical domain, which requires training specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi-agent system designed to automate machine learning (ML) in medical imaging. At its core, M3Builder employs four specialized agents that collaborate to tackle complex, multi-step medical ML workflows, from automated data processing and environment configuration to self-contained auto debugging and model training. These agents operate within a medical imaging ML workspace, a structured environment designed to provide agents with free-text descriptions of datasets, training codes, and interaction tools, enabling seamless communication and task execution. (ii) To evaluate progress in automated medical imaging ML, we propose M3Bench, a benchmark comprising four general tasks on 14 training datasets, across five anatomies and three imaging modalities, covering both 2D and 3D data. (iii) We experiment with seven state-of-the-art large language models serving as agent cores for our system, such as the Claude series, GPT-4o, and DeepSeek-V3. Compared to existing ML agentic designs, M3Builder shows superior performance in completing ML tasks in medical imaging, achieving a 94.29% success rate using Claude-3.7-Sonnet as the agent core, showing huge potential towards fully automated machine learning in medical imaging.
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
MYCROFT: Towards Effective and Efficient External Data Augmentation
Sarwar, Zain, Tran, Van, Bhagoji, Arjun Nitin, Feamster, Nick, Zhao, Ben Y., Chakraborty, Supriyo
Machine learning (ML) models often require large amounts of data to perform well. When the available data is limited, model trainers may need to acquire more data from external sources. Often, useful data is held by private entities who are hesitant to share their data due to proprietary and privacy concerns. This makes it challenging and expensive for model trainers to acquire the data they need to improve model performance. To address this challenge, we propose Mycroft, a data-efficient method that enables model trainers to evaluate the relative utility of different data sources while working with a constrained data-sharing budget. By leveraging feature space distances and gradient matching, Mycroft identifies small but informative data subsets from each owner, allowing model trainers to maximize performance with minimal data exposure. Experimental results across four tasks in two domains show that Mycroft converges rapidly to the performance of the full-information baseline, where all data is shared. Moreover, Mycroft is robust to noise and can effectively rank data owners by utility. Mycroft can pave the way for democratized training of high-performance ML models.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
Influence-based Attributions can be Manipulated
Yadav, Chhavi, Wu, Ruihan, Chaudhuri, Kamalika
Influence functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible and provide efficient attacks with backward-friendly implementations. Our work raises questions about the reliability of influence-based attributions under adversarial circumstances.
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > France (0.04)
Data Isotopes for Data Provenance in DNNs
Wenger, Emily, Li, Xiuyu, Zhao, Ben Y., Shmatikov, Vitaly
Today, creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with little control over or knowledge of when their data is appropriated for model training. To empower users to counteract unwanted data use, we design, implement and evaluate a practical system that enables users to detect if their data was used to train a DNN model. We show how users can create special data points we call isotopes, which introduce "spurious features" into DNNs during training. With only query access to a trained model and no knowledge of the model training process, or control of the data labels, a user can apply statistical hypothesis testing to detect if a model has learned the spurious features associated with their isotopes by training on the user's data. This effectively turns DNNs' vulnerability to memorization and spurious correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-service platforms and larger models such as those trained on ImageNet, can use physical objects instead of digital marks, and remains generally robust against several adaptive countermeasures.
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.67)
- Law (0.67)
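The detection step described in the abstract, hypothesis testing with only query access, can be sketched as follows. The abstract does not specify which test is used, so this example swaps in a simple one-sided exact sign test, and the function name and probabilities are hypothetical. The assumed setup is that a user queries the model on pairs of probe images, one carrying the isotope mark and one without it, and records the probability the model assigns to the isotope's label.

```python
from math import comb

def sign_test_p_value(marked_probs, clean_probs):
    """One-sided exact sign test: is the model more confident on marked probes?"""
    wins = sum(m > c for m, c in zip(marked_probs, clean_probs))
    losses = sum(m < c for m, c in zip(marked_probs, clean_probs))
    n = wins + losses  # ties carry no information and are discarded
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 1/2), the null of "no isotope effect".
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Hypothetical query results for ten probe pairs.
p_trained = sign_test_p_value([0.9] * 10, [0.1] * 10)
p_untrained = sign_test_p_value([0.6, 0.4] * 5, [0.5] * 10)
print(p_trained, p_untrained)
```

A small p-value suggests the model responds to the mark more often than chance would explain, which is the kind of evidence the abstract describes; a value near or above 0.5 gives no such evidence.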
ConfounderGAN: Protecting Image Data Privacy with Causal Confounder
Tian, Qi, Kuang, Kun, Jiang, Kelu, Liu, Furui, Wang, Zhihua, Wu, Fei
The success of deep learning is partly attributed to the availability of massive data downloaded freely from the Internet. However, it also means that users' private data may be collected by commercial organizations without consent and used to train their models. Therefore, it is important to develop a method or tool to prevent unauthorized data exploitation. In this paper, we propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners. Specifically, the noise produced by the generator for each image has the confounder property. It can build spurious correlations between images and labels, so that the model cannot learn the correct mapping from images to labels in this noise-added dataset. Meanwhile, the discriminator is used to ensure that the generated noise is small and imperceptible, thereby retaining the normal utility of the encrypted image for humans. The experiments are conducted on six image classification datasets, consisting of three natural object datasets and three medical datasets. The results demonstrate that our method not only outperforms state-of-the-art methods in standard settings, but can also be applied to fast encryption scenarios. Moreover, we show a series of transferability and stability experiments to further illustrate the effectiveness and superiority of our method.
- Asia > China > Shanghai > Shanghai (0.05)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)