MAGIC: Near-Optimal Data Attribution for Deep Learning
Andrew Ilyas, Logan Engstrom
A fundamental problem when building machine learning systems is to predict counterfactuals about model behavior. For example, scaling laws [KMH+20; Has21; MRB+23] aim to predict the performance of systems trained with more data and more compute than is currently available; interpretability techniques [KWG+18] predict how models behave under counterfactual inputs. Analogously, in this work we study predictive data attribution (or datamodeling [IPE+22]), where the goal is to predict how a model would behave if it had been trained on a different dataset. This well-studied problem encompasses, e.g., estimating the effect (on the resulting trained model's predictions) of modifying a training example [KL17], removing a group of training examples [KAT+19; BNL+22; PGI+23], or adding entire training data sources [LSZ+24].

Predictive data attribution in large-scale settings is challenging: it requires simulating training a model on a different dataset without actually training it [GWP+23; IGE+24]. In "classical" settings, where learning corresponds to minimizing a convex loss, statistical tools like the influence function [Ham47] allow us to accurately and efficiently estimate how different training data choices change trained model predictions [RM18; KAT+19; GSL+19]. However, in the non-convex settings that are ubiquitous in natural domains like language/vision, current methods are less effective. Indeed, the best existing methods produce estimates that typically (a) only moderately correlate with the ground truth [BPF21; BNL+22; PGI+23] and (b) incur large absolute error [BNL+22].
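To make the datamodeling setup concrete, the following is a minimal, hypothetical sketch (not the paper's method): represent each training subset as a 0/1 inclusion mask, observe the model output after retraining on each subset, and fit a linear surrogate by least squares that predicts the output for unseen subsets. All names and the simulated data are illustrative assumptions.

```python
# Minimal sketch of a linear "datamodel": predict a model output from a
# training-subset inclusion mask. The data here is simulated, not real
# retraining runs; in practice each output would come from training a model.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 50, 200

# Unknown per-example effects that the datamodel tries to recover.
true_theta = rng.normal(size=n_train)

# Each row is a 0/1 mask: which training examples were included in that run.
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

# Observed model outputs for each retrained subset (simulated as linear in
# the mask plus noise, so the linear surrogate is well-specified here).
outputs = masks @ true_theta + 0.1 * rng.normal(size=n_subsets)

# Fit the datamodel parameters by ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(masks, outputs, rcond=None)

# Predict the counterfactual output for a new, unseen subset.
new_mask = (rng.random(n_train) < 0.5).astype(float)
predicted = new_mask @ theta_hat
```

In this well-specified toy setting the fitted coefficients recover the true per-example effects closely; the hard part in deep learning, as the abstract notes, is that real model outputs are not a simple linear function of the training set.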
Apr-23-2025