datamodel


DataMIL: Selecting Data for Robot Imitation Learning with Datamodels

Dass, Shivin, Khaddaj, Alaa, Engstrom, Logan, Madry, Aleksander, Ilyas, Andrew, Martín-Martín, Roberto

arXiv.org Artificial Intelligence

Recently, the robotics community has amassed ever larger and more diverse datasets to train generalist robot policies. However, while these policies achieve strong mean performance across a variety of tasks, they often underperform on individual, specialized tasks and require further tuning on newly acquired task-specific data. Combining task-specific data with carefully curated subsets of large prior datasets via co-training can produce better specialized policies, but selecting data naively may actually harm downstream performance. To address this, we introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm that reasons about data selection in an end-to-end manner, using the policy itself to identify which data points will most improve performance. Unlike standard practices that filter data using human notions of quality (e.g., based on semantic or visual similarity), DataMIL directly optimizes data selection for task success, allowing us to select data that enhance the policy while dropping data that degrade it. To avoid performing expensive rollouts in the environment during selection, we use a novel surrogate loss function on task-specific data, allowing us to use DataMIL in the real world without degrading performance. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks, most notably showing successful data selection from the Open X-Embodiment datasets, and demonstrate consistent gains in success rates and superior performance over multiple baselines. Our results underscore the importance of end-to-end, performance-aware data selection for unlocking the potential of large prior datasets in robotics. More information at https://robin-lab.cs.utexas.edu/datamodels4imitation/
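As a concrete illustration of the selection step only (not the paper's implementation), the sketch below assumes datamodel weights have already been estimated: `datamodel_weights[i, j]` is a hypothetical estimate of how much including prior datapoint i improves the surrogate loss on task-specific example j. Selection then keeps the most helpful points while dropping points predicted to hurt.

```python
import numpy as np

def select_cotraining_data(datamodel_weights: np.ndarray, k: int) -> np.ndarray:
    """Return indices of up to k prior datapoints predicted to help most.

    datamodel_weights[i, j]: estimated effect of prior datapoint i on the
    surrogate loss for task-specific example j (estimation not shown).
    """
    # Aggregate each prior datapoint's predicted effect across all
    # task-specific examples.
    scores = datamodel_weights.sum(axis=1)
    # Rank descending; never select a point whose predicted net effect
    # is negative, since it is expected to *degrade* the policy.
    ranked = np.argsort(-scores)[:k]
    return ranked[scores[ranked] > 0]

# Example: 10,000 prior datapoints scored against 50 target demos.
rng = np.random.default_rng(0)
weights = rng.normal(size=(10_000, 50))
selected = select_cotraining_data(weights, k=1_000)
```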


Attribute-to-Delete: Machine Unlearning via Datamodel Matching

Georgiev, Kristian, Rinberg, Roy, Park, Sung Min, Garg, Shivam, Ilyas, Andrew, Madry, Aleksander, Neel, Seth

arXiv.org Artificial Intelligence

Machine unlearning -- efficiently removing the effect of a small "forget set" of training data on a pre-trained machine learning model -- has recently attracted significant research interest. Despite this interest, however, recent work shows that existing machine unlearning techniques do not hold up to thorough evaluation in non-convex settings. In this work, we introduce a new machine unlearning technique that exhibits strong empirical performance even in such challenging settings. Our starting point is the perspective that the goal of unlearning is to produce a model whose outputs are statistically indistinguishable from those of a model re-trained on all but the forget set. This perspective naturally suggests a reduction from the unlearning problem to that of data attribution, where the goal is to predict the effect of changing the training set on a model's outputs. Thus motivated, we propose the following meta-algorithm, which we call Datamodel Matching (DMM): given a trained model, we (a) use data attribution to predict the output of the model if it were re-trained on all but the forget set points; then (b) fine-tune the pre-trained model to match these predicted outputs. In a simple convex setting, we show how this approach provably outperforms a variety of iterative unlearning algorithms. Empirically, we use a combination of existing evaluations and a new metric based on the KL-divergence to show that even in non-convex settings, DMM achieves strong unlearning performance relative to existing algorithms. An added benefit of DMM is that it is a meta-algorithm, in the sense that future advances in data attribution translate directly into better unlearning algorithms, pointing to a clear direction for future progress in unlearning.
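A minimal sketch of the DMM meta-algorithm, with the attribution step left abstract: `predict_counterfactual_outputs` is a hypothetical placeholder for any data attribution method that predicts what the model would output if re-trained on all but the forget set.

```python
import torch
import torch.nn.functional as F

def datamodel_matching(model, forget_set, predict_counterfactual_outputs,
                       steps=100, lr=1e-4):
    # (a) Use data attribution to predict, for a reference batch of
    # inputs, the outputs of a model re-trained without the forget set.
    ref_inputs, target_logits = predict_counterfactual_outputs(model, forget_set)

    # (b) Fine-tune the pre-trained model to match those predicted outputs.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(model(ref_inputs), target_logits)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Because step (a) is a black box, any improvement in data attribution plugs directly into this loop, which is the sense in which DMM is a meta-algorithm.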


DsDm: Model-Aware Dataset Selection with Datamodels

Engstrom, Logan, Feldmann, Axel, Madry, Aleksander

arXiv.org Artificial Intelligence

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite can often happen: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. To develop better methods for selecting data, we start by framing dataset selection as an optimization problem that we can directly solve for: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework thus avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks. Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2x compute multiplier over baseline methods.
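Under a linear datamodel approximation, the resulting optimization has a particularly simple solution; the sketch below is illustrative only, with `tau` a hypothetical matrix of estimated per-example contributions to target-task performance.

```python
import numpy as np

def dsdm_select(tau: np.ndarray, k: int) -> np.ndarray:
    """Select the size-k training subset maximizing predicted performance.

    tau[i, t]: estimated reduction in loss on target task t from including
    candidate example i. Under a linear datamodel, subset effects add up,
    so the top-k examples by total estimated gain are optimal.
    """
    predicted_gain = tau.sum(axis=1)        # total effect across target tasks
    return np.argsort(-predicted_gain)[:k]  # greedy top-k = exact for linear models
```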


A Simple and Efficient Baseline for Data Attribution on Images

Singla, Vasu, Sandoval-Segura, Pedro, Goldblum, Micah, Geiping, Jonas, Goldstein, Tom

arXiv.org Artificial Intelligence

Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.
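A minimal sketch of this baseline, assuming some pretrained self-supervised encoder (the specific backbone is a choice, not prescribed by the abstract): attribution scores are simply cosine similarities in the encoder's feature space.

```python
import torch

@torch.no_grad()
def attribute(backbone, train_images, test_image, top_k=10):
    """Rank training images by feature-space similarity to a test image."""
    # Embed and L2-normalize so dot products are cosine similarities.
    train_feats = torch.nn.functional.normalize(backbone(train_images), dim=1)
    test_feat = torch.nn.functional.normalize(backbone(test_image[None]), dim=1)
    scores = train_feats @ test_feat.squeeze(0)  # one similarity per train image
    return scores.topk(top_k).indices            # most "responsible" training points
```

No model ensembles are needed: a single forward pass per image through the frozen backbone suffices, which is where the compute and memory savings come from.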


Rethinking Backdoor Attacks

Khaddaj, Alaa, Leclerc, Guillaume, Makelov, Aleksandar, Georgiev, Kristian, Salman, Hadi, Ilyas, Andrew, Madry, Aleksander

arXiv.org Artificial Intelligence

In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them. In this work, we present a different approach to the backdoor attack problem. Specifically, we show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data -- and thus impossible to "detect" in a general sense. Then, guided by this observation, we revisit existing defenses against backdoor attacks and characterize the (often latent) assumptions they make and on which they depend. Finally, we explore an alternative perspective on backdoor attacks: one that assumes these attacks correspond to the strongest feature in the training data. Under this assumption (which we make formal), we develop a new primitive for detecting backdoor attacks. Our primitive naturally gives rise to a detection algorithm that comes with theoretical guarantees and is effective in practice.
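The abstract does not spell out the primitive, but one natural way to instantiate "the backdoor is the strongest feature" with attribution tools is sketched below; the matrix `A` and the aggregation rule are assumptions for illustration only, not the paper's algorithm.

```python
import numpy as np

def flag_suspected_backdoor(A: np.ndarray, num_flagged: int) -> np.ndarray:
    """Flag training examples exerting the largest aggregate influence.

    A[i, j]: datamodel-style estimate of the influence of training
    example i on the model's prediction for example j. Examples that
    support a single strong shared feature (such as a trigger) show up
    as rows with unusually large total influence.
    """
    strength = np.abs(A).sum(axis=1)  # total influence exerted by each example
    return np.argsort(-strength)[:num_flagged]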


ModelDiff: A Framework for Comparing Learning Algorithms

Shah, Harshay, Park, Sung Min, Ilyas, Andrew, Madry, Aleksander

arXiv.org Artificial Intelligence

We study the problem of (learning) algorithm comparison, where the goal is to find differences between models trained with two different learning algorithms. We begin by formalizing this goal as one of finding distinguishing feature transformations, i.e., input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present ModelDiff, a method that leverages the datamodels framework (Ilyas et al., 2022) to compare learning algorithms based on how they use their training data. We demonstrate ModelDiff through three case studies, comparing models trained with/without data augmentation, with/without pre-training, and with different SGD hyperparameters. Our code is available at https://github.com/MadryLab/modeldiff .
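A rough sketch of the underlying idea (the paper's residual-projection procedure is more involved than this): represent each learning algorithm by its datamodel weight matrix, then look for training-set directions where the two matrices disagree most.

```python
import numpy as np

def distinguishing_directions(W1: np.ndarray, W2: np.ndarray, k: int = 3):
    """Find training-set directions used differently by two algorithms.

    W1, W2: (num_test_examples, num_train_examples) datamodel weights,
    one matrix per learning algorithm, rows aligned on the same test set.
    """
    # Normalize rows so the comparison reflects *how* data is weighted,
    # not overall magnitude.
    W1 = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    W2 = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    diff = W1 - W2
    # Top right-singular vectors = directions over the training set where
    # the two algorithms' data usage differs most.
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k]
```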


Understanding Influence Functions and Datamodels via Harmonic Analysis

Saunshi, Nikunj, Gupta, Arushi, Braverman, Mark, Arora, Sanjeev

arXiv.org Artificial Intelligence

It is often of great interest to quantify how the presence or absence of a particular training data point affects the trained model's performance on test data points. The influence function is a classical idea for this [Jaeckel, 1972; Hampel, 1974; Cook, 1977] that has recently been adapted to modern deep models and large datasets [Koh and Liang, 2017]. Influence functions have been applied to explain predictions and produce confidence intervals [Schulam and Saria, 2019], investigate model bias [Brunet et al., 2019; Wang et al., 2019], estimate Shapley values [Jia et al., 2019; Ghorbani and Zou, 2019], improve human trust [Zhou et al., 2019], and craft data poisoning attacks [Koh et al., 2019]. Influence admits several different formalizations. The classic calculus-based estimate (henceforth referred to as continuous influence) involves conceptualizing the training loss as a weighted sum over training datapoints, where the weight of a particular datapoint z can be varied infinitesimally.
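For reference, the continuous influence of up-weighting a training point $z$ on the loss at a test point $z_{\text{test}}$ has a standard closed form, as derived in Koh and Liang [2017]:

```latex
% \hat\theta is the empirical risk minimizer and H_{\hat\theta} is the
% Hessian of the training loss at \hat\theta.
\mathcal{I}(z, z_{\text{test}})
  \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
        H_{\hat\theta}^{-1}\,
        \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat\theta).
```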


Predicting Predictions with Datamodels

#artificialintelligence

How do learning algorithms use training data to arrive at a given prediction? This question is rarely an easy one to answer. On one hand, we know that predictions are a product of training data and learning algorithms. On the other hand, it is often hard to characterize exactly how these two elements interact. In our latest work, we introduce datamodels, a step towards acquiring a more fine-grained understanding of how learning algorithms use training data to make predictions. This post introduces the datamodeling framework, describes its simplest, linear instantiation, and illustrates its success in modeling the data-to-prediction mapping for deep neural networks.
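A toy sketch of that linear instantiation, with synthetic data standing in for real training runs: fit a sparse linear map from training-subset indicator vectors to a model output of interest on one fixed test example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Each row of `masks` is a 0/1 vector recording which training examples
# one model was trained on; `outputs[r]` is that model's output (e.g.
# correct-class margin) on a fixed test example. In practice these come
# from many training runs; here they are synthetic stand-ins.
num_models, train_size = 5_000, 1_000
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(num_models, train_size)).astype(float)
true_w = rng.normal(size=train_size) * (rng.random(train_size) < 0.05)
outputs = masks @ true_w + 0.1 * rng.normal(size=num_models)

# The sparse linear fit from inclusion masks to outputs *is* the
# datamodel for this test example.
datamodel = Lasso(alpha=1e-3).fit(masks, outputs)
weights = datamodel.coef_  # weights[i]: estimated effect of train example i
```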


Datamodels: Predicting Predictions from Training Data

Ilyas, Andrew, Park, Sung Min, Engstrom, Logan, Leclerc, Guillaume, Madry, Aleksander

arXiv.org Machine Learning

We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$ -- using only information about which examples of $S$ are contained in $S'$ -- predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data .
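In the linear instantiation studied in the paper, the datamodel for a target example $x$ takes the following form, with parameters fit by $\ell_1$-regularized regression over sampled subsets (the subset distribution $\mathcal{D}_S$ below is left abstract):

```latex
% Encode a subset S' \subseteq S as its indicator vector
% \mathbf{1}_{S'} \in \{0,1\}^{|S|}; f_{\mathcal{A}}(x; S') denotes the
% output of training with algorithm \mathcal{A} on S' and evaluating on x.
g_\theta(S') \;=\; \theta^{\top}\mathbf{1}_{S'} + \theta_0,
\qquad
\theta \;=\; \arg\min_{\theta}\;
\mathbb{E}_{S' \sim \mathcal{D}_S}\!
\left[ \bigl( g_\theta(S') - f_{\mathcal{A}}(x; S') \bigr)^2 \right]
+ \lambda \|\theta\|_1 .
```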