AITopics | data shapley


bdd5522a32b3a959a6d81fb6ddc1cb38-Paper-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 20:53:34 GMT

data mining, machine learning, val, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Virginia (0.04)
Europe > France (0.04)
North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.04)
(4 more...)

Genre:

Overview (0.67)
Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.45)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(4 more...)


Do Data Valuations Make Good Data Prices?

Fan, Dongyang, Rotello, Tyler J., Karimireddy, Sai Praneeth

arXiv.org Artificial Intelligence | Sep-29-2025

As large language models increasingly rely on external data sources, compensating data contributors has become a central concern. But how should these payments be devised? We revisit data valuations from a market-design perspective where payments serve to compensate data owners for the private heterogeneous costs they incur for collecting and sharing data. We show that popular valuation methods, such as Leave-One-Out and Data Shapley, make for poor payments. They fail to ensure truthful reporting of the costs, leading to inefficient market outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We show that Myerson payment is the minimal truthful mechanism, optimal from the buyer's perspective. Additionally, we identify a condition under which both data buyers and sellers are utility-satisfied, and the market achieves efficiency. Our findings highlight the importance of incorporating incentive compatibility into data valuation design, paving the way for more robust and efficient data markets. Our data market framework is readily applicable to real-world scenarios. We illustrate this with simulations of contributor compensation in an LLM-based retrieval-augmented generation (RAG) marketplace tasked with challenging medical question answering.
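To make the notion of a truthful payment rule concrete, here is a minimal sketch of the textbook single-winner procurement (reverse second-price) auction, which is the VCG rule in that simple setting. It is not the data market mechanism from the paper; the seller names, costs, and function names are hypothetical and chosen only for illustration.

```python
def reverse_vickrey(reported_costs):
    """Pick the seller with the lowest reported cost; pay the second-lowest report."""
    ranked = sorted(reported_costs, key=reported_costs.get)
    winner, runner_up = ranked[0], ranked[1]
    return winner, reported_costs[runner_up]


def seller_utility(seller, true_cost, reported_costs):
    """Utility = payment minus true cost if the seller wins, else zero."""
    winner, payment = reverse_vickrey(reported_costs)
    return payment - true_cost if winner == seller else 0.0


# Hypothetical private costs of three data sellers.
true_costs = {"alice": 3.0, "bob": 5.0, "carol": 8.0}

# Alice reports truthfully, overstates, or understates her cost.
truthful = seller_utility("alice", 3.0, true_costs)
overstated = seller_utility("alice", 3.0, {**true_costs, "alice": 6.0})
understated = seller_utility("alice", 3.0, {**true_costs, "alice": 1.0})

# Bob understates to win, but is then paid less than his true cost.
bob_lies = seller_utility("bob", 5.0, {**true_costs, "bob": 2.0})

print(truthful, overstated, understated, bob_lies)
```

In this toy setting no misreport beats truthful reporting: the payment to the winner does not depend on the winner's own report, which is the property that value-based payments such as Leave-One-Out or Data Shapley are said to lack.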

large language model, machine learning, payment, (21 more...)

arXiv.org Artificial Intelligence

2504.05563

Country: North America > United States (0.28)

Genre: Research Report (0.84)

Industry:

Law (0.46)
Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)


Capturing the Temporal Dependence of Training Data Influence

Wang, Jiachen T., Song, Dawn, Zou, James, Mittal, Prateek, Jia, Ruoxi

arXiv.org Machine Learning | Dec-12-2024

Traditional data influence estimation methods, such as influence functions, assume that learning algorithms are permutation-invariant with respect to training data. However, modern training paradigms, especially for foundation models using stochastic algorithms and multi-stage curricula, are sensitive to data ordering, thus violating this assumption. This mismatch renders influence functions inadequate for answering a critical question in machine learning: How can we capture the dependence of data influence on the optimization trajectory during training? To address this gap, we formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point from a specific iteration during training, accounting for the exact sequence of data encountered and the model's optimization trajectory. However, exactly evaluating the trajectory-specific LOO presents a significant computational challenge. To address this, we propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. Specifically, we compute a training data embedding that encapsulates the cumulative interactions between data and the evolving model parameters. The LOO can then be efficiently approximated through a simple dot-product between the data value embedding and the gradient of the given test data. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics. In particular, we uncover distinct phases of data influence, revealing that data points in the early and late stages of training exert a greater impact on the final model. These insights translate into actionable strategies for managing the computational overhead of data selection by strategically timing the selection process, potentially opening new avenues in data curation research.

influence function, iteration, training data, (15 more...)

arXiv.org Machine Learning

2412.09538

Country:

North America > United States > Virginia (0.04)
Europe > France (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Promising Solution (0.65)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)


Towards Data Valuation via Asymmetric Data Shapley

Zheng, Xi, Chang, Xiangyu, Jia, Ruoxi, Tan, Yong

arXiv.org Artificial Intelligence | Nov-20-2024

Data valuation, which measures the contribution of individual data sources to machine learning (ML) model performance, plays a crucial role in improving algorithmic transparency and creating incentive mechanisms for data sharing and monetization (Liu et al., 2023). Its importance is particularly evident in sectors like healthcare and finance, where explainable ML is increasingly being adopted for high-stakes decision-making (Sahoh and Choksuriwong, 2023). The recent rise of data marketplaces further highlights the need for accurate data valuation (Ghorbani and Zou, 2019; Jia et al., 2019a). By integrating diverse data sources, these marketplaces enhance ML tasks and unlock significant business value (Agarwal et al., 2019). Fair compensation for data creators based on the value of their data is crucial in such contexts, making the equitable valuation of data a key issue (Altman, 2023). Data Shapley has recently gained widespread recognition for quantifying the contribution of individual data points to ML models (Ghorbani and Zou, 2019; Jia et al., 2019b). It is uniquely defined by four axioms (see Axiom 2.1-2.4 in Section 2).
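As a rough illustration of what Data Shapley computes, the sketch below evaluates exact Shapley values by enumerating every subset of a four-point toy dataset. The utility function is a hypothetical stand-in for "model performance when trained on this subset" and is not taken from the paper; the checks at the end correspond to the classical efficiency, symmetry, and null-player properties, which may not match the paper's axioms word for word.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, utility):
    """Exact Shapley values by full subset enumeration (only feasible for tiny n)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k preceding p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(subset):
    """Hypothetical utility: 'a' adds 0.5, 'b' and 'c' are redundant, 'd' is useless."""
    score = 0.0
    if "a" in subset:
        score += 0.5
    if "b" in subset or "c" in subset:
        score += 0.3
    return score


players = ["a", "b", "c", "d"]
values = exact_shapley(players, toy_utility)
for name in players:
    print(name, round(values[name], 4))
```

The redundant points "b" and "c" split their shared 0.3 equally, the useless point "d" receives zero, and the values sum to the utility of the full dataset. The enumeration costs on the order of 2^n utility evaluations, which is why practical methods rely on approximation.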

data shapley, dataset, training dataset, (13 more...)

arXiv.org Artificial Intelligence

2411.00388

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France (0.04)
North America > United States > Virginia (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)


Targeted synthetic data generation for tabular data via hardness characterization

Ferracci, Tommaso, Goldmann, Leonie Tabea, Hinel, Anton, Passino, Francesco Sanna

arXiv.org Machine Learning | Oct-1-2024

Synthetic data generation has been proven successful in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant theoretical and computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large-scale credit default prediction task. In particular, our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods. Training complex machine learning models requires large amounts of data, but in real-world applications data may be of poor quality, insufficient in amount, or subject to privacy, safety, and regulatory limitations. Such challenges have sparked an interest in synthetic data generation (SDG), representing the practice of using available data to generate realistic synthetic samples (Lu et al., 2024). In this work, we argue that, when the objective is to use synthetic data to make an existing machine learning model generalize better to unseen data, augmenting only the hardest training points is more effective than augmenting the entire training dataset.

dataset, knn shapley, shapley, (14 more...)

arXiv.org Machine Learning

2410.00759

Country: Europe > France (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


Uncertainty Quantification of Data Shapley via Statistical Inference

Wu, Mengmeng, Liu, Zhihong, Li, Xiang, Jia, Ruoxi, Chang, Xiangyu

arXiv.org Machine Learning | Jul-27-2024

As data plays an increasingly pivotal role in decision-making, the emergence of data markets underscores the growing importance of data valuation. Within the machine learning landscape, Data Shapley stands out as a widely embraced method for data valuation. However, a limitation of Data Shapley is its assumption of a fixed dataset, contrasting with the dynamic nature of real-world applications where data constantly evolves and expands. This paper establishes the relationship between Data Shapley and infinite-order U-statistics and addresses this limitation by quantifying the uncertainty of Data Shapley with changes in data distribution from the perspective of U-statistics. We make statistical inferences on data valuation to obtain confidence intervals for the estimations. We construct two different algorithms to estimate this uncertainty and provide recommendations for their applicable situations. We also conduct a series of experiments on various datasets to verify asymptotic normality and propose a practical trading scenario enabled by this method.
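The idea of attaching a confidence interval to a Data Shapley estimate can be sketched with plain permutation sampling and a normal approximation. This is a generic Monte Carlo estimator, not the U-statistic-based algorithms the paper proposes, and it only captures sampling noise over permutations of a fixed dataset; the toy utility and all names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def mc_shapley_with_ci(players, utility, target, num_perms=2000, z=1.96, seed=0):
    """Permutation-sampling Shapley estimate for `target` with a normal-approximation CI."""
    rng = random.Random(seed)
    order = list(players)
    samples = []
    for _ in range(num_perms):
        rng.shuffle(order)
        # Points that appear before the target in this random ordering.
        before = frozenset(order[:order.index(target)])
        samples.append(utility(before | {target}) - utility(before))
    estimate = mean(samples)
    half_width = z * stdev(samples) / sqrt(num_perms)
    return estimate, (estimate - half_width, estimate + half_width)


def toy_utility(subset):
    """Hypothetical saturating utility: performance stops improving after two points."""
    return min(len(subset), 2) / 2


players = ["p0", "p1", "p2", "p3"]
estimate, (low, high) = mc_shapley_with_ci(players, toy_utility, "p0")
print(f"estimate={estimate:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Because the four toy points are interchangeable and the full dataset has utility 1, the exact Shapley value of each point is 0.25, so the estimate should land close to that value.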

confidence interval, data shapley, dataset, (15 more...)

arXiv.org Machine Learning

2407.19373

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
North America > United States > Pennsylvania (0.04)
Europe > France (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.87)

Industry: Banking & Finance (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Game Theory (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)


Data Shapley in One Training Run

Wang, Jiachen T., Mittal, Prateek, Song, Dawn, Jia, Ruoxi

arXiv.org Machine Learning | Jun-29-2024

Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any model produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.

data shapley, in-run data shapley, val, (15 more...)

arXiv.org Machine Learning

2406.11011

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.04)
North America > United States > Virginia (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.46)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)


Mitigating federated learning contribution allocation instability through randomized aggregation

Geimer, Arno, Fiz, Beltran, State, Radu

arXiv.org Artificial Intelligence | May-13-2024

Federated learning (FL) is a novel collaborative machine learning framework designed to preserve privacy while enabling the creation of robust models. This paradigm addresses a growing need for data security by allowing multiple participants to contribute to a model without exposing their individual datasets. A pivotal issue within this framework, however, concerns the fair and accurate attribution of contributions from various participants to the creation of the joint global model. Incorrect contribution distribution can erode trust among participants, result in inequitable compensation, and ultimately diminish the willingness of parties to engage or actively contribute to the federation. While several methods for remunerating participants have been proposed, little attention has been given to the stability of these methods when evaluating contributions, which is critical to ensure the long-term viability and fairness of FL systems. In this paper, we analyse this stability through the calculation of contributions by gradient-based model reconstruction techniques with Shapley values. Our investigation reveals that Shapley values fail to reflect baseline contributions, especially when employing different aggregation techniques. To address this issue, we extend established aggregation techniques by introducing FedRandom, which is designed to sample contributions in a more equitable and distributed manner. We demonstrate that this approach not only serves as a viable aggregation technique but also significantly improves the accuracy of contribution assessment compared to traditional methods. Our results suggest that FedRandom enhances the overall fairness and stability of the federated learning system, making it a superior choice for federations with a limited number of participants.

aggregation strategy, contribution, participant, (12 more...)

arXiv.org Artificial Intelligence

2405.08044

Country: Europe > France (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.88)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


Data Valuation with Gradient Similarity

Evans, Nathaniel J., Mills, Gordon B., Wu, Guanming, Song, Xubo, McWeeney, Shannon

arXiv.org Machine Learning | May-13-2024

High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data Valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient descent learning algorithm, scales well to large datasets, and performs comparably or better than baseline valuation methods for tasks such as corrupted label discovery and noise quantification. We evaluate the DVGS method on tabular, image and RNA expression datasets to show the effectiveness of the method across domains. Our approach has the ability to rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.
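One plausible reading of valuing data by gradient similarity is sketched below: each training point is scored by the cosine similarity between its loss gradient and the mean gradient of a trusted validation set, evaluated at a single parameter vector. This is a simplified illustration and not the authors' DVGS implementation; the linear model, the toy data, and every function name are assumptions.

```python
from math import sqrt


def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for a linear model."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def gradient_similarity_values(w, train, valid):
    """Score each training point by alignment with the mean validation gradient."""
    valid_grads = [grad_squared_loss(w, x, y) for x, y in valid]
    mean_valid = [sum(g[j] for g in valid_grads) / len(valid_grads)
                  for j in range(len(w))]
    return [cosine(grad_squared_loss(w, x, y), mean_valid) for x, y in train]


# Toy data from y = 2 * x; the last training label is deliberately corrupted.
w = [0.0]
train = [([1.0], 2.0), ([2.0], 4.0), ([1.0], -2.0)]
valid = [([1.0], 2.0), ([3.0], 6.0)]

scores = gradient_similarity_values(w, train, valid)
print(scores)  # [1.0, 1.0, -1.0]
```

The corrupted point pulls the parameters in the opposite direction from the validation data and therefore receives a negative score, which is the kind of signal that would flag it as low-value.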

data valuation, dataset, gradient, (13 more...)

arXiv.org Machine Learning

2405.08217

Country:

North America > United States > Oregon > Multnomah County > Portland (0.14)
North America > United States > Ohio > Lucas County > Oregon (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
