AITopics | knn shapley

Collaborating Authors

knn shapley

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Targeted synthetic data generation for tabular data via hardness characterization

Ferracci, Tommaso, Goldmann, Leonie Tabea, Hinel, Anton, Passino, Francesco Sanna

arXiv.org Machine LearningOct-1-2024

Synthetic data generation has been proven successful in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a novel augmentation pipeline that generates only highvalue training points based on hardness characterization. We first demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterisation tasks, while offering significant theoretical and computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large scale credit default prediction task. In particular, our approach improves the quality of out-of-sample predictions and it is computationally more efficient compared to non-targeted methods. Training complex machine learning models requires large amounts of data, but in real-world applications data may be of poor quality, insufficient in amount, or subject to privacy, safety, and regulatory limitations. Such challenges have sparked an interest in synthetic data generation (SDG), representing the practice of using available data to generate realistic synthetic samples (Lu et al., 2024). In this work, we argue that, when the objective is to use synthetic data to make an existing machine learning model better generalize to unseen data, augmenting only the hardest training points is more effective than augmenting the entire training dataset.

dataset, knn shapley, shapley, (14 more...)

arXiv.org Machine Learning

2410.00759

Country: Europe > France (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

Kwon, Yongchan, Zou, James

arXiv.org Artificial IntelligenceJun-1-2023

Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.

artificial intelligence, dataset, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2304.07718

Country:

North America > United States > New York (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.67)

Add feedback