
Collaborating Authors

 Kwon, Yongchan


Understanding Impact of Human Feedback via Influence Functions

arXiv.org Artificial Intelligence

In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. In our experiments, we demonstrate two key applications of influence functions: (1) detecting common forms of labeler bias in human feedback datasets and (2) guiding labelers to refine their strategies to align more closely with expert feedback. By quantifying the impact of human feedback on reward models, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
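
To make the core quantity concrete, below is a minimal sketch of an influence score for preference data, assuming a simple linear Bradley-Terry reward model rather than the LLM-based reward models and the compute-efficient approximation used in the paper; all function and variable names are illustrative.

```python
# Hedged sketch: influence of each preference pair on a validation loss under a
# linear Bradley-Terry reward model r(x) = theta @ x. A positive score suggests
# that upweighting the pair would increase the validation loss, i.e., the pair
# may reflect noisy or biased feedback. Not the paper's implementation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_grad(theta, x_win, x_lose):
    """Gradient of the pairwise loss -log sigmoid(r(x_win) - r(x_lose))."""
    d = x_win - x_lose
    return -(1.0 - sigmoid(theta @ d)) * d

def influence_scores(theta, train_pairs, val_pairs, damping=1e-3):
    dim = theta.shape[0]
    H = damping * np.eye(dim)                      # damped Hessian of the training loss
    for x_w, x_l in train_pairs:
        d = x_w - x_l
        p = sigmoid(theta @ d)
        H += p * (1.0 - p) * np.outer(d, d)
    g_val = np.mean([pair_grad(theta, x_w, x_l) for x_w, x_l in val_pairs], axis=0)
    h_inv_g = np.linalg.solve(H, g_val)            # inverse-Hessian-vector product
    return np.array([-pair_grad(theta, x_w, x_l) @ h_inv_g for x_w, x_l in train_pairs])

# Toy usage with random 8-dimensional features standing in for response embeddings.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(50)]
scores = influence_scores(np.zeros(8), pairs, val_pairs=pairs[:10])
```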


Distributionally Robust Instrumental Variables Estimation

arXiv.org Machine Learning

Instrumental variables (IV) estimation, also known as IV regression, is a fundamental method in econometrics and statistics to infer causal relationships in observational data with unobserved confounding. It leverages access to additional variables (instruments) that affect the outcome exogenously and exclusively through the endogenous regressor to yield consistent causal estimates, even when the standard ordinary least squares (OLS) estimator is biased by unobserved confounding (Imbens and Angrist, 1994; Angrist et al., 1996; Imbens and Rubin, 2015). Over the years, IV estimation has become an indispensable tool for causal inference in empirical work in economics (Card and Krueger, 1994), as well as in the study of genetic and epidemiological data (Davey Smith and Ebrahim, 2003). Despite the widespread use of IV in empirical and applied work, it has important limitations and challenges, such as invalid instruments (Sargan, 1958; Murray, 2006), weak instruments (Staiger and Stock, 1997), non-compliance (Imbens and Angrist, 1994), and heteroskedasticity, especially in settings with weak instruments or highly leveraged datasets (Andrews et al., 2019; Young, 2022). These issues could significantly impact the validity and quality of estimation and inference using instrumental variables (Jiang, 2017). Many works have since been devoted to assessing and addressing these issues, such as statistical tests (Hansen, 1982; Stock and Yogo, 2002), sensitivity analysis (Rosenbaum and Rubin, 1983; Bonhomme and Weidner, 2022), and additional assumptions or structures on the data generating process (Kolesár et al., 2015; Kang et al., 2016; Guo et al., 2018b). Recently, an emerging line of work has highlighted interesting connections between causality and the concepts of invariance and robustness (Peters et al., 2016; Meinshausen, 2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022; Fan et al., 2024). Their guiding philosophy is that causal properties can be viewed as robustness against changes across heterogeneous environments, represented by a set P of data distributions.
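
For readers less familiar with IV regression, the following is a minimal two-stage least squares (2SLS) sketch of the baseline estimator the abstract builds on, not the distributionally robust variant proposed in the paper; the simulated data and variable names are illustrative.

```python
# Textbook 2SLS on simulated data with an unobserved confounder u: OLS is
# biased, while the IV estimate recovers the true causal coefficient (2.0).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument: affects y only through x
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true causal effect of x on y is 2.0

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]          # biased by u

# Stage 1: project X onto the instrument space; Stage 2: regress y on the projection.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_iv = np.linalg.lstsq(X_hat, y, rcond=None)[0]

print("OLS slope:", beta_ols[1], "IV slope:", beta_iv[1])
```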


A Survey on Data Markets

arXiv.org Artificial Intelligence

Data is the new oil of the 21st century. The growing trend of trading data for greater welfare has led to the emergence of data markets. A data market is any mechanism whereby the exchange of data products, including datasets and data derivatives, takes place as a result of data buyers and data sellers being in contact with one another, either directly or through mediating agents. It serves as a coordinating mechanism through which several functions, most importantly the pricing and distribution of data, interact so that the value of data is fully exploited and enhanced. In this article, we present a comprehensive survey of this important and emerging direction, covering data search, data productization, data transactions, data pricing, revenue allocation, and issues of privacy, security, and trust. We also investigate the government policies and industry status of data markets across different countries and domains. Finally, we identify unresolved challenges and discuss possible future directions for the development of data markets.


Group Shapley Value and Counterfactual Simulations in a Structural Model

arXiv.org Artificial Intelligence

We propose a variant of the Shapley value, the group Shapley value, to interpret counterfactual simulations in structural economic models by quantifying the importance of different components. Our framework compares two sets of parameters, partitioned into multiple groups, and applying group Shapley value decomposition yields unique additive contributions to the changes between these sets. The relative contributions sum to one, enabling us to generate an importance table that is as easily interpretable as a regression table. The group Shapley value can be characterized as the solution to a constrained weighted least squares problem. Using this property, we develop robust decomposition methods to address scenarios where inputs for the group Shapley value are missing. We first apply our methodology to a simple Roy model and then illustrate its usefulness by revisiting two published papers.
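
As a concrete illustration of the decomposition, here is a small sketch that computes group Shapley values by enumerating coalitions of parameter groups; the simulate() function, grouping, and parameter values are toy stand-ins for a structural model's counterfactual simulation.

```python
# Hedged sketch: attribute the change in a simulated outcome between an "old"
# and a "new" parameter vector to groups of parameters via the group Shapley
# value. Contributions sum to the total change, so dividing by it yields
# relative contributions that sum to one.
from itertools import combinations
from math import factorial

def group_shapley(simulate, theta_old, theta_new, groups):
    """groups: list of index lists partitioning the parameter vector."""
    G = len(groups)

    def value(coalition):
        # Groups in the coalition switch to new parameters; the rest stay old.
        theta = list(theta_old)
        for g in coalition:
            for i in groups[g]:
                theta[i] = theta_new[i]
        return simulate(theta)

    shapley = [0.0] * G
    for g in range(G):
        others = [h for h in range(G) if h != g]
        for size in range(G):
            weight = factorial(size) * factorial(G - size - 1) / factorial(G)
            for coalition in combinations(others, size):
                shapley[g] += weight * (value(coalition + (g,)) - value(coalition))
    return shapley

# Toy counterfactual: two parameter groups, outcome = t0 * t1 + t2.
simulate = lambda t: t[0] * t[1] + t[2]
contrib = group_shapley(simulate, theta_old=[1.0, 1.0, 0.0],
                        theta_new=[2.0, 3.0, 1.0], groups=[[0, 1], [2]])
total = simulate([2.0, 3.0, 1.0]) - simulate([1.0, 1.0, 0.0])
shares = [c / total for c in contrib]   # the "importance table" entries; sum to one
```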


Truthful Dataset Valuation by Pointwise Mutual Information

arXiv.org Artificial Intelligence

A common way to evaluate a dataset in ML involves training a model on this dataset and assessing the model's performance on a test set. However, this approach has two issues: (1) it may incentivize undesirable data manipulation in data marketplaces, as the self-interested data providers seek to modify the dataset to maximize their evaluation scores; (2) it may select datasets that overfit to potentially small test sets. We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data. Any manipulation of the data, including but not limited to data duplication, adding random data, data removal, or re-weighting data from different groups, cannot increase their expected score. Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset. However, computing the PMI of two datasets is challenging. We introduce a novel PMI measuring method that greatly improves tractability within Bayesian machine learning contexts. This is accomplished through a new characterization of PMI that relies solely on the posterior probabilities of the model parameter at an arbitrarily selected value. Finally, we support our theoretical results with simulations and further test the effectiveness of our data valuation method in identifying the top datasets among multiple data providers. Interestingly, our method outperforms the standard approach of selecting datasets based on the trained model's test performance, suggesting that our truthful valuation score can also be more robust to overfitting.
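
The posterior-based characterization of PMI can be sketched in a conjugate Gaussian toy model, shown below; this is an illustration of the identity under the assumption that the two datasets are i.i.d. given the parameter, not the paper's implementation.

```python
# Hedged sketch: PMI(D_test, D_eval) computed from posterior densities at an
# arbitrary point theta*, using
#   PMI = log p(theta*|D_test) + log p(theta*|D_eval)
#         - log p(theta*) - log p(theta*|both datasets),
# in a toy model where data are N(theta, 1) with a N(0, 1) prior on theta.
import numpy as np
from scipy.stats import norm

def log_posterior(theta_star, data, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
    """Log density of the Normal posterior over theta, evaluated at theta_star."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    return norm.logpdf(theta_star, loc=post_mean, scale=np.sqrt(post_var))

def pmi_score(test_data, eval_data, theta_star=0.0):
    joint = np.concatenate([test_data, eval_data])
    return (log_posterior(theta_star, test_data)
            + log_posterior(theta_star, eval_data)
            - norm.logpdf(theta_star, loc=0.0, scale=1.0)    # log prior density
            - log_posterior(theta_star, joint))

rng = np.random.default_rng(0)
test = rng.normal(1.0, 1.0, size=200)
matched = rng.normal(1.0, 1.0, size=200)     # same distribution as the test set
shifted = rng.normal(-2.0, 1.0, size=200)    # mismatched data
# The matched dataset should typically receive the higher PMI score.
print(pmi_score(test, matched), pmi_score(test, shifted))
```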


Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

arXiv.org Machine Learning

Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.
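
For context, the sketch below shows the standard Monte Carlo (permutation-sampling) estimator of Data Shapley applied to a modular utility, the regime in which the abstract says Data Shapley selects data optimally; the utility and scores are illustrative, not the paper's experiments.

```python
# Hedged sketch: permutation-based Data Shapley estimates, then top-k selection.
# With a modular utility (a sum of fixed per-point scores), the Shapley ranking
# simply recovers the per-point scores, matching the abstract's insight.
import numpy as np

def data_shapley(utility, n_points, n_perms=200, seed=0):
    """Estimate phi_i as the average marginal gain of point i over random orderings."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)
    for _ in range(n_perms):
        perm = rng.permutation(n_points)
        subset, prev_u = [], utility([])
        for i in perm:
            subset.append(i)
            u = utility(subset)
            values[i] += u - prev_u
            prev_u = u
    return values / n_perms

point_scores = np.array([0.3, -0.1, 0.5, 0.0, 0.2])      # illustrative per-point gains
utility = lambda subset: float(np.sum(point_scores[np.asarray(subset, dtype=int)]))
phi = data_shapley(utility, n_points=len(point_scores))
top_k = np.argsort(phi)[::-1][:3]                        # data selection: keep the best points
```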


Data Acquisition: A New Frontier in Data-centric AI

arXiv.org Artificial Intelligence

Datasets, the cornerstone of modern machine learning (ML) systems, have been increasingly sold and purchased for different ML pipelines [2]. Several data marketplaces have emerged to serve different stages of building ML-enhanced data applications. For example, NASDAQ Data Link [3] offers financial datasets cleaned and structured for model training, Amazon AWS Data Exchange [4] focuses on generic tabular datasets, and Databricks Marketplace [5] integrates raw datasets and ML pipelines to deliver insights. The data-as-a-service market was worth more than $30 billion and is expected to double in size in the next five years [6]. While data marketplaces are expanding rapidly, data acquisition for ML unfortunately remains challenging, partially due to its ad-hoc nature: based on discussions with real-world users, data acquirers often need to first negotiate varying contracts with different data providers, then purchase multiple datasets in different formats, and finally filter out unnecessary data from the purchased datasets.


OpenDataVal: a Unified Benchmark for Data Valuation

arXiv.org Machine Learning

Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation has been lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model in scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, and that an appropriate algorithm should be chosen according to a user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. Furthermore, we provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
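
As a flavor of the kind of downstream task used to judge data values, here is a generic sketch of mislabeled-data detection with a simple leave-one-out value and a scikit-learn model; it is not the OpenDataVal API, and all names are illustrative.

```python
# Hedged sketch: flip some training labels, value each point by its leave-one-out
# contribution to validation accuracy, and check whether the lowest-valued points
# coincide with the flipped ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_train, y_train, X_val, y_val = X[:80], y[:80].copy(), X[80:], y[80:]

flipped = rng.choice(80, size=12, replace=False)
y_train[flipped] = 1 - y_train[flipped]                  # inject label noise

def utility(idx):
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

all_idx = np.arange(80)
full_u = utility(all_idx)
loo_values = np.array([full_u - utility(np.delete(all_idx, i)) for i in range(80)])

suspects = np.argsort(loo_values)[:12]                   # lowest-valued points
recall = len(set(suspects) & set(flipped)) / 12          # fraction of flips recovered
```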


DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

arXiv.org Machine Learning

Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf identifies the most influential fine-tuning examples more effectively than other approximate influence scores. Moreover, it can help identify which data points are mislabeled.
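
The heart of DataInf is a closed-form approximation to the inverse-Hessian-vector product; the sketch below reproduces that idea on plain numpy vectors (the paper applies it per layer to LoRA gradients), so treat it as a simplified illustration rather than the released implementation.

```python
# Hedged sketch: approximate (lambda*I + (1/n) sum_i g_i g_i^T)^{-1} v by averaging
# the exact rank-one inverses of (lambda*I + g_i g_i^T) via Sherman-Morrison,
# which needs only dot products, then score each training point against it.
import numpy as np

def datainf_ihvp(train_grads, v, damping=1e-2):
    """Closed-form, swapped-order approximation of the inverse-Hessian-vector product."""
    out = np.zeros_like(v)
    for g in train_grads:
        coef = (g @ v) / (damping + g @ g)      # Sherman-Morrison for lambda*I + g g^T
        out += (v - coef * g) / damping
    return out / len(train_grads)

def influence_scores(train_grads, val_grad, damping=1e-2):
    """Positive score: upweighting the point is estimated to raise the validation loss."""
    ihvp = datainf_ihvp(train_grads, val_grad, damping)
    return np.array([-(g @ ihvp) for g in train_grads])

# Toy usage with random vectors standing in for per-example gradients.
rng = np.random.default_rng(0)
grads = [rng.normal(size=16) for _ in range(100)]
scores = influence_scores(grads, val_grad=rng.normal(size=16))
```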


Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

arXiv.org Artificial Intelligence

Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally expensive because they require training a large number of models. As a result, they have been considered infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has a solid theoretical interpretation in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding sets of helpful (or harmful) data points, highlighting the potential of applying data values in real-world applications.
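
To make the out-of-bag idea concrete, here is a small sketch of a Data-OOB-style score using a bagged ensemble of decision trees; the per-point score here is out-of-bag prediction correctness, and the dataset and hyperparameters are illustrative rather than taken from the paper.

```python
# Hedged sketch: train trees on bootstrap samples, then value each point by how
# often the trees that did NOT see it predict its label correctly. Low-valued
# points are natural candidates for mislabeled or harmful data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
n, n_trees = len(X), 200

hits = np.zeros(n)
counts = np.zeros(n)
for _ in range(n_trees):
    boot = rng.integers(0, n, size=n)              # bootstrap sample indices
    oob = np.setdiff1d(np.arange(n), boot)         # points this tree never saw
    tree = DecisionTreeClassifier(max_depth=4).fit(X[boot], y[boot])
    hits[oob] += (tree.predict(X[oob]) == y[oob])
    counts[oob] += 1

data_oob = hits / np.maximum(counts, 1)            # per-point out-of-bag value
low_value = np.argsort(data_oob)[:20]              # likely mislabeled / harmful points
```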