DataPerf: Benchmarks for Data-Centric AI Development
Mazumder, Mark
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks.
Designing Reputation Systems for Manufacturing Data Trading Markets: A Multi-Agent Evaluation with Q-Learning and IRL-Estimated Utilities
Yamamoto, Kenta, Hayashi, Teruaki
Recent advances in machine learning and big data analytics have intensified the demand for high-quality cross-domain datasets and accelerated the growth of data trading across organizations. As data become increasingly recognized as an economic asset, data marketplaces have emerged as a key infrastructure for data-driven innovation. However, unlike mature product or service markets, data-trading environments remain nascent and suffer from pronounced information asymmetry. Buyers cannot verify the content or quality of data before purchasing, making trust and quality assurance central challenges. To address these issues, this study develops a multi-agent data-market simulator that models participant behavior and evaluates institutional mechanisms for trust formation. Focusing on the manufacturing sector, where initiatives such as GAIA-X and Catena-X are advancing, the simulator integrates reinforcement learning (RL) for adaptive agent behavior and inverse reinforcement learning (IRL) to estimate utility functions from empirical behavioral data. Using the simulator, we examine the market-level effects of five representative reputation systems (Time-decay, Bayesian-beta, PageRank, PowerTrust, and PeerTrust) and find that PeerTrust achieves the strongest alignment between data price and quality while preventing monopolistic dominance. Building on these results, we develop a hybrid reputation mechanism that integrates the strengths of existing systems to achieve improved price-quality consistency and overall market stability. This study extends simulation-based data-market analysis by incorporating trust and reputation as endogenous mechanisms, offering methodological and institutional insights into the design of reliable and efficient data ecosystems.
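The reputation schemes compared in this study aggregate feedback in different ways. As a rough sketch (assuming ratings in [0, 1] and binary feedback counts; not the paper's implementation), the two simplest schemes, Time-decay and Bayesian-beta, can be written as:

```python
import math

def time_decay_reputation(ratings, half_life=10.0):
    """Time-decay scheme: exponentially down-weight older ratings.
    `ratings` is ordered oldest-first; each rating lies in [0, 1]."""
    if not ratings:
        return 0.5  # neutral score when there is no history
    n = len(ratings)
    weights = [math.exp(-math.log(2) * (n - 1 - i) / half_life)
               for i in range(n)]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

def bayesian_beta_reputation(positives, negatives, alpha=1.0, beta=1.0):
    """Bayesian-beta scheme: posterior mean of a Beta(alpha, beta)
    prior updated with counts of positive/negative transactions."""
    return (positives + alpha) / (positives + negatives + alpha + beta)
```

With a uniform Beta(1, 1) prior, a seller with 9 positive and 1 negative transactions scores (9+1)/(10+2) ≈ 0.83, while a newcomer starts at the neutral 0.5.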
LLM-based Multi-Agent System for Simulating Strategic and Goal-Oriented Data Marketplaces
Sashihara, Jun, Fujita, Yukihisa, Nakamura, Kota, Kuwahara, Masahiro, Hayashi, Teruaki
Data marketplaces, which mediate the purchase and exchange of data from third parties, have attracted growing attention for reducing the cost and effort of data collection while enabling the trading of diverse datasets. However, a systematic understanding of the interactions between market participants, data, and regulations remains limited. To address this gap, we propose a Large Language Model-based Multi-Agent System (LLM-MAS) for data marketplaces. In our framework, buyer and seller agents powered by LLMs operate with explicit objectives and autonomously perform strategic actions such as planning, searching, purchasing, pricing, and updating data. These agents can reason about market dynamics, forecast future demand, and adapt their strategies accordingly. Unlike conventional model-based simulations, which are typically constrained to predefined rules, LLM-MAS supports broader and more adaptive behavior selection through natural language reasoning. We evaluated the framework via simulation experiments using three distribution-based metrics: (1) the number of purchases per dataset, (2) the number of purchases per buyer, and (3) the number of repeated purchases of the same dataset. The results demonstrate that LLM-MAS reproduces trading patterns observed in real data marketplaces more faithfully than traditional approaches, and further captures the emergence and evolution of market trends.

Data have emerged as a tradable economic resource, and data marketplaces that mediate the purchase and exchange of datasets from third parties have rapidly expanded [1]. These marketplaces streamline data collection that previously required substantial cost and effort, while also providing organizations and researchers with access to diverse, high-quality datasets. As a result, they are increasingly recognized as critical infrastructures that accelerate innovation based on data that were once closed within individual organizations [2].
Despite this progress, our understanding of how interactions among market participants, data, and regulations shape market dynamics remains limited. Smooth and efficient data transactions require well-designed and robust data marketplaces [3].
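The three distribution-based metrics used in the evaluation can be computed directly from a transaction log. A minimal sketch, assuming the log is a list of (buyer_id, dataset_id) records (this field layout is an assumption, not the paper's data format):

```python
from collections import Counter

def marketplace_metrics(purchases):
    """purchases: list of (buyer_id, dataset_id) transaction records."""
    per_dataset = Counter(d for _, d in purchases)   # metric (1)
    per_buyer = Counter(b for b, _ in purchases)     # metric (2)
    pair_counts = Counter(purchases)
    # metric (3): purchases beyond the first of the same dataset by the same buyer
    repeated = sum(c - 1 for c in pair_counts.values() if c > 1)
    return dict(per_dataset), dict(per_buyer), repeated
```

Comparing the simulated and real-market histograms of these three quantities is then a distribution-matching exercise.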
A Cramér-von Mises Approach to Incentivizing Truthful Data Sharing
Clinton, Alex, Zeng, Thomas, Chen, Yiding, Zhu, Xiaojin, Kandasamy, Kirthevasan
Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent's data against others' to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g. Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel, two-sample test inspired by the Cramér-von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in three canonical data sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.
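The classical two-sample Cramér-von Mises statistic that inspires the mechanism sums the squared difference between the two empirical CDFs over the pooled sample. A minimal sketch of the statistic itself (the paper's reward mechanism builds on it but is not reproduced here):

```python
import bisect

def ecdf(sorted_sample, x):
    """Empirical CDF of a pre-sorted sample, evaluated at x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def cramer_von_mises(xs, ys):
    """Two-sample Cramér-von Mises statistic:
    T = nm/(n+m)^2 * sum over pooled points z of (F_n(z) - G_m(z))^2."""
    xs_s, ys_s = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    total = sum((ecdf(xs_s, z) - ecdf(ys_s, z)) ** 2 for z in xs_s + ys_s)
    return n * m / (n + m) ** 2 * total
```

Identical samples score 0 and the statistic grows as the two distributions separate, which is what makes fabricated or off-distribution data costly when rewards decrease in the measured discrepancy.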
Learn then Decide: A Learning Approach for Designing Data Marketplaces
Gao, Yingqi, Zhou, Jin, Zhou, Hua, Chen, Yong, Dai, Xiaowu
As data marketplaces become increasingly central to the digital economy, it is crucial to design efficient pricing mechanisms that optimize revenue while ensuring fair and adaptive pricing. We introduce the Maximum Auction-to-Posted Price (MAPP) mechanism, a novel two-stage approach that first estimates the bidders' value distribution through auctions and then determines the optimal posted price based on the learned distribution. We establish that MAPP is individually rational and incentive-compatible, ensuring truthful bidding while balancing revenue maximization with minimal price discrimination. MAPP achieves a regret of $O_p(n^{-1})$ when incorporating historical bid data, where $n$ is the number of bids in the current round. It outperforms existing methods while imposing weaker distributional assumptions. For sequential dataset sales over $T$ rounds, we propose an online MAPP mechanism that dynamically adjusts pricing across datasets with varying value distributions. Our approach achieves no-regret learning, with the average cumulative regret converging at a rate of $O_p(T^{-1/2}(\log T)^2)$. We validate the effectiveness of MAPP through simulations and real-world data from the FCC AWS-3 spectrum auction.
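The second stage of an auction-to-posted-price mechanism can be illustrated by treating observed bids as an empirical value distribution and searching for the revenue-maximizing posted price. This is a generic sketch of the idea, not the paper's MAPP estimator:

```python
def optimal_posted_price(bids):
    """Pick the posted price p maximizing empirical revenue
    p * Pr(value >= p), where the probability is estimated from bids
    collected in the auction stage. Candidate prices are the bids
    themselves (the empirical revenue curve only changes there)."""
    desc = sorted(bids, reverse=True)
    n = len(desc)
    best_price, best_revenue = desc[0], desc[0] / n
    for k, p in enumerate(desc):
        revenue = p * (k + 1) / n  # (k+1)/n bidders would accept price p
        if revenue > best_revenue:
            best_price, best_revenue = p, revenue
    return best_price, best_revenue
```

Because one uniform price is posted to everyone, price discrimination is minimized; the statistical work lies in how well the auction-stage bids estimate the true value distribution.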
Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace
Jahani-Nezhad, Tayyebeh, Moradi, Parsa, Maddah-Ali, Mohammad Ali, Caire, Giuseppe
Evaluating datasets in data marketplaces, where buyers aim to purchase valuable data, is a critical challenge. In this paper, we introduce PriArTa, a task-agnostic data valuation method that computes the distance between the distribution of the buyer's existing dataset and the seller's dataset, allowing the buyer to determine how effectively the new data can enhance its dataset. PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to each seller's entire dataset. Instead, the buyer requests that sellers perform specific preprocessing on their data and send back the results. Using this information and a scoring metric, the buyer can evaluate the dataset. The preprocessing is designed to allow the buyer to compute the score while preserving the privacy of each seller's dataset, mitigating the risk of information leakage before the purchase. A key feature of PriArTa is its robustness to common data transformations, ensuring consistent value assessment and reducing the risk of purchasing redundant data. The effectiveness of PriArTa is demonstrated through experiments on real-world image datasets, showing its ability to perform privacy-preserving, augmentation-robust data valuation in data marketplaces.
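The protocol shape (buyer-specified preprocessing, seller-side summaries, buyer-side scoring) can be sketched generically. The random projection and per-coordinate means below are illustrative assumptions, not PriArTa's actual construction:

```python
import random

def random_projection(dim_in, dim_out, seed=0):
    """Buyer-specified preprocessing: a shared random projection matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1 / dim_out ** 0.5) for _ in range(dim_in)]
            for _ in range(dim_out)]

def project_and_summarize(data, proj):
    """Seller-side step (sketch): project each record, then return only
    per-coordinate means, so raw records never leave the seller."""
    projected = [[sum(w * x for w, x in zip(row, rec)) for row in proj]
                 for rec in data]
    k = len(proj)
    return [sum(p[i] for p in projected) / len(projected) for i in range(k)]

def score(buyer_summary, seller_summary):
    """Buyer-side distance between summaries (smaller = more similar)."""
    return sum((a - b) ** 2
               for a, b in zip(buyer_summary, seller_summary)) ** 0.5
```

Only low-dimensional summaries cross the trust boundary, which is the property the actual scheme formalizes with privacy and augmentation-robustness guarantees.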
Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation
Falahati, Ali, Amiri, Mohammad Mohammadi
With the emergence of data marketplaces, the demand for methods to assess the value of data has increased significantly. While numerous techniques have been proposed for this purpose, none have specifically addressed graphs as the main data modality. Graphs are widely used across various fields, ranging from chemical molecules to social networks. In this study, we break down graphs into two main components: structural and featural, and we focus on evaluating data without relying on specific task-related metrics, making it applicable in practical scenarios where validation requirements may be lacking. We introduce a novel framework called blind message passing, which aligns the seller's and buyer's graphs using a shared node permutation based on graph matching. This allows us to utilize the graph Wasserstein distance to quantify the differences in the structural distribution of graph datasets, called the structural disparities. We then consider featural aspects of buyers' and sellers' graphs for data valuation and capture their statistical similarities and differences, referred to as relevance and diversity, respectively. Our approach ensures that buyers and sellers remain unaware of each other's datasets. Our experiments on real datasets demonstrate the effectiveness of our approach in capturing the relevance, diversity, and structural disparities of seller data for buyers, particularly in graph-based data valuation scenarios.
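As a rough illustration of the featural side only, relevance can be proxied by the similarity of aggregate feature statistics. The cosine-of-mean-features measure below is a hypothetical stand-in for the paper's definition, and the structural (graph Wasserstein) component is omitted:

```python
import math

def mean_feature(vectors):
    """Coordinate-wise mean of a list of node-feature vectors."""
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def relevance(buyer_features, seller_features):
    """Cosine similarity between the mean feature vectors of the
    buyer's and seller's node sets (1.0 = identical direction)."""
    mb = mean_feature(buyer_features)
    ms = mean_feature(seller_features)
    dot = sum(a * b for a, b in zip(mb, ms))
    norm_b = math.sqrt(sum(a * a for a in mb))
    norm_s = math.sqrt(sum(b * b for b in ms))
    return dot / (norm_b * norm_s)
```

A real graph-valuation pipeline would compute such featural statistics only after aligning node orderings (the blind-message-passing step), so that neither side reveals its raw graph.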
Data Acquisition: A New Frontier in Data-centric AI
Chen, Lingjiao, Acun, Bilge, Ardalani, Newsha, Sun, Yifan, Kang, Feiyang, Lyu, Hanrui, Kwon, Yongchan, Jia, Ruoxi, Wu, Carole-Jean, Zaharia, Matei, Zou, James
Datasets, the cornerstone of modern machine learning (ML) systems, are increasingly sold and purchased for different ML pipelines [2]. Several data marketplaces have emerged to serve different stages of building ML-enhanced data applications. For example, NASDAQ Data Link [3] offers financial datasets cleaned and structured for model training, Amazon AWS Data Exchange [4] focuses on generic tabular datasets, and Databricks Marketplace [5] integrates raw datasets and ML pipelines to deliver insights. The data-as-a-service market was worth more than $30 billion and is expected to double in the next five years [6]. Yet even as data marketplaces expand, data acquisition for ML remains challenging, partly due to its ad-hoc nature: based on discussions with real-world users, data acquirers often need to first negotiate varying contracts with different data providers, then purchase multiple datasets in different formats, and finally filter out unnecessary data from the purchased datasets.
A Survey of Data Pricing for Data Marketplaces
Zhang, Mengxiao, Beltran, Fernando, Liu, Jiamou
A data marketplace is an online venue that brings data owners, data brokers, and data consumers together and facilitates commoditisation of data amongst them. Data pricing, as a key function of a data marketplace, demands quantifying the monetary value of data. A considerable number of studies on data pricing can be found in the literature. This paper comprehensively reviews the state of the art in data pricing to provide a general understanding of this emerging research area. Our key contribution lies in a new taxonomy of data pricing studies that unifies different attributes determining data prices. The basis of our framework categorises these studies by the kind of market structure, be it sell-side, buy-side, or two-sided. Then in a sell-side market, the studies are further divided by query type, which defines the way a data consumer accesses data, while in a buy-side market, the studies are divided according to privacy notion, which defines the way to quantify privacy of data owners. In a two-sided market, both privacy notion and query type are used as criteria. We systematically examine the studies falling into each category in our taxonomy. Lastly, we discuss gaps within the existing research and define future research directions.
AI Trends: How Will AI Impact You?
As we approach the end of the first quarter, what does the future hold for AI? We already know that artificial intelligence (AI) has an impact on every industry around the globe. These are the areas where AI will matter most in our lives in 2022. AI is a data-hungry beast, and it has created new avenues for data collection that have increased the value of data as an asset to businesses and governments. There are also initiatives to educate the general public about how data can be used.