Towards Data Governance of Frontier AI Models

Hausenloy, Jason, McClements, Duncan, Thakur, Madhavendra

arXiv.org Artificial Intelligence

Data is essential to train and fine-tune today's frontier artificial intelligence (AI) models and to develop future ones. To date, academic, legal, and regulatory work has primarily addressed how data can directly harm consumers and creators, such as through privacy breaches, copyright infringements, and bias and discrimination. Our work, instead, focuses on the comparatively neglected question of how data can enable new governance capacities for frontier AI models. This approach for "frontier data governance" opens up new avenues for monitoring and mitigating risks from advanced AI models, particularly as they scale and acquire specific dangerous capabilities. Still, frontier data governance faces challenges that stem from the fundamental properties of data itself: data is non-rival, often non-excludable, easily replicable, and increasingly synthesizable. Despite these inherent difficulties, we propose a set of policy mechanisms targeting key actors along the data supply chain, including data producers, aggregators, model developers, and data vendors. We provide a brief overview of 15 governance mechanisms, of which we centrally introduce five, underexplored policy recommendations. These include developing canary tokens to detect unauthorized use for producers; (automated) data filtering to remove malicious content for pre-training and post-training datasets; mandatory dataset reporting requirements for developers and vendors; improved security for datasets and data generation algorithms; and know-your-customer requirements for vendors. By considering data not just as a source of potential harm, but as a critical governance lever, this work aims to equip policymakers with a new tool for the governance and regulation of frontier AI models.
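The first of the abstract's recommendations, canary tokens that let producers detect unauthorized use of their data, can be sketched as a toy workflow. Everything below (the function names, the tag format, the `producer42` identifier) is illustrative, assumed for this sketch rather than taken from the paper: the idea is simply that a high-entropy string embedded in a producer's documents should never appear in a model's output unless the model was trained on those documents.

```python
import hashlib
import secrets

def make_canary(owner_id: str) -> str:
    # A unique, high-entropy string that would never occur in natural text;
    # seeing it verbatim in a model's output is strong evidence the model
    # ingested this owner's data.
    nonce = secrets.token_hex(8)
    tag = hashlib.sha256(f"{owner_id}:{nonce}".encode()).hexdigest()[:16]
    return f"CANARY-{owner_id}-{tag}"

def embed_canary(document: str, canary: str) -> str:
    # Append the canary unobtrusively; a real scheme might hide it in
    # markup, metadata, or whitespace instead.
    return f"{document}\n<!-- {canary} -->"

def detects_unauthorized_use(model_output: str, canary: str) -> bool:
    # The producer's check: does the model reproduce the canary?
    return canary in model_output

canary = make_canary("producer42")
doc = embed_canary("Some proprietary article text.", canary)
```

In practice a producer would keep a registry of issued canaries and probe deployed models with prompts likely to elicit memorized training text.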


Data Distribution Valuation

Xu, Xinyi, Wang, Shuaiqi, Foo, Chuan-Sheng, Low, Bryan Kian Hsiang, Fanti, Giulia

arXiv.org Artificial Intelligence

Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a discrete dataset. However, in many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled. For example, consider a buyer trying to evaluate whether to purchase data from different vendors. The buyer may observe (and compare) only a small preview sample from each vendor, to decide which vendor's data distribution is most useful to the buyer and purchase. The core question is how should we compare the values of data distributions from their samples? Under a Huber characterization of the data heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
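The buyer's comparison described above can be sketched with a standard unbiased MMD estimator. This is a minimal illustration of the general MMD-from-samples idea, not the paper's method: the RBF kernel, the bandwidth `gamma=1.0`, and the synthetic Gaussian "vendor previews" are all assumptions made for the sketch.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2_unbiased(x, y, gamma=1.0):
    # Unbiased estimate of squared MMD between the distributions
    # underlying samples x and y (diagonal terms excluded).
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, gamma)
    kyy = rbf_kernel(y, y, gamma)
    kxy = rbf_kernel(x, y, gamma)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 2))  # buyer's target distribution
vendor_a = rng.normal(0.0, 1.0, size=(100, 2))   # preview matching the reference
vendor_b = rng.normal(3.0, 1.0, size=(100, 2))   # preview from a shifted distribution
```

Here the buyer would prefer the vendor whose preview sample yields the smaller MMD estimate against a reference sample of the distribution they care about.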


The minimum viable data set

#artificialintelligence

In the context of AI, it is often claimed that enormous amounts of data are required before you can get started at all: very complex models have to be built, and a project's success is attended by many unpredictable risks. As a general rule, however, this is wrong. This article is about giving you a perspective on how to handle situations of data scarcity and the options worth considering in that context. Of course, there are complex projects that place extreme demands on the amount of data needed to achieve effective results, but this usually reflects poor planning or a deliberately high appetite for experimentation.


starmine AI On-Demand Datasets for Data Vendors, DaaS Providers and Institutions

#artificialintelligence

Starmine is a robust and highly scalable platform for constructing, trading, and exchanging advanced, algorithmically generated on-demand datasets for Machine Learning (ML) and Artificial Intelligence (AI) efforts. Datasets remain at the core of most advances in ML and AI. A dataset is typically made up of rows and columns, similar to an organized matrix or spreadsheet. Specifically, columns contain features along with continuously valued attributes or scores. These features and their scored attributes are stored as 'supercolumns' and can be automatically engineered, traded, or exchanged by machines without human intervention.


How to Better Classify Coachella With Machine Learning (Part 2)

#artificialintelligence

I often get asked how to start with Machine Learning. I consider myself a maker, and I truly believe experience is key: solving actual real-world problems is what unlocks the mysteries. Once you have your first success at building and understanding a solution that solved your problem, you can dig deeper and refine the building blocks. The problem we faced (see Part 1) was that we had a multitude of data vendors providing us with event information.