data provenance
Blueprints of Trust: AI System Cards for End to End Transparency and Governance

Sidhpurwala, Huzaifa, Fox, Emily, Mollett, Garth, Gabarda, Florencio Cano, Zhukov, Roman

arXiv.org Artificial Intelligence

This paper introduces the Hazard-Aware System Card (HASC), a novel framework designed to enhance transparency and accountability in the development and deployment of AI systems. The HASC builds upon existing model card and system card concepts by integrating a comprehensive, dynamic record of an AI system's security and safety posture. The framework proposes a standardized system of identifiers, including a novel AI Safety Hazard (ASH) ID, to complement existing security identifiers like CVEs, allowing for clear and consistent communication of fixed flaws. By providing a single, accessible source of truth, the HASC empowers developers and stakeholders to make more informed decisions about AI system safety throughout its lifecycle. Finally, we compare our proposed AI system cards with the ISO/IEC 42001:2023 standard and discuss how the two can complement each other, providing greater transparency and accountability for AI systems.
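The abstract does not specify the HASC schema, but the idea of pairing a novel ASH ID with existing CVE identifiers in a single living record can be sketched as a minimal data structure. Everything below (the field names, the "ASH-2025-0001" ID format, the status values) is an illustrative assumption, not the paper's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class HazardEntry:
    """One entry in a hypothetical Hazard-Aware System Card."""
    ash_id: str          # AI Safety Hazard ID, e.g. "ASH-2025-0001" (illustrative format)
    related_cves: list   # complementary conventional security identifiers
    description: str
    status: str          # e.g. "open", "mitigated", "fixed" (assumed vocabulary)

@dataclass
class SystemCard:
    system_name: str
    hazards: list = field(default_factory=list)

    def open_hazards(self):
        # Anything not yet fixed remains part of the live safety posture.
        return [h for h in self.hazards if h.status != "fixed"]

card = SystemCard("example-llm-service")
card.hazards.append(HazardEntry("ASH-2025-0001", ["CVE-2025-12345"],
                                "prompt-injection bypass of content filter", "mitigated"))
card.hazards.append(HazardEntry("ASH-2025-0002", [],
                                "training-data leakage via memorization", "fixed"))
print(len(card.open_hazards()))  # 1
```

The point of the sketch is the pairing: one record type that carries both safety hazards (ASH IDs) and security flaws (CVEs), queryable as a single source of truth.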


Data Virtualization for Machine Learning

Khan, Saiful, Chakraborty, Joyraj, Beaucamp, Philip, Bhujel, Niraj, Chen, Min

arXiv.org Artificial Intelligence

Nowadays, machine learning (ML) teams have multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities and commonly takes months and sometimes years from initial data wrangling to model deployment. Organizationally, there is a large amount of intermediate data to be stored, processed, and maintained. Data virtualization becomes a critical technology in an infrastructure to serve ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.
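The abstract does not describe the service's API, but the core idea of data virtualization — workflows address datasets by logical name while the service resolves them to physical, versioned storage — can be illustrated with a toy catalog. The class, method names, and `s3://` URIs below are assumptions for illustration, not the paper's actual design:

```python
class DataVirtualizationService:
    """Toy registry mapping logical dataset names to versioned physical locations."""

    def __init__(self):
        self._catalog = {}

    def register(self, logical_name, uri, version=1):
        # Record a physical location for one version of a logical dataset.
        self._catalog.setdefault(logical_name, {})[version] = uri

    def resolve(self, logical_name, version=None):
        # Workflows ask for the logical name; storage details stay hidden.
        versions = self._catalog[logical_name]
        if version is None:
            version = max(versions)  # latest version by default
        return versions[version]

svc = DataVirtualizationService()
svc.register("vision/train-images", "s3://bucket/raw/imgs-v1", version=1)
svc.register("vision/train-images", "s3://bucket/raw/imgs-v2", version=2)
print(svc.resolve("vision/train-images"))  # s3://bucket/raw/imgs-v2
```

Because workflows only hold logical names, intermediate data can be moved, deduplicated, or re-versioned without touching any workflow code — the property that lets the number of applications and workflows grow.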


Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?

Longpre, Shayne, Mahari, Robert, Obeng-Marnu, Naana, Brannon, William, South, Tobin, Gero, Katy, Pentland, Sandy, Kabbara, Jad

arXiv.org Artificial Intelligence

New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.


Enhancing Data Provenance and Model Transparency in Federated Learning Systems -- A Database Approach

Gu, Michael, Naraparaju, Ramasoumya, Zhao, Dongfang

arXiv.org Artificial Intelligence

Federated Learning (FL) presents a promising paradigm for training machine learning models across decentralized edge devices while preserving data privacy. Ensuring the integrity and traceability of data across these distributed environments, however, remains a critical challenge. The ability to create transparent artificial intelligence, such as detailing the training process of a machine learning model, has become an increasingly prominent concern due to the large number of sensitive (hyper)parameters it utilizes; thus, it is imperative to strike a reasonable balance between openness and the need to protect sensitive information. In this paper, we propose one of the first approaches to enhance data provenance and model transparency in federated learning systems. Our methodology leverages a combination of cryptographic techniques and efficient model management to track the transformation of data throughout the FL process, and seeks to increase the reproducibility and trustworthiness of a trained FL model. We demonstrate the effectiveness of our approach through experimental evaluations on diverse FL scenarios, showcasing its ability to tackle accountability and explainability across the board. Our findings show that our system can greatly enhance data transparency in various FL environments by storing chained cryptographic hashes and client model snapshots in our proposed design for data-decoupled FL. This is made possible by also employing multiple optimization techniques which enable comprehensive data provenance without imposing substantial computational loads. Extensive experimental results suggest that integrating a database subsystem into federated learning systems can improve data provenance in an efficient manner, encouraging secure FL adoption in privacy-sensitive applications and paving the way for future advancements in FL transparency and security features.
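The chained-cryptographic-hash idea mentioned in the abstract can be sketched in a few lines: each FL round's client snapshot is hashed together with the previous link, so any later tampering with a stored snapshot breaks verification of the whole chain. This is a minimal, generic hash-chain sketch, not the paper's actual database design; the ledger layout and client names are assumed:

```python
import hashlib
import json

def chain_hash(prev_hash: str, payload: bytes) -> str:
    """Hash the previous link together with the new payload (SHA-256)."""
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def record_round(ledger, client_id, model_snapshot: bytes):
    # Genesis link uses an all-zero previous hash.
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    entry = {"client": client_id, "hash": chain_hash(prev, model_snapshot)}
    ledger.append(entry)
    return entry

def verify(ledger, snapshots):
    # Recompute the chain from the stored snapshots; any mismatch means tampering.
    prev = "0" * 64
    for entry, snap in zip(ledger, snapshots):
        if entry["hash"] != chain_hash(prev, snap):
            return False
        prev = entry["hash"]
    return True

ledger, snaps = [], []
for rnd, cid in enumerate(["client-a", "client-b"]):
    snap = json.dumps({"round": rnd, "weights": [0.1 * rnd]}).encode()
    snaps.append(snap)
    record_round(ledger, cid, snap)

print(verify(ledger, snaps))   # True
snaps[0] = b"tampered"
print(verify(ledger, snaps))   # False
```

Chaining (rather than hashing each snapshot independently) is what makes the provenance record append-only in spirit: altering round k invalidates every hash from k onward.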


Ulterior Motives

Communications of the ACM

Margo Seltzer, the Canada 150 Research Chair in Computer Systems at the University of British Columbia and 2023–2024 ACM Athena Lecturer, is the kind of researcher who stands out not just for her accomplishments, but for her tirelessness. After building a database software library that underpinned many first-generation Internet services, she worked on topics that range from file systems and storage to capturing and accessing data provenance. Here, she speaks with Leah Hoffmann about finding impactful research projects--and keeping up with everything that's going on in the field. The story of Berkeley DB, the database software library that you built with Keith Bostic and Mike Olson, has been told before at greater length, but let me see if I can summarize. Your work on packages such as hash and B-tree was released with Berkeley Unix as the DB 1.85 library.


WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data

Wang, Jingtan, Lu, Xinyang, Zhao, Zitong, Dai, Zhongxiang, Foo, Chuan-Sheng, Ng, See-Kiong, Low, Bryan Kian Hsiang

arXiv.org Machine Learning

The impressive performances of large language models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the intellectual property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data being used to train the LLMs. To this end, it is imperative to be able to (a) identify the data provider who contributed to the generation of a synthetic text by an LLM (source attribution) and (b) verify whether the text data from a data provider has been used to train an LLM (data provenance). In this paper, we show that both problems can be solved by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries), and propose a WAtermarking for Source Attribution (WASA) framework that satisfies these key properties due to our algorithmic designs. Our WASA framework enables an LLM to learn an accurate mapping from the texts of different data providers to their corresponding unique watermarks, which sets the foundation for effective source attribution (and hence data provenance). Extensive empirical evaluations show that our WASA framework achieves effective source attribution and data provenance.
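The abstract's core mechanism — each data provider maps to a unique watermark embedded in generated text, which can later be read back for source attribution — can be illustrated with a deliberately simplified scheme. The real WASA framework learns the mapping inside the LLM; the toy below just appends a provider-specific sequence of zero-width Unicode characters, which is an assumption for illustration only:

```python
# Toy source-attribution watermark: each provider index is encoded as a
# sequence of zero-width characters appended to the generated text.
ZW = {"0": "\u200b", "1": "\u200c"}          # zero-width space / non-joiner
INV = {v: k for k, v in ZW.items()}

def make_watermark(provider_index: int, bits: int = 8) -> str:
    return "".join(ZW[b] for b in format(provider_index, f"0{bits}b"))

def embed(text: str, provider_index: int) -> str:
    # Visually identical to the original text, but carries the provider's mark.
    return text + make_watermark(provider_index)

def attribute(text: str, bits: int = 8) -> int:
    # Decode the trailing zero-width characters back to a provider index.
    tail = text[-bits:]
    return int("".join(INV[c] for c in tail), 2)

marked = embed("a synthetic sentence", provider_index=5)
print(attribute(marked))   # 5
```

The toy makes the two problems from the abstract concrete: `attribute` answers source attribution directly, and a provider can answer the data-provenance question by checking whether their index ever appears in a model's outputs. Unlike this sketch, the actual framework is designed to keep attribution robust against adversaries who edit the text.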


Data Isotopes for Data Provenance in DNNs

Wenger, Emily, Li, Xiuyu, Zhao, Ben Y., Shmatikov, Vitaly

arXiv.org Artificial Intelligence

Today, creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with little control over or knowledge of when their data is appropriated for model training. To empower users to counteract unwanted data use, we design, implement and evaluate a practical system that enables users to detect if their data was used to train a DNN model. We show how users can create special data points we call isotopes, which introduce "spurious features" into DNNs during training. With only query access to a trained model, and no knowledge of the model training process or control of the data labels, a user can apply statistical hypothesis testing to detect if a model has learned the spurious features associated with their isotopes by training on the user's data. This effectively turns DNNs' vulnerability to memorization and spurious correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-service platforms and larger models such as ImageNet, can use physical objects instead of digital marks, and remains generally robust against several adaptive countermeasures.
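The detection logic described — query the model on isotope-marked versus clean inputs and statistically test whether the spurious feature shifts its confidence — can be simulated end to end with a stand-in model. The `toy_model`, the `"mark"` feature, the 0.2 confidence boost, and the crude z-style test statistic are all assumptions for illustration; the paper uses real trained DNNs and proper hypothesis testing:

```python
import random
import statistics

random.seed(0)

def toy_model(x):
    """Stand-in for a trained classifier's confidence on the isotope label.

    It has 'memorized' the spurious feature "mark": inputs carrying it get
    systematically higher confidence, mimicking a model trained on isotopes.
    """
    base = random.gauss(0.5, 0.05)
    return base + (0.2 if "mark" in x else 0.0)

# Only query access is needed: probe with marked and unmarked inputs.
isotope_queries = [f"img-{i}-mark" for i in range(50)]
control_queries = [f"img-{i}" for i in range(50)]

iso_scores = [toy_model(q) for q in isotope_queries]
ctl_scores = [toy_model(q) for q in control_queries]

# Crude test statistic: mean-confidence gap versus the control standard error.
gap = statistics.mean(iso_scores) - statistics.mean(ctl_scores)
detected = gap > 3 * statistics.stdev(ctl_scores) / len(ctl_scores) ** 0.5
print(detected)   # True
```

If the model had never trained on the user's data, the `+0.2` boost would be absent, the gap would be statistical noise, and `detected` would be false — which is exactly how the spurious correlation serves as a provenance signal.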


Data Provenance via Differential Auditing

Mu, Xin, Pang, Ming, Zhu, Feida

arXiv.org Artificial Intelligence

Auditing Data Provenance (ADP), i.e., auditing if a certain piece of data has been used to train a machine learning model, is an important problem in data provenance. The feasibility of the task has been demonstrated by existing auditing techniques, e.g., shadow auditing methods, under certain conditions such as the availability of label information and the knowledge of training protocols for the target model. Unfortunately, both of these conditions are often unavailable in real applications. In this paper, we introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance with a different approach based on statistically significant differentials: after a carefully designed transformation, perturbed input data drawn from the target model's training set results in much more drastic changes in the output than data drawn from outside the training set. This framework allows auditors to distinguish training data from non-training data without needing to train any shadow models with the help of labeled output data. Furthermore, we propose two effective auditing function implementations, an additive one and a multiplicative one. We report evaluations on real-world data sets demonstrating the effectiveness of our proposed auditing technique.
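The differential idea — perturb an input slightly and flag it as training data if the model's output shifts drastically — can be demonstrated with a toy target model that is deliberately made more sensitive around its "memorized" training points. The model, the perturbation size, and the decision threshold are all illustrative assumptions, not the paper's additive or multiplicative auditing functions:

```python
def target_model(x, memorized):
    """Toy model: sharply peaked (sensitive) around memorized training points."""
    center = round(x, 1)
    sharpness = 10.0 if center in memorized else 1.0
    return 1.0 / (1.0 + sharpness * abs(x - center))

train_set = {0.2, 0.7}  # points the model was "trained" on

def differential_audit(x, eps=0.04, threshold=0.15):
    """Flag x as training data if a small perturbation shifts the output a lot."""
    delta = abs(target_model(x, train_set) - target_model(x + eps, train_set))
    return delta > threshold

print(differential_audit(0.2))   # True  (training member: drastic output change)
print(differential_audit(0.4))   # False (non-member: mild output change)
```

Note what the audit does not require: no shadow models, no labels, no knowledge of the training protocol — only the ability to query the target model on an input and its perturbed copy, matching the black-box setting the abstract targets.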


A "Glass Box" Approach to Responsible Machine Learning - insideBIGDATA

#artificialintelligence

Machine learning doesn't always have to be an abstruse technology. The multi-parameter and hyper-parameter methodology of complex deep neural networks, for example, is only one type of this cognitive computing manifestation. There are other machine learning varieties (and even some involving deep neural networks) in which the results of models, how they were determined, and which intricacies influenced them, are much more transparent. It all depends on how well organizations understand their data provenance. Comprehending just about everything that happened to training data for models, as well as that for the production data models encounter, is integral to explaining, refining, and improving their results.


6 Things You Need To Know About Data Management And Why It Matters For Computer Vision - KDnuggets

#artificialintelligence

The race to Industry 4.0 and accelerated adoption of digital automation are pressure testing organizations across industries. It is becoming increasingly matter-of-fact that an enterprise's ability to leverage data is a key source of competitive advantage--this principle especially holds true when it comes to building and maintaining computer vision applications. Visual automation models are powered by images and videos that capture a digital representation of our physical world. In most enterprises, media is captured across multiple sensor edge devices and lives in siloed source systems, making the integration of media across environments a core challenge that needs to be solved when building a computer vision system that seeks to automate visual inspection. Media becomes even more important in a world where model architectures (the code that builds up the neural networks on which an AI model is trained) are increasingly commoditized and stable.