Law
'Mass theft': Thousands of artists call for AI art auction to be cancelled
Thousands of artists are urging the auction house Christie's to cancel a sale of art created with artificial intelligence, claiming the technology behind the works is committing "mass theft". The Augmented Intelligence auction has been described by Christie's as the first AI-dedicated sale by a major auctioneer and features 20 lots with prices ranging from 10,000 to 250,000 for works by artists including Refik Andanol and the late AI art pioneer Harold Cohen. A lettter calling for the auction to be scrapped has received 3,000 signatures, including from Karla Ortiz and Kelly McKernan, who are suing AI companies over claims that the firms' image generation tools have used their work without permission. These models, and the companies behind them, exploit human artists, using their work without permission or payment to build commercial AI products that compete with them." Calling on Christie's to cancel the auction, which starts on 20 February, it adds: "Your support of these models, and the people who use them, rewards and further incentivizes AI companies' mass theft of human artists' work." The British composer Ed Newton-Rex, a key figure in the campaign by creative professionals for protection of their work and a signatory to the letter, said at least nine of the works appearing in the auction appeared to have used models trained on artists' work. However, other pieces in the auction do not appear to have used such models. A spokesperson for Christie's said that "in most cases" the AI used to create art in the auction had been trained on the artists' "own inputs". "The artists represented in this sale all have strong, existing multidisciplinary art practices, some recognised in leading museum collections.
Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium
Adibi, Amin, Cao, Xu, Ji, Zongliang, Kaur, Jivat Neet, Chen, Winston, Healey, Elizabeth, Nuwagira, Brighton, Ye, Wenqian, Woollard, Geoffrey, Xu, Maxwell A, Cui, Hejie, Xi, Johnny, Chang, Trenton, Bikia, Vasiliki, Zhang, Nicole, Noori, Ayush, Xia, Yuan, Hossain, Md. Belal, Frank, Hanna A., Peluso, Alina, Pu, Yuan, Shen, Shannon Zejiang, Wu, John, Fallahpour, Adibvafa, Mahbub, Sazan, Duncan, Ross, Zhang, Yuwei, Cao, Yurui, Xu, Zuheng, Craig, Michael, Krishnan, Rahul G., Beheshti, Rahmatollah, Rehg, James M., Karim, Mohammad Ehsanul, Coffee, Megan, Celi, Leo Anthony, Fries, Jason Alan, Sadatsafavi, Mohsen, Shung, Dennis, McWeeney, Shannon, Dafflon, Jessica, Jabbour, Sarah
The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British Columbia, Canada. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. The organization of the research roundtables at the conference involved 13 senior and 27 junior chairs across 13 tables. Each roundtable session included an invited senior chair (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with an interest in the session's topic.
Transparent NLP: Using RAG and LLM Alignment for Privacy Q&A
Leschanowsky, Anna, Kolagar, Zahra, Çano, Erion, Habernal, Ivan, Hallinan, Dara, Habets, Emanuël A. P., Popp, Birgit
The transparency principle of the General Data Protection Regulation (GDPR) requires data processing information to be clear, precise, and accessible. While language models show promise in this context, their probabilistic nature complicates truthfulness and comprehensibility. This paper examines state-of-the-art Retrieval Augmented Generation (RAG) systems enhanced with alignment techniques to fulfill GDPR obligations. We evaluate RAG systems incorporating an alignment module like Rewindable Auto-regressive Inference (RAIN) and our proposed multidimensional extension, MultiRAIN, using a Privacy Q&A dataset. Responses are optimized for preciseness and comprehensibility and are assessed through 21 metrics, including deterministic and large language model-based evaluations. Our results show that RAG systems with an alignment module outperform baseline RAG systems on most metrics, though none fully match human answers. Principal component analysis of the results reveals complex interactions between metrics, highlighting the need to refine metrics. This study provides a foundation for integrating advanced natural language processing systems into legal compliance frameworks.
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Eriksson, Maria, Purificato, Erasmo, Noroozian, Arman, Vinagre, Joao, Chaslot, Guillaume, Gomez, Emilia, Fernandez-Llorca, David
Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too does concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.
Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches
Pratapa, Adithya, Mitamura, Teruko
Automatically summarizing large text collections is a valuable tool for document research, with applications in journalism, academic research, legal work, and many other fields. In this work, we contrast two classes of systems for large-scale multi-document summarization (MDS): compression and full-text. Compression-based methods use a multi-stage pipeline and often lead to lossy summaries. Full-text methods promise a lossless summary by relying on recent advances in long-context reasoning. To understand their utility on large-scale MDS, we evaluated them on three datasets, each containing approximately one hundred documents per summary. Our experiments cover a diverse set of long-context transformers (Llama-3.1, Command-R, Jamba-1.5-Mini) and compression methods (retrieval-augmented, hierarchical, incremental). Overall, we find that full-text and retrieval methods perform the best in most settings. With further analysis into the salient information retention patterns, we show that compression-based methods show strong promise at intermediate stages, even outperforming full-context. However, they suffer information loss due to their multi-stage pipeline and lack of global context. Our results highlight the need to develop hybrid approaches that combine compression and full-text approaches for optimal performance on large-scale multi-document summarization.
Perceived Confidence Scoring for Data Annotation with Zero-Shot LLMs
Salimian, Sina, Uddin, Gias, Jahan, Most Husne, Raza, Shaina
Zero-shot LLMs are now also used for textual classification tasks, e.g., sentiment/emotion detection of a given input as a sentence/article. However, their performance can be suboptimal in such data annotation tasks. We introduce a novel technique Perceived Confidence Scoring (PCS) that evaluates LLM's confidence for its classification of an input by leveraging Metamorphic Relations (MRs). The MRs generate semantically equivalent yet textually mutated versions of the input. Following the principles of Metamorphic Testing (MT), the mutated versions are expected to have annotation labels similar to the input. By analyzing the consistency of LLM responses across these variations, PCS computes a confidence score based on the frequency of predicted labels. PCS can be used both for single LLM and multiple LLM settings (e.g., majority voting). We introduce an algorithm Perceived Differential Evolution (PDE) that determines the optimal weights assigned to the MRs and the LLMs for a classification task. Empirical evaluation shows PCS significantly improves zero-shot accuracy for Llama-3-8B-Instruct (4.96%) and Mistral-7B-Instruct-v0.3 (10.52%), with Gemma-2-9b-it showing a 9.39% gain. When combining all three models, PCS significantly outperforms majority voting by 7.75%.
AI Alignment at Your Discretion
Buyl, Maarten, Khalaf, Hadi, Verdun, Claudio Mayrink, Paes, Lucas Monteiro, Machado, Caio C. Vieira, Calmon, Flavio du Pin
In AI alignment, extensive latitude must be granted to annotators, either human or algorithmic, to judge which model outputs are `better' or `safer.' We refer to this latitude as alignment discretion. Such discretion remains largely unexamined, posing two risks: (i) annotators may use their power of discretion arbitrarily, and (ii) models may fail to mimic this discretion. To study this phenomenon, we draw on legal concepts of discretion that structure how decision-making authority is conferred and exercised, particularly in cases where principles conflict or their application is unclear or irrelevant. Extended to AI alignment, discretion is required when alignment principles and rules are (inevitably) conflicting or indecisive. We present a set of metrics to systematically analyze when and how discretion in AI alignment is exercised, such that both risks (i) and (ii) can be observed. Moreover, we distinguish between human and algorithmic discretion and analyze the discrepancy between them. By measuring both human and algorithmic discretion over safety alignment datasets, we reveal layers of discretion in the alignment process that were previously unaccounted for. Furthermore, we demonstrate how algorithms trained on these datasets develop their own forms of discretion in interpreting and applying these principles, which challenges the purpose of having any principles at all. Our paper presents the first step towards formalizing this core gap in current alignment processes, and we call on the community to further scrutinize and control alignment discretion.
Unconstrained Body Recognition at Altitude and Range: Comparing Four Approaches
Myers, Blake A, Hill, Matthew Q, Gandi, Veda Nandan, Metz, Thomas M, O'Toole, Alice J
This study presents an investigation of four distinct approaches to long-term person identification using body shape. Unlike short-term re-identification systems that rely on temporary features (e.g., clothing), we focus on learning persistent body shape characteristics that remain stable over time. We introduce a body identification model based on a Vision Transformer (ViT) (Body Identification from Diverse Datasets, BIDDS) and on a Swin-ViT model (Swin-BIDDS). We also expand on previous approaches based on the Linguistic and Non-linguistic Core ResNet Identity Models (LCRIM and NLCRIM), but with improved training. All models are trained on a large and diverse dataset of over 1.9 million images of approximately 5k identities across 9 databases. Performance was evaluated on standard re-identification benchmark datasets (MARS, MSMT17, Outdoor Gait, DeepChange) and on an unconstrained dataset that includes images at a distance (from close-range to 1000m), at altitude (from an unmanned aerial vehicle, UAV), and with clothing change. A comparative analysis across these models provides insights into how different backbone architectures and input image sizes impact long-term body identification performance across real-world conditions.
AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements
Bora, Adriana Eufrosiana, St-Charles, Pierre-Luc, Bronzi, Mirko, Tchango, Arsène Fansi, Rousseau, Bruno, Mengersen, Kerrie
Despite over a decade of legislative efforts to address modern slavery in the supply chains of large corporations, the effectiveness of government oversight remains hampered by the challenge of scrutinizing thousands of statements annually. While Large Language Models (LLMs) can be considered a well established solution for the automatic analysis and summarization of documents, recognizing concrete modern slavery countermeasures taken by companies and differentiating those from vague claims remains a challenging task. To help evaluate and fine-tune LLMs for the assessment of corporate statements, we introduce a dataset composed of 5,731 modern slavery statements taken from the Australian Modern Slavery Register and annotated at the sentence level. This paper details the construction steps for the dataset that include the careful design of annotation specifications, the selection and preprocessing of statements, and the creation of high-quality annotation subsets for effective model evaluations. To demonstrate our dataset's utility, we propose a machine learning methodology for the detection of sentences relevant to mandatory reporting requirements set by the Australian Modern Slavery Act. We then follow this methodology to benchmark modern language models under zero-shot and supervised learning settings.