Law
Position: Bridge the Gaps between Machine Unlearning and AIRegulation
The "right to be forgotten" and the data privacy laws that encode it have motivated machine unlearning since its earliest days. Now, some argue that an inbound wave of artificial intelligence regulations -- like the European Union's Artificial Intelligence Act (AIA) -- may offer important new use cases for machine unlearning. However, this position paper argues, this opportunity will only be realized if researchers proactively bridge the (sometimes sizable) gaps between machine unlearning's state of the art and its potential applications to AI regulation. To demonstrate this point, we use the AIA as our primary case study. Specifically, we deliver a "state of the union" as regards machine unlearning's current potential (or, in many cases, lack thereof) for aiding compliance with various provisions of the AIA. This starts with a precise cataloging of the potential applications of machine unlearning to AIA compliance. For each, we flag the technical gaps that exist between the potential application and the state of the art of machine unlearning. Finally, we end with a call to action: for machine learning researchers to solve the open technical questions that could unlock machine unlearning's potential to assist compliance with the AIA -- and other AI regulations like it.
Machine Unlearning under Overparameterization
Machine unlearning algorithms aim to remove the influence of specific training samples, ideally recovering the model that would have resulted from training on the remaining data alone. We study unlearning in the overparameterized setting, where many models interpolate the data, and defining the solution as any loss minimizer over the retained set--as in prior work in the underparameterized setting--is inadequate, since the original model may already interpolate the retained data and satisfy this condition. In this regime, loss gradients vanish, rendering prior methods based on gradient perturbations ineffective, motivating both new unlearning definitions and algorithms. For this setting, we define the unlearning solution as the minimum-complexity interpolator over the retained data and propose a new algorithmic framework that only requires access to model gradients on the retained set at the original solution. We minimize a regularized objective over perturbations constrained to be orthogonal to these model gradients, a first-order relaxation of the interpolation condition. For different model classes, we provide exact and approximate unlearning guarantees and demonstrate that an implementation of our framework outperforms existing baselines across various unlearning experiments.
9d411e87d0f37059f40fb27c5de00ba0-Supplemental-Datasets_and_Benchmarks_Track.pdf
The following section is answers to questions listed in datasheets for datasets.858 A.1 Motivation859 Question: For what purpose was the dataset created? Was there a specific task in mind?860 Was there a specific gap that needed to be filled? Answer: To evaluate the linguistic robustness of language models across diverse English862 varieties by transforming Standard American English (SAE) datasets.863 Question: Who created the dataset (e.g., which team, research group) and on behalf of864 which entity (e.g., company, institution, organization)?865 Answer: The authors of this paper.866 Question: Who funded the creation of the dataset? If there is an associated grant, please867 provide the name of the grantor and the grant name and number.868
Trans-EnV: AFramework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on nonstandard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties.
MMLONGBENCH: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLONGBENCH, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLONGBENCH is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language longcontext ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLONGBENCH1 provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
PubSub-VFL: Towards Efficient Two-Party Split Learning in Heterogeneous Environments via Publisher/Subscriber Architecture
With the rapid advancement of the digital economy, data collaboration between organizations has become a well-established business model, driving the growth of various industries. However, privacy concerns make direct data sharing impractical. To address this, Two-Party Split Learning (a.k.a. Vertical Federated Learning (VFL)) has emerged as a promising solution for secure collaborative learning. Despite its advantages, this architecture still suffers from low computational resource utilization and training efficiency.
SentinelKilnDB: ALarge-Scale Dataset and Benchmark for OBBBrick Kiln Detection in South Asia Using Satellite Imagery Supplementary Information
The questions are presented in blue, with our corresponding responses shown in black. For what purpose was the dataset created? Was there a specific task in mind? This dataset was created for academic and research purposes to advance scientific understanding and support policy development on air quality and sustainability issues. The findings highlight important opportunities to improve regulatory compliance and encourage the adoption of cleaner technologies within the brick kiln sector, which is a significant contributor to regional air pollution. Beyond its environmental relevance, this dataset is especially valuable for the fields of object detection and computer vision. It provides a large-scale, hand-validated collection of brick kiln locations annotated with oriented bounding boxes (OBBs) on freely available Sentinel-2 satellite imagery.
Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels
We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N = 53). We release the brain response data set (N = 29), associated face images used as stimuli data, and a codebase to initiate further research on this novel task.
SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks
Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 47 foundation models and has collected over 20,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building modelbased automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on collected preference data. It measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
Michael Fassbender says it is becoming harder to know what to trust online
What happens if pretending to be someone else becomes your entire life? It is a question at the heart of many of the biggest spy dramas, from Slow Horses to Black Doves - and it is one that TV thriller series The Agency explores more deeply than most. Returning for a second season, the Paramount+ thriller follows CIA operatives living under deep-cover identities. It examines not just the dangers of espionage, but the psychological cost of maintaining a lie for years. Starring Michael Fassbender, Richard Gere and Katherine Waterston, the series is based on acclaimed French drama The Bureau.