Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to capture precise semantics. For example, semantically equivalent sentences expressed with different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics are not well understood.
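A minimal sketch (not the dataset's own protocol) of how such divergence can be quantified: encode paraphrases with an off-the-shelf sentence encoder and measure how far their representations drift apart. The model name and metric below are illustrative assumptions.

```python
# Minimal sketch (not the dataset's protocol): quantify how much a sentence
# encoder's representations diverge across semantically equivalent paraphrases.
# The model name and the cosine-based metric are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf encoder

paraphrases = [
    "The cat is sitting on the mat.",
    "A cat sits on the mat.",
    "On the mat, a cat is seated.",
]

emb = model.encode(paraphrases, normalize_embeddings=True)  # unit-norm vectors

# Pairwise cosine similarities; 1.0 would mean identical representations.
sims = emb @ emb.T
divergence = 1.0 - sims[np.triu_indices(len(paraphrases), k=1)]
print("mean representational divergence:", divergence.mean())
```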
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples.
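As a rough illustration of how off-the-shelf models could surface such samples (this is not the authors' pipeline), one can use CLIP to flag image-text pairs that a retrieval model confuses; the checkpoint and decision rule below are assumptions.

```python
# Hedged sketch, not the NaturalBench pipeline: use an off-the-shelf CLIP model
# to flag image-text pairs that a retrieval model confuses, i.e. pairs where a
# mismatched caption scores at least as high as the true one. Such pairs are
# natural candidates for adversarial VQA construction.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def is_confusing_pair(image_a: Image.Image, caption_a: str,
                      image_b: Image.Image, caption_b: str) -> bool:
    inputs = processor(text=[caption_a, caption_b],
                       images=[image_a, image_b],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (2 images x 2 captions)
    # Confusing if either image prefers the other image's caption.
    return bool(logits[0, 1] >= logits[0, 0] or logits[1, 0] >= logits[1, 1])
```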
Covariance-Aware Private Mean Estimation Without Private Covariance Estimation
Each of our estimators is based on a simple, general approach to designing differentially private mechanisms, but with novel technical steps to make the estimator private and sample-efficient. Our first estimator samples a point with approximately maximum Tukey depth using the exponential mechanism, but restricted to the set of points of large Tukey depth. Proving that this mechanism is private requires a novel analysis. Our second estimator perturbs the empirical mean of the data set with noise calibrated to the empirical covariance, without releasing the covariance itself. Its sample complexity guarantees hold more generally for subgaussian distributions, albeit with a slightly worse dependence on the privacy parameter. For both estimators, careful preprocessing of the data is required to satisfy differential privacy.
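The second estimator's core idea, adding noise whose shape follows the empirical covariance, can be sketched as follows. This is an illustration only, not a differentially private mechanism: the noise scale and the preprocessing/truncation steps the privacy proof relies on are elided.

```python
# Illustration only, not a differentially private mechanism: the second
# estimator's key idea is to add noise whose shape matches the empirical
# covariance, so that error is measured in the data's own geometry, without
# releasing the covariance itself. The `noise_multiplier` and the required
# preprocessing are elided assumptions.
import numpy as np

def covariance_shaped_mean(x: np.ndarray, noise_multiplier: float,
                           rng: np.random.Generator) -> np.ndarray:
    """x: (n, d) data matrix. Returns empirical mean + covariance-shaped noise."""
    n, d = x.shape
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)              # empirical covariance (d x d)
    # Noise ~ N(0, (noise_multiplier / n)^2 * cov), calibrated to the data.
    noise = rng.multivariate_normal(np.zeros(d),
                                    (noise_multiplier / n) ** 2 * cov)
    return mean + noise

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500)
print(covariance_shaped_mean(data, noise_multiplier=5.0, rng=rng))
```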
'Unethical' AI research on Reddit under fire
A study that used artificial intelligence–generated content to "participate" in online discussions and test whether AI was more successful at changing people's minds than human-generated content has caused an uproar because of ethical concerns about the work. This week some of the unwitting research participants publicly asked the University of Zürich (UZH), where the researchers behind the experiment hold positions, to investigate and apologize. "I think people have a reasonable expectation to not be in scientific experiments without their consent," says Casey Fiesler, an expert on internet research ethics at the University of Colorado Boulder. A university statement emailed to Science says the researchers--who remain anonymous--have decided not to publish their results. The university will investigate the incident, the statement says.
Researchers secretly experimented on Reddit users with AI-generated comments
A group of researchers covertly ran a months-long "unauthorized" experiment in one of Reddit's most popular communities using AI-generated comments to test the persuasiveness of large language models. The experiment, which was revealed over the weekend by moderators of r/changemyview, is described by Reddit mods as "psychological manipulation" of unsuspecting users. "The CMV Mod Team needs to inform the CMV community about an unauthorized experiment conducted by researchers from the University of Zurich on CMV users," the subreddit's moderators wrote in a lengthy post notifying Redditors about the research. "This experiment deployed AI-generated comments to study how AI could be used to change views." The researchers used LLMs to create comments in response to posts on r/changemyview, a subreddit where Reddit users post (often controversial or provocative) opinions and request debate from other users.
Reddit users were subjected to AI-powered experiment without consent
Reddit users who were unwittingly subjected to an AI-powered experiment have hit back at scientists for conducting research on them without permission – and have sparked a wider debate about such experiments. The social media site Reddit is split into "subreddits" dedicated to particular communities, each with its own volunteer moderators. Members of one subreddit, called r/ChangeMyView because it invites people to discuss potentially contentious issues, were recently informed by the moderators that researchers at the University of Zurich, Switzerland, had been using the site as an online laboratory. The team's experiment seeded more than 1700 comments generated by a variety of large language models (LLMs) into the subreddit, without disclosing that they weren't real, to gauge people's reactions. These comments included ones mimicking people who had been raped or pretending to be a trauma counsellor specialising in abuse, among others.
A Swiss Army Knife for Heterogeneous Federated Learning: Flexible Coupling via Trace Norm
The heterogeneity issue in federated learning (FL) has attracted increasing attention, and most existing methods attempt to address it. Due to heterogeneity in systems and objectives, enabling clients to hold models of different architectures and tasks with different demands has become an important direction in FL. However, most existing FL methods rest on the homogeneity assumption, namely that different clients hold models with the same architecture and the same tasks, and are therefore unable to handle complex and multivariate data and tasks. To flexibly address these heterogeneity limitations, we propose FedSAK, a novel federated multi-task learning framework built on the tensor trace norm. Specifically, it treats each client as a task and splits the local model into a feature extractor and a prediction head. Clients can flexibly choose which structures to share based on their heterogeneous situations and upload them to the server, which learns correlations among client models by mining low-rank model structures through the tensor trace norm. Furthermore, we derive convergence and generalization bounds under non-convex settings. Evaluated on 6 real-world datasets against 13 advanced FL models, FedSAK demonstrates superior performance.
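A simplified sketch of the trace-norm coupling (the unfolding scheme and names are assumptions, not FedSAK's implementation): stack the clients' shared parameters into a tensor and penalize the nuclear norms of its unfoldings, which encourages the low-rank structure the server mines.

```python
# Hedged sketch of tensor-trace-norm coupling across clients. The exact
# unfolding scheme and function names are illustrative assumptions.
import torch

def tensor_trace_norm(client_weights: list[torch.Tensor]) -> torch.Tensor:
    """client_weights: per-client parameter matrices of identical shape (p, q)."""
    W = torch.stack(client_weights)            # tensor of shape (clients, p, q)
    penalty = torch.zeros((), dtype=W.dtype)
    for mode in range(W.dim()):
        # Sum of nuclear norms of the mode unfoldings encourages low rank,
        # i.e. correlation among client models.
        unfolding = W.movedim(mode, 0).reshape(W.shape[mode], -1)
        penalty = penalty + torch.linalg.matrix_norm(unfolding, ord="nuc")
    return penalty

# Server-side usage: add the penalty to the aggregation objective.
heads = [torch.randn(16, 8, requires_grad=True) for _ in range(5)]
loss = tensor_trace_norm(heads)
loss.backward()
```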
Segment Anything without Supervision
The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes.
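A generic, heavily simplified illustration of a divide-and-conquer segmentation hierarchy, in the spirit of but not implementing UnSAM: recursively split per-pixel features into two clusters and record the nested masks as a tree. All specifics below are assumptions.

```python
# Generic illustration only, not the UnSAM method: recursively split pixel
# features into two clusters, recording the resulting masks at every level so
# that coarse-to-fine segments form a hierarchy.
import numpy as np
from sklearn.cluster import KMeans

def split_hierarchy(features: np.ndarray, mask: np.ndarray,
                    depth: int, max_depth: int, tree: list) -> None:
    """features: (H, W, C) per-pixel features; mask: boolean (H, W) region."""
    tree.append(mask)
    if depth == max_depth or mask.sum() < 64:
        return
    feats = features[mask]                                  # (n_pixels, C)
    labels = KMeans(n_clusters=2, n_init=4).fit_predict(feats)
    for k in (0, 1):
        child = np.zeros_like(mask)
        child[mask] = labels == k
        split_hierarchy(features, child, depth + 1, max_depth, tree)

# Usage: raw colors (or self-supervised features) as per-pixel descriptors.
image = np.random.rand(64, 64, 3)
hierarchy: list = []
split_hierarchy(image, np.ones(image.shape[:2], dtype=bool), 0, 3, hierarchy)
print(f"{len(hierarchy)} nested masks discovered")
```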
Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection
This paper presents Incremental Vision-Language Object Detection (IVLOD), a novel learning task designed to incrementally adapt pre-trained Vision-Language Object Detection Models (VLODMs) to various specialized domains, while simultaneously preserving their zero-shot generalization capabilities for the generalized domain. To address this new challenge, we present Zero-interference Reparameterizable Adaptation (ZiRa), a novel method that introduces Zero-interference Loss and reparameterization techniques to tackle IVLOD without incurring a significant increase in memory usage. Comprehensive experiments on the COCO and ODinW-13 datasets demonstrate that ZiRa effectively safeguards the zero-shot generalization ability of VLODMs while continuously adapting to new tasks. Specifically, after training on the ODinW-13 datasets, ZiRa exhibits superior performance compared to CL-DETR and iDETR, boosting zero-shot generalizability by a substantial 13.91 and 8.74 AP, respectively.
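A hedged sketch of the general mechanism (module and loss names are assumptions, not ZiRa's exact design): a low-rank side branch is trained next to a frozen pre-trained layer, a zero-interference penalty keeps its output small to protect zero-shot behavior, and the branch can later be merged back into the base weights at no extra inference cost.

```python
# Hedged sketch of the general idea: a reparameterizable side branch with a
# "zero-interference" penalty. All names and shapes are illustrative.
import torch
import torch.nn as nn

class SideBranchLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)              # frozen VLODM layer
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                      # start with no interference

    def forward(self, x):
        branch = self.up(self.down(x))
        self.zero_interference = branch.pow(2).mean()       # penalty term
        return self.base(x) + branch

    @torch.no_grad()
    def merge(self):
        """Reparameterize: fold the trained branch into the frozen base weights."""
        self.base.weight += self.up.weight @ self.down.weight

layer = SideBranchLinear(nn.Linear(256, 256))
out = layer(torch.randn(4, 256))
loss = out.sum() + 0.1 * layer.zero_interference            # task loss + ZI penalty
```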
Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization
Transformers have demonstrated great power in the recent development of large foundation models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments on the experimental side. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit the training data, are still not fully understood. To address this gap, this work delves into the benign overfitting perspective of transformers in vision. To this end, we study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer, trained by gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by the softmax and the interdependent nature of multiple weights in transformer optimization, we successfully characterize the training dynamics and establish generalization after training. Our results give a sharp condition, based on the signal-to-noise ratio in the data model, that distinguishes the small-test-error regime from the large-test-error regime. The theoretical results are further verified by experimental simulation. To the best of our knowledge, this is the first work to characterize benign overfitting for Transformers.
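A toy simulation in the spirit of this setting (the data model and architecture details are assumptions, not the authors' exact construction): one signal token plus noise tokens, a single softmax self-attention layer with a linear readout, full-batch gradient descent, and test error inspected across signal-to-noise ratios.

```python
# Toy simulation, not the paper's exact model: each input has T tokens, one
# carrying a class signal +/- mu and the rest Gaussian noise; a one-layer
# softmax self-attention followed by a linear head is trained by full-batch
# gradient descent, and train/test error are compared as the SNR varies.
import torch
import torch.nn as nn

class OneLayerAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.qk = nn.Linear(d, d, bias=False)    # merged query-key map
        self.v = nn.Linear(d, d, bias=False)
        self.head = nn.Linear(d, 1, bias=False)  # fully connected readout

    def forward(self, x):                        # x: (n, T, d)
        attn = torch.softmax(x @ self.qk(x).transpose(1, 2), dim=-1)
        return self.head((attn @ self.v(x)).mean(dim=1)).squeeze(-1)

def make_data(n, T, d, snr, gen):
    y = torch.randint(0, 2, (n,), generator=gen).float() * 2 - 1
    x = torch.randn(n, T, d, generator=gen)           # noise tokens
    mu = torch.zeros(d)
    mu[0] = snr
    x[:, 0, :] = y[:, None] * mu                      # one signal token
    return x, y

gen = torch.Generator().manual_seed(0)
for snr in (0.5, 2.0, 8.0):
    xtr, ytr = make_data(64, 8, 32, snr, gen)
    xte, yte = make_data(512, 8, 32, snr, gen)
    model = OneLayerAttention(32)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(2000):                             # full-batch gradient descent
        opt.zero_grad()
        nn.functional.soft_margin_loss(model(xtr), ytr).backward()
        opt.step()
    train_err = (model(xtr).sign() != ytr).float().mean()
    test_err = (model(xte).sign() != yte).float().mean()
    print(f"SNR {snr}: train error {train_err:.2f}, test error {test_err:.2f}")
```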