Goto

Collaborating Authors

 Large Language Model


ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Neural Information Processing Systems

Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe\footnote{ConMe is an abbreviation for Confuse Me.} -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.


FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

Neural Information Processing Systems

The rapid development of Large Language Models (LLMs) has been pivotal in advancing AI, with pre-trained LLMs being adaptable to diverse downstream tasks through fine-tuning. Federated learning (FL) further enhances fine-tuning in a privacy-aware manner by utilizing clients' local data through in-situ computation, eliminating the need for data movement. However, fine-tuning LLMs, given their massive scale of parameters, poses challenges for clients with constrained and heterogeneous resources in FL. Previous methods employed low-rank adaptation (LoRA) for efficient federated fine-tuning but utilized traditional FL aggregation strategies on LoRA adapters. This approach led to mathematically inaccurate aggregation noise, reducing fine-tuning effectiveness and failing to address heterogeneous LoRAs.


Efficient multi-prompt evaluation of LLMs

Neural Information Processing Systems

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Moreover, we show how PromptEval can be useful in LLM-as-a-judge and best prompt identification applications.


Google Shakes Up Its Browser Agent Team Amid OpenClaw Craze

WIRED

As Silicon Valley obsesses over a new wave of AI coding agents, Google and other AI labs are shifting their bets. Google is shaking up the team behind Project Mariner, its AI agent that can navigate the Chrome browser and complete tasks on a user's behalf, WIRED has learned. In recent months, some Google Labs staffers who worked on the research prototype have moved on to higher-priority projects, according to two people familiar with the matter. A Google spokesperson confirmed the changes, but said the computer use capabilities developed under Project Mariner will be incorporated into the company's agent strategy moving forward. Google has already folded some of these capabilities into other agent products, including the recently launched Gemini Agent, the spokesperson added.


RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Neural Information Processing Systems

Despite Retrieval-Augmented Generation (RAG) has shown promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta evaluation verifies that RAGChecker has significantly better correlations with human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8 RAG systems and conduct an in-depth analysis of their performance, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.


Watermarking Makes Language Models Radioactive

Neural Information Processing Systems

We investigate the radioactivity of text generated by large language models (LLM), \ie whether it is possible to detect that such synthetic input was used to train a subsequent LLM.Current methods like membership inference or active IP protection either work only in settings where the suspected text is known or do not provide reliable statistical guarantees.We discover that, on the contrary, it is possible to reliably determine if a language model was trained on synthetic data if that data is output by a watermarked LLM.Our new methods, specialized for radioactivity, detects with a provable confidence weak residuals of the watermark signal in the fine-tuned LLM.We link the radioactivity contamination level to the following properties: the watermark robustness, its proportion in the training set, and the fine-tuning process.For instance, if the suspect model is open-weight, we demonstrate that training on watermarked instructions can be detected with high confidence ($p$-value $ < 10^{-5}$) even when as little as $5\%$ of training text is watermarked.


Mystery AI model suspected to be DeepSeek V4 is revealed to be from Xiaomi

The Japan Times

A powerful artificial intelligence model that appeared anonymously on a developer platform last week was revealed to be from Chinese smartphone and electric vehicle giant Xiaomi, and not DeepSeek as initially thought. BEIJING - A powerful artificial intelligence model that appeared anonymously on a developer platform last week was revealed on Wednesday to be from Chinese smartphone and electric vehicle giant Xiaomi, after it fueled speculation that startup DeepSeek was quietly testing its next-generation system ahead of a launch. The release of DeepSeek's low-cost models DeepSeek-V3 and R1 triggered a global tech stock selloff last year, causing investors to question whether U.S. AI firms needed to spend billions of dollars on AI computing power. Since then, there has been a great deal of interest in DeepSeek-V4, a next-generation model that has yet to be released. The mysterious free model, called Hunter Alpha, surfaced on the AI gateway platform OpenRouter on March 11 without any developer attribution and was later described by the platform as a "stealth model." In a time of both misinformation and too much information, quality journalism is more crucial than ever.


MemoryFormer : Minimize Transformer Computation by Removing Fully-Connected Layers

Neural Information Processing Systems

In order to reduce the computational complexity of large language models, great efforts have been made to to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection.


Grounding Multimodal Large Language Models in Actions

Neural Information Processing Systems

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, including both continuous and discrete actions. For continuous actions, a set of learned tokenizations that capture an action at various resolutions allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action grounding approaches on five different environments, encompassing over 114 embodied tasks.


DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

Neural Information Processing Systems

Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules generated by undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be orthogonally integrated with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5$\times$, and enable training with a sequence length 6.2$\times$ larger on a single NVIDIA-A100 GPU.