Self-Chained Image-Language Model for Video Localization and Question Answering

Neural Information Processing Systems

Recent studies have shown promising results from utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and question answering on videos.
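The self-chaining idea above can be sketched in a few lines: the same model is applied twice, first as a query-aware keyframe localizer over uniformly sampled frames, then as an answerer restricted to the selected frames. The scoring and answering functions below are toy stand-ins (simple word overlap on frame captions), not BLIP-2.

```python
# Minimal sketch of a self-chained localize-then-answer pipeline, in the
# spirit of SeViLA. The relevance scorer and answerer are hypothetical
# placeholders for the two roles one image-language model plays.

def relevance_score(frame_caption: str, query: str) -> float:
    """Toy localizer: fraction of query words found in the frame caption."""
    q = set(query.lower().split())
    f = set(frame_caption.lower().split())
    return len(q & f) / max(len(q), 1)

def localize(frames: list[str], query: str, k: int = 2) -> list[int]:
    """Rank uniformly sampled frames by query relevance and keep the top-k."""
    ranked = sorted(range(len(frames)),
                    key=lambda i: relevance_score(frames[i], query),
                    reverse=True)
    return sorted(ranked[:k])  # restore temporal order

def answer(frames: list[str], query: str, keyframes: list[int]) -> str:
    """Toy answerer: responds using only the selected keyframes."""
    return " / ".join(frames[i] for i in keyframes)

frames = ["a man enters a kitchen", "the man chops onions",
          "the man cries", "credits roll"]
query = "why does the man cry"
keys = localize(frames, query)
print(keys, "->", answer(frames, query, keys))
```

The point of the chain is that the answerer never sees the irrelevant frames, which is what uniform sampling alone cannot guarantee.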



Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models

Jamaa, Yassine, AlKhamissi, Badr, Ghosh, Satrajit, Schrimpf, Martin

arXiv.org Artificial Intelligence

This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets and localization methods that more accurately capture task-specific units.
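The contrast-localizer procedure described above can be sketched concretely: record unit activations on a target stimulus set and a control set, rank units by the mean activation difference, and build a lesion mask over the top-ranked units. The arrays below are synthetic stand-ins for real model activations, with a known group of responsive units planted for illustration.

```python
import numpy as np

# Hedged sketch of contrast-based localization: rank units by their mean
# activation difference between target and control stimuli, then construct
# an ablation mask over the top-k units. Data are synthetic, not from a
# real LLM.

rng = np.random.default_rng(0)
n_stimuli, n_units = 40, 100
target_acts = rng.normal(0.0, 1.0, (n_stimuli, n_units))
target_acts[:, :10] += 2.0  # planted: units 0-9 respond to the target condition
control_acts = rng.normal(0.0, 1.0, (n_stimuli, n_units))

# Localization: mean contrast (target minus control) per unit.
contrast = target_acts.mean(axis=0) - control_acts.mean(axis=0)
top_units = np.argsort(contrast)[::-1][:10]

# Ablation: a lesion mask that zeroes out the selected units.
mask = np.ones(n_units)
mask[top_units] = 0.0

print(sorted(top_units.tolist()))
```

The paper's finding is precisely that lesioning `top_units` selected this way does not always produce the largest downstream deficit, which is what motivates questioning the localizer.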


Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

Andreux, Mathieu, Skuk, Breno Baldas, Benchekroun, Hamza, Biré, Emilien, Bonnet, Antoine, Bordie, Riaz, Bout, Nathan, Brunel, Matthias, Cedoz, Pierre-Louis, Chassang, Antoine, Chen, Mickaël, Constantinou, Alexandra D., d'Andigné, Antoine, de La Jonquière, Hubert, Delfosse, Aurélien, Denoyer, Ludovic, Deprez, Alexis, Derupti, Augustin, Eickenberg, Michael, Federico, Mathïs, Kantor, Charles, Koegler, Xavier, Labbé, Yann, Lee, Matthew C. H., de Kergaradec, Erwan Le Jumeau, Mahla, Amir, Manevich, Avshalom, Maret, Adrien, Masson, Charles, Maurin, Rafaël, Mena, Arturo, Modard, Philippe, Moyal, Axel, Kerbel, Axel Nguyen, Revelle, Julien, Richter, Mats L., Santos, María, Sifre, Laurent, Theillard, Maxime, Thibault, Marc, Thiry, Louis, Tronchon, Léo, Usunier, Nicolas, Wu, Tony

arXiv.org Artificial Intelligence

Building AI agents requires designing systems capable of acting in and adapting to dynamic digital environments in real time. In this context, Large Language Models (LLMs) have made remarkable progress in reasoning and problem solving, rivaling or even surpassing human experts in domain-specific tasks [12, 32]. However, in their most fundamental form, LLMs are confined to a static, pre-trained world: they cannot act, verify, or access up-to-date information. For instance, they cannot answer questions about current events, book a restaurant table, or avoid hallucination [30, 35]. To circumvent these limitations, research has focused on enhancing LLMs with tool-use capabilities, enabling them to execute code snippets [7, 29], query Application Programming Interfaces (APIs) [18, 31], or retrieve information at scale with multi-step reasoning [33, 38, 24, 26]. These systems, often referred to as agents, extend LLMs into more capable virtual assistants [36]. However, their real-world utility remains bounded by the available predefined tools and the engineering effort required to expand them [13]. Approaching this problem from another angle, computer use agents have recently emerged as a new paradigm in which agents interact with software directly through Graphical User Interfaces (GUIs) [1, 8, 11, 15, 17, 23, 39], i.e. using the same interface humans are presented with. This approach avoids relying on custom integrations or APIs, opening the door to more adaptable general-purpose agents with higher potential and broader real-world utility.


Towards Fine-Grained Video Question Answering

Dai, Wei, Luo, Alan, Durante, Zane, Dash, Debadutta, Milstein, Arnold, Schulman, Kevin, Adeli, Ehsan, Fei-Fei, Li

arXiv.org Artificial Intelligence

In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.


Otter: Generating Tests from Issues to Validate SWE Patches

Ahmed, Toufique, Ganhotra, Jatin, Pan, Rangeet, Shinnar, Avraham, Sinha, Saurabh, Hirzel, Martin

arXiv.org Artificial Intelligence

While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. In this work, we focus on the scenario where the code patch does not exist yet. This approach supports two major use-cases. First, it supports TDD (test-driven development), the discipline of "test first, write code later" that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.


The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units

AlKhamissi, Badr, Tuckute, Greta, Bosselut, Antoine, Schrimpf, Martin

arXiv.org Artificial Intelligence

Large language models (LLMs) exhibit remarkable capabilities on not just language tasks, but also various tasks that are not linguistic in nature, such as logical reasoning and social inference. In the human brain, neuroscience has identified a core language system that selectively and causally supports language processing. We here ask whether similar specialization for language emerges in LLMs. We identify language-selective units within 18 popular LLMs, using the same localization approach that is used in neuroscience. We then establish the causal role of these units by demonstrating that ablating LLM language-selective units -- but not random units -- leads to drastic deficits in language tasks. Correspondingly, language-selective LLM units are more aligned to brain recordings from the human language system than random units. Finally, we investigate whether our localization method extends to other cognitive domains: while we find specialized networks in some LLMs for reasoning and social capabilities, there are substantial differences among models. These findings provide functional and causal evidence for specialization in large language models, and highlight parallels with the functional organization in the brain.
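The causal test described above (ablating selective units versus random units and comparing the resulting deficits) can be demonstrated with a toy readout that genuinely depends on a small set of units. Everything below is synthetic: a linear "task" whose signal lives in ten designated units, lesioned by zeroing activations.

```python
import numpy as np

# Toy illustration of the selective-vs-random ablation comparison: a linear
# task readout depends on units 0-9, so lesioning those units collapses
# performance while lesioning random other units leaves it intact. Weights
# and activations are synthetic stand-ins for a real LLM.

rng = np.random.default_rng(1)
n_units, n_examples = 100, 200
selective = np.arange(10)                  # units carrying the task signal
w = np.zeros(n_units)
w[selective] = 1.0                         # readout weights
acts = rng.normal(size=(n_examples, n_units))
labels = (acts @ w > 0).astype(int)        # ground truth from the intact readout

def accuracy(lesioned: np.ndarray) -> float:
    masked = acts.copy()
    masked[:, lesioned] = 0.0              # ablate by zeroing activations
    return float(((masked @ w > 0).astype(int) == labels).mean())

random_units = rng.choice(np.arange(10, n_units), size=10, replace=False)
print("lesion selective:", accuracy(selective))
print("lesion random:   ", accuracy(random_units))
```

The asymmetry between the two lesions is the functional signature the paper uses as causal evidence for specialization.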


Error Decomposition for Hybrid Localization Systems

Flade, Benedict, Kohaut, Simon, Eggert, Julian

arXiv.org Artificial Intelligence

Future advanced driver assistance systems and autonomous vehicles rely on accurate localization, which can be divided into three classes: a) viewpoint localization relative to local references (e.g., via vision-based localization), b) absolute localization relative to a global reference system (e.g., via satellite navigation), and c) hybrid localization, a combination of the former two. Hybrid localization shares characteristics and strengths of both absolute and viewpoint localization. However, new sources of error, such as inaccurate sensor-setup calibration, compound the potential errors of the respective sub-systems. Therefore, this paper introduces a general approach to analyzing error sources in hybrid localization systems. More specifically, we propose the Kappa-Phi method, which decomposes localization errors into individual components, i.e., into a sum of parameterized functions of the measured state (e.g., agent kinematics). The error components can then be leveraged to, e.g., improve localization predictions, correct map data, or calibrate sensor setups. Theoretical derivations and evaluations show that the algorithm presents a promising approach to improving hybrid localization and countering the weaknesses of the system's individual components.
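Decomposing an error signal into a sum of parameterized functions of the measured state, as sketched above, amounts to a linear least-squares fit over chosen basis functions. The basis (constant bias, a speed-linear term, a heading-sinusoid) and the synthetic data below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hedged sketch of error decomposition in the spirit of the Kappa-Phi method:
# model localization error as e(x) ~ sum_i a_i * phi_i(x) over state x
# (here speed and heading) and fit the coefficients a_i by least squares.

rng = np.random.default_rng(2)
speed = rng.uniform(0, 30, 500)            # m/s
heading = rng.uniform(-np.pi, np.pi, 500)  # rad

# Synthetic "true" error: constant bias + speed-dependent + heading-dependent
# components, plus small measurement noise.
error = 0.5 + 0.02 * speed + 0.3 * np.sin(heading) + rng.normal(0, 0.01, 500)

# Design matrix of basis functions phi_i(state).
Phi = np.column_stack([np.ones_like(speed), speed, np.sin(heading)])
coef, *_ = np.linalg.lstsq(Phi, error, rcond=None)
print(coef)  # should recover approximately [0.5, 0.02, 0.3]
```

Once fitted, each coefficient isolates one error source, which is what allows the components to be used separately, e.g., for sensor calibration versus map correction.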