Self-Chained Image-Language Model for Video Localization and Question Answering

Neural Information Processing Systems

Recent studies have shown promising results from utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and question answering on videos.
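The self-chaining idea above can be sketched in a few lines: the same model is applied twice, first as a query-aware keyframe localizer over uniformly sampled frames, then as an answerer restricted to the selected frames. The scoring and answering functions below are toy stand-ins (simple word overlap on frame captions), not BLIP-2.

```python
# Minimal sketch of a self-chained localize-then-answer pipeline, in the
# spirit of SeViLA. The relevance scorer and answerer are hypothetical
# placeholders for the two roles one image-language model plays.

def relevance_score(frame_caption: str, query: str) -> float:
    """Toy localizer: fraction of query words found in the frame caption."""
    q = set(query.lower().split())
    f = set(frame_caption.lower().split())
    return len(q & f) / max(len(q), 1)

def localize(frames: list[str], query: str, k: int = 2) -> list[int]:
    """Rank uniformly sampled frames by query relevance and keep the top-k."""
    ranked = sorted(range(len(frames)),
                    key=lambda i: relevance_score(frames[i], query),
                    reverse=True)
    return sorted(ranked[:k])  # restore temporal order

def answer(frames: list[str], query: str, keyframes: list[int]) -> str:
    """Toy answerer: responds using only the selected keyframes."""
    return " / ".join(frames[i] for i in keyframes)

frames = ["a man enters a kitchen", "the man chops onions",
          "the man cries", "credits roll"]
query = "why does the man cry"
keys = localize(frames, query)
print(keys, "->", answer(frames, query, keys))
```

The point of the chain is that the answerer never sees the irrelevant frames, which is what uniform sampling alone cannot guarantee.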



Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models

Jamaa, Yassine, AlKhamissi, Badr, Ghosh, Satrajit, Schrimpf, Martin

arXiv.org Artificial Intelligence

This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets and localization methods that more accurately capture task-specific units.
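The contrast-localizer procedure described above can be sketched concretely: record unit activations on a target stimulus set and a control set, rank units by the mean activation difference, and build a lesion mask over the top-ranked units. The arrays below are synthetic stand-ins for real model activations, with a known group of responsive units planted for illustration.

```python
import numpy as np

# Hedged sketch of contrast-based localization: rank units by their mean
# activation difference between target and control stimuli, then construct
# an ablation mask over the top-k units. Data are synthetic, not from a
# real LLM.

rng = np.random.default_rng(0)
n_stimuli, n_units = 40, 100
target_acts = rng.normal(0.0, 1.0, (n_stimuli, n_units))
target_acts[:, :10] += 2.0  # planted: units 0-9 respond to the target condition
control_acts = rng.normal(0.0, 1.0, (n_stimuli, n_units))

# Localization: mean contrast (target minus control) per unit.
contrast = target_acts.mean(axis=0) - control_acts.mean(axis=0)
top_units = np.argsort(contrast)[::-1][:10]

# Ablation: a lesion mask that zeroes out the selected units.
mask = np.ones(n_units)
mask[top_units] = 0.0

print(sorted(top_units.tolist()))
```

The paper's finding is precisely that lesioning `top_units` selected this way does not always produce the largest downstream deficit, which is what motivates questioning the localizer.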


Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

Andreux, Mathieu, Skuk, Breno Baldas, Benchekroun, Hamza, Biré, Emilien, Bonnet, Antoine, Bordie, Riaz, Bout, Nathan, Brunel, Matthias, Cedoz, Pierre-Louis, Chassang, Antoine, Chen, Mickaël, Constantinou, Alexandra D., d'Andigné, Antoine, de La Jonquière, Hubert, Delfosse, Aurélien, Denoyer, Ludovic, Deprez, Alexis, Derupti, Augustin, Eickenberg, Michael, Federico, Mathïs, Kantor, Charles, Koegler, Xavier, Labbé, Yann, Lee, Matthew C. H., de Kergaradec, Erwan Le Jumeau, Mahla, Amir, Manevich, Avshalom, Maret, Adrien, Masson, Charles, Maurin, Rafaël, Mena, Arturo, Modard, Philippe, Moyal, Axel, Kerbel, Axel Nguyen, Revelle, Julien, Richter, Mats L., Santos, María, Sifre, Laurent, Theillard, Maxime, Thibault, Marc, Thiry, Louis, Tronchon, Léo, Usunier, Nicolas, Wu, Tony

arXiv.org Artificial Intelligence

Building AI agents requires designing systems capable of acting in and adapting to dynamic digital environments in real time. In this context, Large Language Models (LLMs) have made remarkable progress in reasoning and problem solving, rivaling or even surpassing human experts in domain-specific tasks [12, 32]. However, in their most fundamental form, LLMs are confined to a static, pre-trained world: they cannot act, verify, or access up-to-date information. For instance, they cannot answer questions about current events, book a restaurant table, or avoid hallucination [30, 35]. To circumvent these limitations, research has focused on enhancing LLMs with tool-use capabilities, enabling them to execute code snippets [7, 29], query Application Programming Interfaces (APIs) [18, 31], or retrieve information at scale with multi-step reasoning [33, 38, 24, 26]. These systems, often referred to as agents, extend LLMs into more capable virtual assistants [36]. However, their real-world utility remains bounded by the available predefined tools and the engineering effort required to expand them [13]. Approaching this problem from another angle, computer use agents have recently emerged as a new paradigm in which agents interact with software directly through Graphical User Interfaces (GUIs) [1, 8, 11, 15, 17, 23, 39], i.e. using the same interface humans are presented with. This approach avoids relying on custom integrations or APIs, opening the door to more adaptable general-purpose agents with higher potential and broader real-world utility.


Towards Fine-Grained Video Question Answering

Dai, Wei, Luo, Alan, Durante, Zane, Dash, Debadutta, Milstein, Arnold, Schulman, Kevin, Adeli, Ehsan, Fei-Fei, Li

arXiv.org Artificial Intelligence

In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.


Otter: Generating Tests from Issues to Validate SWE Patches

Ahmed, Toufique, Ganhotra, Jatin, Pan, Rangeet, Shinnar, Avraham, Sinha, Saurabh, Hirzel, Martin

arXiv.org Artificial Intelligence

While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. In this work, we focus on the scenario where the code patch does not exist yet. This approach supports two major use-cases. First, it supports TDD (test-driven development), the discipline of "test first, write code later" that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.


The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units

AlKhamissi, Badr, Tuckute, Greta, Bosselut, Antoine, Schrimpf, Martin

arXiv.org Artificial Intelligence

Large language models (LLMs) exhibit remarkable capabilities on not just language tasks, but also various tasks that are not linguistic in nature, such as logical reasoning and social inference. In the human brain, neuroscience has identified a core language system that selectively and causally supports language processing. We here ask whether similar specialization for language emerges in LLMs. We identify language-selective units within 18 popular LLMs, using the same localization approach that is used in neuroscience. We then establish the causal role of these units by demonstrating that ablating LLM language-selective units -- but not random units -- leads to drastic deficits in language tasks. Correspondingly, language-selective LLM units are more aligned to brain recordings from the human language system than random units. Finally, we investigate whether our localization method extends to other cognitive domains: while we find specialized networks in some LLMs for reasoning and social capabilities, there are substantial differences among models. These findings provide functional and causal evidence for specialization in large language models, and highlight parallels with the functional organization in the brain.
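The causal test described above (ablating selective units versus random units and comparing the resulting deficits) can be demonstrated with a toy readout that genuinely depends on a small set of units. Everything below is synthetic: a linear "task" whose signal lives in ten designated units, lesioned by zeroing activations.

```python
import numpy as np

# Toy illustration of the selective-vs-random ablation comparison: a linear
# task readout depends on units 0-9, so lesioning those units collapses
# performance while lesioning random other units leaves it intact. Weights
# and activations are synthetic stand-ins for a real LLM.

rng = np.random.default_rng(1)
n_units, n_examples = 100, 200
selective = np.arange(10)                  # units carrying the task signal
w = np.zeros(n_units)
w[selective] = 1.0                         # readout weights
acts = rng.normal(size=(n_examples, n_units))
labels = (acts @ w > 0).astype(int)        # ground truth from the intact readout

def accuracy(lesioned: np.ndarray) -> float:
    masked = acts.copy()
    masked[:, lesioned] = 0.0              # ablate by zeroing activations
    return float(((masked @ w > 0).astype(int) == labels).mean())

random_units = rng.choice(np.arange(10, n_units), size=10, replace=False)
print("lesion selective:", accuracy(selective))
print("lesion random:   ", accuracy(random_units))
```

The asymmetry between the two lesions is the functional signature the paper uses as causal evidence for specialization.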


Error Decomposition for Hybrid Localization Systems

Flade, Benedict, Kohaut, Simon, Eggert, Julian

arXiv.org Artificial Intelligence

Future advanced driver assistance systems and autonomous vehicles rely on accurate localization, which can be divided into three classes: a) viewpoint localization relative to local references (e.g., via vision-based localization), b) absolute localization relative to a global reference system (e.g., via satellite navigation), and c) hybrid localization, a combination of the former two. Hybrid localization shares characteristics and strengths of both absolute and viewpoint localization. However, new sources of error, such as inaccurate sensor-setup calibration, compound the potential errors of the respective sub-systems. Therefore, this paper introduces a general approach to analyzing error sources in hybrid localization systems. More specifically, we propose the Kappa-Phi method, which decomposes localization errors into individual components, i.e., into a sum of parameterized functions of the measured state (e.g., agent kinematics). The error components can then be leveraged to, e.g., improve localization predictions, correct map data, or calibrate sensor setups. Theoretical derivations and evaluations show that the algorithm presents a promising approach to improving hybrid localization and countering the weaknesses of the system's individual components.
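Decomposing an error signal into a sum of parameterized functions of the measured state, as sketched above, amounts to a linear least-squares fit over chosen basis functions. The basis (constant bias, a speed-linear term, a heading-sinusoid) and the synthetic data below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hedged sketch of error decomposition in the spirit of the Kappa-Phi method:
# model localization error as e(x) ~ sum_i a_i * phi_i(x) over state x
# (here speed and heading) and fit the coefficients a_i by least squares.

rng = np.random.default_rng(2)
speed = rng.uniform(0, 30, 500)            # m/s
heading = rng.uniform(-np.pi, np.pi, 500)  # rad

# Synthetic "true" error: constant bias + speed-dependent + heading-dependent
# components, plus small measurement noise.
error = 0.5 + 0.02 * speed + 0.3 * np.sin(heading) + rng.normal(0, 0.01, 500)

# Design matrix of basis functions phi_i(state).
Phi = np.column_stack([np.ones_like(speed), speed, np.sin(heading)])
coef, *_ = np.linalg.lstsq(Phi, error, rcond=None)
print(coef)  # should recover approximately [0.5, 0.02, 0.3]
```

Once fitted, each coefficient isolates one error source, which is what allows the components to be used separately, e.g., for sensor calibration versus map correction.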