AITopics

Country:

Asia > China (0.04)
Asia > Singapore (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.68)

Neural Information Processing SystemsFeb-16-2026, 13:50:33 GMT

af01716e08073368a7c8a62be46dba17-Paper-Conference.pdf

large language model, machine learning, natural language, (18 more...)

Country:

Asia > Middle East > Israel (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Neural Information Processing SystemsDec-27-2025, 02:29:19 GMT

Collaborative Score Distillation for Consistent Visual Editing

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video or 3D scene), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously.

collaborative score distillation, consistent visual editing, name change, (6 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsOct-9-2025, 04:50:09 GMT

af01716e08073368a7c8a62be46dba17-Paper-Conference.pdf

large language model, machine learning, natural language, (18 more...)

Country:

Asia > Middle East > Israel (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

arXiv.org Artificial IntelligenceOct-8-2025

Do AI Models Perform Human-like Abstract Reasoning Across Modalities?

Beger, Claas, Yi, Ryan, Fu, Shuhao, Moskvichev, Arseny, Tsai, Sarah W., Rajamanickam, Sivasankaran, Mitchell, Melanie

OpenAI's o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models' abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models' rules are often based on surface-level ``shortcuts'' and capture intended abstractions far less often than humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models' output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate it in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models' abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.

accuracy, large language model, machine learning, (17 more...)

2510.02125

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Energy (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Li, Jinghan Xu Yuyang Zhang Qixuan Cai Jiancheng Chen Keqiu

Preserving Cross-Modal Stability for Visual Unlearning in Multimodal Scenarios

arXiv.org Artificial IntelligenceSep-30-2025

Visual modality is the most vulnerable to privacy leakage in real-world multimodal applications like autonomous driving with visual and radar data; Machine unlearning removes specific training data from pre-trained models to address privacy leakage, however, existing methods fail to preserve cross-modal knowledge and maintain intra-class structural stability of retain data, leading to reduced overall and other modalities' performance during visual unlearning; to address these challenges, we propose a Cross-modal Contrastive Unlearning (CCU) framework, which integrates three key components: (a) selective visual unlearning: employing inverse contrastive learning to dissociate visual representations from their original semantics, (b) cross-modal knowledge retention: preserving other modalities' discriminability through semantic consistency, and (c) dual-set contrastive separation: preserving the model performance via isolation of structural perturbations between the unlearn set and retain set; extensive experiments on three datasets demonstrate the superiority of CCU, and our method achieves a 7.12% accuracy improvement with only 7% of the unlearning time compared to the top-accuracy baseline.

artificial intelligence, machine learning, modality, (16 more...)

2509.23895

Country: North America (0.28)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

arXiv.org Artificial IntelligenceSep-29-2025

Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning

Wang, Zhilin, Yang, Zhe, Luo, Yun, Li, Yafu, Qu, Xiaoye, Qiao, Ziqian, Zhang, Haoran, Zhan, Runzhe, Wong, Derek F., Zhou, Jizhe, Cheng, Yu

Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RL VR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition. "The sheet music is the language of musicians." Recent advancements in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have inspired researchers to explore the potential of developing AI musicians (Qu et al., 2025; Bradshaw & Colton, 2025; Wang et al., 2024). Given that sheet music is the universal language of musicians, the ability to read and interpret it is an essential step for AI musicians (Y uan et al., 2024; Wang et al., 2025). As illustrated in Figure 1, sheet music reasoning differs fundamentally from Music Knowledge QA (Li et al., 2024), which evaluates memorized knowledge, and from sheet music recognition (Chen et al., 2025a), which focuses on identifying notation from images.

large language model, machine learning, natural language, (22 more...)

2509.04059

Country: Asia > China (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

arXiv.org Artificial IntelligenceSep-22-2025

DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models

Shi, Zhiyi, Wang, Binjie, Si, Chongjie, Wu, Yichen, Kim, Junsik, Pfister, Hanspeter

Model editing aims to efficiently update a pre-trained model's knowledge without the need for time-consuming full retraining. While existing pioneering editing methods achieve promising results, they primarily focus on editing single-modal language models (LLMs). However, for vision-language models (VLMs), which involve multiple modalities, the role and impact of each modality on editing performance remain largely unexplored. To address this gap, we explore the impact of textual and visual modalities on model editing and find that: (1) textual and visual representations reach peak sensitivity at different layers, reflecting their varying importance; and (2) editing both modalities can efficiently update knowledge, but this comes at the cost of compromising the model's original capabilities. Based on our findings, we propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers. Additionally, we introduce a gating module within the more sensitive textual modality, allowing DualEdit to efficiently update new knowledge while preserving the model's original information. We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines as well as adapted LLM editing methods on different evaluation metrics. Codes are available at https://github.com/zhiyiscs/DualEdit

artificial intelligence, large language model, natural language, (16 more...)

2506.13638

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

arXiv.org Artificial IntelligenceSep-19-2025

Out-of-Sight Trajectories: Tracking, Fusion, and Prediction

Zhang, Haichao, Xu, Yi, Fu, Yun

Trajectory prediction is a critical task in computer vision and autonomous systems, playing a key role in autonomous driving, robotics, surveillance, and virtual reality. Existing methods often rely on complete and noise-free observational data, overlooking the challenges associated with out-of-sight objects and the inherent noise in sensor data caused by limited camera coverage, obstructions, and the absence of ground truth for denoised trajectories. These limitations pose safety risks and hinder reliable prediction in real-world scenarios. In this extended work, we present advancements in Out-of-Sight Trajectory (OST), a novel task that predicts the noise-free visual trajectories of out-of-sight objects using noisy sensor data. Building on our previous research, we broaden the scope of Out-of-Sight Trajectory Prediction (OOSTraj) to include pedestrians and vehicles, extending its applicability to autonomous driving, robotics, surveillance, and virtual reality. Our enhanced Vision-Positioning Denoising Module leverages camera calibration to establish a vision-positioning mapping, addressing the lack of visual references, while effectively denoising noisy sensor data in an unsupervised manner. Through extensive evaluations on the Vi-Fi and JRDB datasets, our approach achieves state-of-the-art performance in both trajectory denoising and prediction, significantly surpassing previous baselines. Additionally, we introduce comparisons with traditional denoising methods, such as Kalman filtering, and adapt recent trajectory prediction models to our task, providing a comprehensive benchmark. This work represents the first initiative to integrate vision-positioning projection for denoising noisy sensor trajectories of out-of-sight agents, paving the way for future advances. The code and preprocessed datasets are available at github.com/Hai-chao-Zhang/OST

artificial intelligence, machine learning, trajectory, (17 more...)

2509.15219

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.46)

Industry:

Information Technology (0.55)
Transportation (0.55)
Automobiles & Trucks (0.55)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(2 more...)

arXiv.org Artificial IntelligenceAug-27-2025

Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Sun, Jiangfeng, He, Sihao, Ou, Zhonghong, Song, Meina

Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU's interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

large language model, machine learning, natural language, (19 more...)

2508.18322

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)