AITopics | internvl3

Collaborating Authors

internvl3

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

Neural Information Processing SystemsJun-22-2026, 04:00:13 GMT

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable SemiOff-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. Then LVLM learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.93)
Asia (0.93)
Europe (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.92)
Workflow (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Add feedback

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

Neural Information Processing SystemsJun-13-2026, 23:35:46 GMT

artificial intelligence, machine learning, reinforcement learning, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.41)

Add feedback

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

van Sprang, Angela, Samson, Laurens, Lucic, Ana, Acar, Erman, Ghebreab, Sennay, Asano, Yuki M.

arXiv.org Artificial IntelligenceDec-10-2025

We introduce two new benchmarks REST and REST+(Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2512.08923

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

Kohaut, Simon, Ochs, Daniel, Zhang, Shun, Flade, Benedict, Eggert, Julian, Kersting, Kristian, Dhami, Devendra Singh

arXiv.org Artificial IntelligenceDec-2-2025

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2512.01095

Country:

Europe (1.00)
North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

Li, Chaoyu, Kulkarni, Yogesh, Fazli, Pooyan

arXiv.org Artificial IntelligenceDec-1-2025

The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%. Code and models will be released publicly.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.2142

Genre: Research Report (0.82)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Add feedback

ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Wang, Qineng, Huang, Wenlong, Zhou, Yu, Yin, Hang, Bao, Tianwei, Lyu, Jianwen, Liu, Weiyu, Zhang, Ruohan, Wu, Jiajun, Fei-Fei, Li, Li, Manling

arXiv.org Artificial IntelligenceNov-27-2025

Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition-affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2511.20937

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

When to Think and When to Look: Uncertainty-Guided Lookback

Bi, Jing, Bellos, Filippos, Guo, Junjia, Li, Yayuan, Huang, Chao, Tang, Yolo Y., Song, Luchuan, Liang, Susan, Zhang, Zhongfei Mark, Corso, Jason J., Xu, Chenliang

arXiv.org Artificial IntelligenceNov-26-2025

Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.

arxiv preprint arxiv, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.15613

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)

Add feedback

Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection

Peng, Chuang, Tao, Renshuai, Ren, Zhongwei, Liu, Xianglong, Wei, Yunchao

arXiv.org Artificial IntelligenceNov-25-2025

Automatic X-ray prohibited items detection is vital for security inspection and has been widely studied. Traditional methods rely on visual modality, often struggling with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a caption corpus consisting of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a "language-like modality". To enable this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: , , . Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new perspective for real-world X-ray inspection.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.18385

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Xia, Guoyang, Ding, Yifeng, Li, Fengfa, Ren, Lei, Chen, Wei, Feng, Fangxiang, Wang, Xiaojie

arXiv.org Artificial IntelligenceNov-25-2025

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.

artificial intelligence, large language model, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.17885

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Huang, Zhongzhen, Mu, Linjie, Zhu, Yakun, Zhao, Xiangyu, Zhang, Shaoting, Zhang, Xiaofan

arXiv.org Artificial IntelligenceNov-4-2025

Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose \textit{MedE$^2$}, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of \textit{MedE$^2$} in improving the reasoning performance of medical multimodal models. Notably, models trained with \textit{MedE$^2$} consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2505.23118

Country: Asia (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback