AITopics

Egor Burkov, Victor Lempitsky

Deep Neural Networks with Box Convolutions

Neural Information Processing Systems Feb-13-2026, 16:56:59 GMT

Due to its ability to integrate information over large boxes, the new layer facilitates long-range propagation of information and leads to the efficient increase of the receptive fields of network units.

artificial intelligence, convolution, machine learning, (17 more...)

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Russia (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Neural Information Processing Systems Nov-21-2025, 08:44:11 GMT

Learning What and Where to Draw

Scott E. Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, Honglak Lee

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (19 more...)

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Germany > Saarland > Saarbrücken (0.04)

Industry: Leisure & Entertainment > Sports (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.94)
(2 more...)

Egor Burkov, Victor Lempitsky

Deep Neural Networks with Box Convolutions

Neural Information Processing Systems Nov-20-2025, 18:16:41 GMT

artificial intelligence, deep learning, machine learning, (20 more...)

Country:

Asia > Russia (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.82)

Neural Information Processing Systems Oct-11-2025, 00:26:23 GMT

Appendix A

dataset, instruction, phrase respond, (17 more...)

Country: Asia > China (0.05)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Kreis, Marten, Kiefer, Benjamin

Real-Time Fusion of Visual and Chart Data for Enhanced Maritime Vision

arXiv.org Artificial Intelligence Jul-21-2025

This paper presents a novel approach to enhancing marine vision by fusing real-time visual data with chart information. Our system overlays nautical chart data onto live video feeds by accurately matching detected navigational aids, such as buoys, with their corresponding representations in chart data. To achieve robust association, we introduce a transformer-based end-to-end neural network that predicts bounding boxes and confidence scores for buoy queries, enabling the direct matching of image-domain detections with world-space chart markers. The proposed method is compared against baseline approaches, including a ray-casting model that estimates buoy positions via camera projection and a YOLOv7-based network extended with a distance estimation module. Experimental results on a dataset of real-world maritime scenes demonstrate that our approach significantly improves object localization and association accuracy in dynamic and challenging environments.

artificial intelligence, machine learning, natural language, (19 more...)

2507.1388

Country:

North America > United States (0.46)
Europe > Switzerland (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Government (0.46)
Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

arXiv.org Artificial Intelligence May-23-2025

GRIT: Teaching MLLMs to Think with Images

Fan, Yue, He, Xuehai, Yang, Diji, Zheng, Kaizhi, Kuo, Ching-Chen, Zheng, Yuting, Narayanaraju, Sravana Jyothi, Guan, Xinze, Wang, Xin Eric

Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.

large language model, machine learning, reinforcement learning, (21 more...)

2505.15879

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(3 more...)

Deperrois, Nicolas, Matsuo, Hidetoshi, Ruipérez-Campillo, Samuel, Vandenhirtz, Moritz, Laguna, Sonia, Ryser, Alain, Fujimoto, Koji, Nishio, Mizuho, Sutter, Thomas M., Vogt, Julia E., Kluckert, Jonas, Frauenfelder, Thomas, Blüthgen, Christian, Nooralahzadeh, Farhad, Krauthammer, Michael

RadVLM: A Multitask Conversational Vision-Language Model for Radiology

arXiv.org Artificial Intelligence Feb-5-2025

X-rays have played a fundamental role in medicine since their discovery in 1895 (Röntgen, 1895), and continue to be the most frequently used medical imaging modality worldwide due to their convenience and cost-effectiveness (Akhter et al., 2023). Chest X-ray (CXR) remains the most commonly performed radiological exam globally, particularly important for diagnosing and monitoring thoracic conditions such as pneumonia, heart failure, and lung cancer (Çallı et al., 2021). Problematically, the growing volume of CXRs and other imaging studies in recent years has led to a reduction in the time available for radiologists to thoroughly evaluate each case (Peng et al., 2022). As a result, in many countries, the responsibility of interpreting CXRs is often transferred to non-radiology physicians, who typically possess less specialized training and experience. This shift increases the risk of diagnostic errors or misinterpretations (Shammari et al., 2021; Peng et al., 2022). The shortage of trained personnel for CXR interpretation has led to the exploration of automated agents to assist physicians in diagnostic tasks. In recent years, various deep learning models have shown promise in clinical applications, such as the detection of conditions like COVID-19 pneumonia (Nishio et al., 2020) or pulmonary nodules (Homayounieh et al., 2021). Another extensively studied task is the automated generation of free text reports from CXR images using transformer-based architectures (Nooralahzadeh et al., 2021; Yang et al., 2023; Hyland et al., 2023; Chaves et al., 2024). These models can provide preliminary drafts summarizing key observations from the CXR, offering a potential enhancement to the diagnostic workflow.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2502.03333

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

arXiv.org Artificial Intelligence Jan-13-2025

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Li, You, Huang, Heyu, Chen, Chi, Huang, Kaiyu, Huang, Chao, Guo, Zonghao, Liu, Zhiyuan, Xu, Jinan, Li, Yuhua, Li, Ruixuan, Sun, Maosong

The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.

grounding, large language model, natural language, (16 more...)

2501.05767

Country: Asia > China (0.28)

Genre:

Research Report (0.70)
Workflow (0.69)

Industry: Transportation > Ground > Road (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.85)

Corbière, Charles, Roburin, Simon, Montariol, Syrielle, Bosselut, Antoine, Alahi, Alexandre

DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests

arXiv.org Artificial Intelligence Jan-8-2025

Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks to evaluate visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded with entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs' ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.

artificial intelligence, large language model, natural language, (18 more...)