AITopics | Large Language Model

Large Language Model

6ebb92aad3a4fe7aae230b0e63c2ef35-Paper-Conference.pdf

Neural Information Processing Systems, Jun-18-2026, 07:24:26 GMT

Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like 'bouba' with round shapes and 'kiki' with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and a Grad-CAM analysis, used here as a novel approach to interpreting visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both model variants lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
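As a rough illustration of what "probabilities as a measure of model preference" can mean for a CLIP-style model, the sketch below turns image-text similarity scores into a probability per candidate pseudoword with a softmax. This is an assumption about the general technique, not the paper's evaluation code: the function name `label_probabilities`, the scale factor of 100, and the similarity values are all hypothetical.

```python
import math
from typing import Dict


def label_probabilities(similarity: Dict[str, float], scale: float = 100.0) -> Dict[str, float]:
    """Softmax over scaled image-text similarities, one entry per candidate label.

    `scale` is an assumed temperature-like factor, not a value from the paper.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(similarity.values())
    exps = {label: math.exp(scale * (s - top)) for label, s in similarity.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical similarities between one round shape and two pseudoword prompts.
sims = {"bouba": 0.21, "kiki": 0.19}
probs = label_probabilities(sims)
preferred = max(probs, key=probs.get)
```

Averaging such probabilities over many shapes and prompts would give a preference score comparable in spirit to a human forced-choice rate.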

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
(2 more...)

MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

Neural Information Processing Systems, Jun-18-2026, 07:22:33 GMT

We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format delivers substantial compression of KV cache up to 45% of dense inference and thereby enables longer context lengths and increased tokens/sec throughput of up to 2.23× compared to dense inference.
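The two ideas in this abstract, per-token magnitude-based pruning and a bitmap-based sparse format, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MUSTAFAR kernel: the function names `prune_token`, `to_bitmap`, and `sparse_dot` are hypothetical, the example vector is made up, and a real implementation would run on the GPU over whole cache tiles.

```python
from typing import List, Tuple


def prune_token(vec: List[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude entries of one token's vector."""
    n_drop = round(len(vec) * sparsity)
    # Indices in ascending order of magnitude; the first n_drop are pruned.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else v for i, v in enumerate(vec)]


def to_bitmap(vec: List[float]) -> Tuple[int, List[float]]:
    """Pack a pruned vector as (bitmask, nonzero values in index order)."""
    mask = 0
    values: List[float] = []
    for i, v in enumerate(vec):
        if v != 0.0:
            mask |= 1 << i  # bit i marks a surviving element
            values.append(v)
    return mask, values


def sparse_dot(mask: int, values: List[float], query: List[float]) -> float:
    """Dot product against the compressed vector without decompressing it."""
    total = 0.0
    k = 0  # position in the packed value list
    for i, q in enumerate(query):
        if (mask >> i) & 1:
            total += values[k] * q
            k += 1
    return total


# Hypothetical 10-element key vector for a single token, pruned to 70% sparsity.
key = [0.1, -2.0, 0.05, 1.5, -0.2, 0.3, 0.0, -0.9, 0.02, 0.4]
pruned = prune_token(key, 0.7)
mask, values = to_bitmap(pruned)
score = sparse_dot(mask, values, [1.0] * len(key))
```

Because the bitmap records an arbitrary set of surviving positions, no structure is imposed on which elements are kept, which is what distinguishes this from structured pruning schemes.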

large language model, machine learning, pruning, (22 more...)

Neural Information Processing Systems

Country:

North America > United States (0.67)
Europe (0.46)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

SOFAR: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Neural Information Processing Systems, Jun-18-2026, 07:22:28 GMT

While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation, a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SOFAR framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of SOFAR, e.g., a zero-shot success rate of 48.7% on Open6DOR and a zero-shot success rate of 74.9% on SIMPLER-Env.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Industry:

Leisure & Entertainment (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting

Neural Information Processing Systems, Jun-18-2026, 07:18:15 GMT

Recent advances in feed-forward 3D Gaussian Splatting have led to rapid improvements in efficient scene reconstruction from sparse views. However, most existing approaches construct Gaussian primitives directly aligned with the pixels in one or more of the input images. This leads to redundancies in the representation when input views overlap and constrains the position of the primitives to lie along the input rays without full flexibility in 3D space. Moreover, these pixel-aligned approaches do not naturally generalize to dynamic scenes, where effectively leveraging temporal information requires resolving both redundant and newly appearing content across frames. To address these limitations, we introduce a novel Fuse-and-Refine module that enhances existing feed-forward models by merging and refining the primitives in a canonical 3D space.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology (0.92)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
(2 more...)

MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Neural Information Processing Systems, Jun-18-2026, 07:06:31 GMT

Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English, neglecting other languages that are essential in the training mix for multilingual LLMs. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a multilingual autorater capable of handling 17 languages. MuRating aggregates multiple English autoraters via pairwise comparisons to learn unified document quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain LLaMA-architecture models of 1.2B and 7B parameters. Compared to strong baselines, including QuRater, FineWeb2HQ, AskLLM, and DCLM, our approach increases average accuracy on both English benchmarks and multilingual evaluations. Extensive analyses further validate that pairwise training provides greater stability and robustness than pointwise scoring, underscoring the effectiveness of MuRating as a general multilingual data-selection framework.
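One common way to turn pairwise preferences into a single score per document is a Bradley-Terry fit, sketched below. The abstract does not say which model MuRating uses, so this is an assumed stand-in for illustration only: the function name `fit_scores`, the document IDs, and the votes are hypothetical, and each (winner, loser) pair stands for one rater's judgment.

```python
import math
from typing import Dict, List, Tuple


def fit_scores(pairs: List[Tuple[str, str]], steps: int = 500, lr: float = 0.1) -> Dict[str, float]:
    """Fit one scalar quality score per document from (winner, loser) pairs.

    Plain gradient ascent on the Bradley-Terry log-likelihood (an assumed model).
    """
    docs = sorted({d for pair in pairs for d in pair})
    score = {d: 0.0 for d in docs}
    for _ in range(steps):
        grad = {d: 0.0 for d in docs}
        for winner, loser in pairs:
            # Probability that `winner` beats `loser` under the current scores.
            p = 1.0 / (1.0 + math.exp(score[loser] - score[winner]))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for d in docs:
            score[d] += lr * grad[d]
    return score


# Hypothetical votes from several raters; each document wins and loses at least once.
votes = [
    ("doc_a", "doc_b"), ("doc_a", "doc_b"), ("doc_b", "doc_a"),
    ("doc_b", "doc_c"), ("doc_b", "doc_c"), ("doc_c", "doc_b"),
    ("doc_a", "doc_c"), ("doc_a", "doc_c"), ("doc_c", "doc_a"),
]
scores = fit_scores(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Pairwise votes only constrain score differences, so disagreements between raters on absolute scales do not need to be reconciled before aggregation.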

large language model, machine learning, murater, (17 more...)

Neural Information Processing Systems

Country:

Asia (0.93)
North America > Mexico (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Impromptu Mapillary ONCENA

Neural Information Processing Systems, Jun-18-2026, 07:05:05 GMT

Dataset achieve significant performance improvements in both closed-loop and open-loop metrics.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country: Asia (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)
Information Technology (0.70)
Transportation > Infrastructure & Services (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Neural Information Processing Systems, Jun-18-2026, 06:55:21 GMT

Vision-Language Models (VLMs) have achieved remarkable success in video understanding tasks. Yet, a key question remains: do they comprehend visual information, or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positive-control tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Neural Information Processing Systems, Jun-18-2026, 06:45:37 GMT

Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Supplementary Materials. AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark. Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S. Adve, Yu-Xiong Wang

Neural Information Processing Systems, Jun-18-2026, 05:58:16 GMT

Our evaluation and analysis are conducted mainly on the group of models listed in Table 2 in the main paper. We have chosen models such that they cover most of the popular and best-performing methods used by recent multimodal understanding work. In this part, we discuss all the models we have used in our experiments and explain their evaluation details, the public checkpoints we have chosen, and the prompts we used to adapt each model to our datasets.

During evaluation, we chose to follow the standard prompt provided by the authors whenever possible for multiple-choice and short-answer questions. When no prompt is provided for a model, we use a custom prompt, chosen through several iterations of prompt engineering as the one that produces the most effective results. The images are always included as the prefix.

We used three proprietary models in our evaluation: GPT-o4-mini [1], Gemini 1.5 Pro [9], and Claude 3 Haiku [10]. Below we note the model API version used for evaluation. GPT-o4-mini: May 13-15, 2025.

Cambrian-1 is a recent state-of-the-art model that excels at visual-centric tasks. This model explores combinations of vision encoders, text and image integration techniques, and instruction tuning strategies. We use the official implementation and checkpoint with a LLaMA3-8B-Instruct LLM backbone model in our evaluation.

InternVL scales up the vision foundation model while aligning it with the backbone LLM, and is trained on web-scale image-text data to achieve strong performance across a variety of vision-centric tasks. We use the official implementation and checkpoint with the InternViT-300M-448px vision backbone and Internlm2.5-7B-chat

LLaMA-3.2 is the first collection of multimodal large language models from the LLaMA family, which was previously text-only. The integration of vision involves utilizing cross-attention layers and a pre-trained vision encoder that feeds directly into the text-processor. The model follows a commonly used training recipe that includes pretraining on noisy image-text pairs and then on high-quality knowledge-enhanced pairs. Notably, the language-model parameters were frozen during image-text alignment training to retain strong text-only capabilities. We use the official implementation and checkpoint that uses a LLaMA-3.1 text-only language backbone in our evaluation. When evaluating the model, we choose to use a custom prompt since no standard prompt is provided.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.34)

Industry:

Food & Agriculture > Agriculture > Pest Control (0.70)
Materials > Chemicals > Agricultural Chemicals (0.70)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Neural Information Processing Systems, Jun-18-2026, 05:58:14 GMT

Unlike prior datasets that rely on crowdsourced prompts, AGMMU is distilled from 116,231 authentic dialogues between everyday growers and USDA-authorized Cooperative Extension experts. Through a three-stage pipeline of automated knowledge extraction, QA generation, and human verification, we construct (i) AGMMU, an evaluation set of 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), and (ii) AGBASE, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. AGMMU has three key advantages. Authentic & Expert-Verified: All facts, images, and answers originate from real farmer and gardener inquiries answered by credentialed specialists, ensuring high-fidelity agricultural knowledge. Complete Development Suite: AGMMU uniquely couples a dual-format evaluation benchmark (MCQ and OEQ) with AGBASE, a large-scale training set, enabling both rigorous assessment and targeted improvement of VLMs. Knowledge-intensive Challenge: Our tasks demand the synergy of nuanced visual perception and domain expertise, exposing fundamental limitations of current general-purpose models and charting a path toward robust, application-ready agricultural AI. Benchmarking 12 leading VLMs reveals pronounced gaps in fine-grained perception and factual grounding. Open-sourced models trail proprietary ones by a wide margin. Simple fine-tuning on AGBASE boosts open-sourced model performance on challenging OEQs by up to 11.6% on average, narrowing this gap and motivating future research on better strategies for knowledge extraction and distillation from AGBASE. We hope AGMMU stimulates research on domain-specific knowledge integration and trustworthy decision support in agricultural AI development.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: North America > United States (1.00)

Genre: