AITopics | Large Language Model


Large Language Model


Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

Neural Information Processing Systems | Jun-23-2026, 01:47:17 GMT

Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGUARD, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Asia (1.00)
North America > United States > California (0.93)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)
Overview (0.68)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance > Economy (1.00)
Media > News (0.69)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


SEMPO: Lightweight Foundation Models for Time Series Forecasting

Neural Information Processing Systems | Jun-23-2026, 01:44:43 GMT

Despite impressive performance across diverse downstream forecasting tasks, existing time series foundation models (FMs) possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pre-training on relatively small-scale data, yet exhibits strong general time series forecasting. Concretely, SEMPO comprises two key modules: 1) an energy-aware SpEctral decomposition module that substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) a Mixture-of-PrOmpts enabled Transformer that learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at https://github.com/mala-lab/SEMPO.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Asia (0.28)
North America > United States > California (0.27)

Genre: Research Report > Experimental Study (1.00)

Industry:

Government (0.67)
Energy > Power Industry (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs

Neural Information Processing Systems | Jun-23-2026, 01:43:06 GMT

Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
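The greedy selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It rests on assumptions the abstract does not confirm: set-coverage is taken to be the sum, over all tokens, of each token's highest similarity to any selected token; the coverage gain is the increase in that sum; and saliency is folded in by simple multiplication. The function name `greedy_saliency_coverage` and its parameters are hypothetical.

```python
from typing import List, Sequence


def greedy_saliency_coverage(
    sim: Sequence[Sequence[float]],
    saliency: Sequence[float],
    k: int,
) -> List[int]:
    """Greedily pick up to k token indices by (coverage gain * saliency).

    sim[i][j] is an assumed pairwise similarity between tokens i and j.
    Multiplying gain by saliency is an illustrative choice, not the
    paper's confirmed scoring rule.
    """
    n = len(saliency)
    # covered[i]: best similarity of token i to any token selected so far.
    covered = [0.0] * n
    selected: List[int] = []
    for _ in range(min(k, n)):
        best_idx, best_score = -1, float("-inf")
        for j in range(n):
            if j in selected:
                continue
            # Extra coverage obtained across all tokens if j were added.
            gain = sum(max(sim[i][j] - covered[i], 0.0) for i in range(n))
            score = gain * saliency[j]
            if score > best_score:
                best_idx, best_score = j, score
        selected.append(best_idx)
        for i in range(n):
            covered[i] = max(covered[i], sim[i][best_idx])
    return selected
```

With two near-duplicate salient tokens and one distinct, less salient token, this sketch keeps one of the duplicates plus the distinct token, whereas ranking by saliency alone would keep both duplicates.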

benchmark, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States > Minnesota (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)


Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Neural Information Processing Systems | Jun-23-2026, 01:42:47 GMT

Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce ROBOT-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control.

arxiv preprint arxiv, large language model, machine learning, (21 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)


Dynamic Focused Masking for Embodied Occupancy Prediction

Neural Information Processing Systems | Jun-23-2026, 01:42:35 GMT

Visual autoregressive modeling has recently demonstrated potential in image tasks by enabling coarse-to-fine, next-level prediction. Most indoor 3D occupancy prediction methods, however, continue to rely on dense voxel grids and convolution-heavy backbones, which incur high computational costs when applying such coarse-to-fine frameworks. In contrast, cost-efficient alternatives based on Gaussian representations, particularly in the context of multi-scale autoregression, remain underexplored. To bridge this gap, we propose DFGauss, a Dynamic Focused masking framework for multi-scale 3D Gaussian representation. Unlike conventional approaches that refine voxel volumes or 2D projections, DFGauss directly operates in the 3D Gaussian parameter space, progressively refining representations across resolutions under hierarchical supervision. Each finer-scale Gaussian is conditioned on its coarser-level counterpart, forming a scale-wise autoregressive process. To further enhance efficiency, we introduce an importance-guided refinement strategy that selectively propagates informative Gaussians across scales, enabling spatially adaptive detail modeling. Experiments on 3D occupancy benchmarks demonstrate that DFGauss achieves competitive performance, highlighting the promise of autoregressive modeling for scalable 3D occupancy prediction.

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country: North America > United States > New Jersey (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.68)
(2 more...)


A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

Neural Information Processing Systems | Jun-23-2026, 01:42:08 GMT

Content moderation research has recently made significant advances, but remains limited in serving the majority of the world's languages due to the lack of resources, leaving millions of vulnerable users exposed to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge'ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments demonstrate that small fine-tuned models outperform prompted frontier large language models (LLMs) in the low-resource setting, achieving 86.67% F1 in abusiveness detection (7+ points over the best LLM), and maintain stronger performance in all other tasks. The benchmark is made public to promote research on online safety.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota (0.28)
Europe > Middle East > Malta (0.28)
Asia > Middle East > UAE (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Media (0.46)
Information Technology > Security & Privacy (0.46)
Law (0.46)
Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Neural Information Processing Systems | Jun-23-2026, 01:42:00 GMT

Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLMs' reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Government (1.00)
Information Technology (0.67)
Media > Film (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data

Neural Information Processing Systems | Jun-23-2026, 01:32:11 GMT

Grounding language in perception and action is a key challenge when building situated agents that can interact with humans, or other agents, via language. In the past, addressing this challenge has required manually designing the language grounding or curating massive datasets that associate language with the environment. We propose Ground-Compose-Reinforce, an end-to-end, neurosymbolic framework for training RL agents directly from high-level task specifications, without manually designed reward functions or other domain-specific oracles, and without massive datasets. These task specifications take the form of Reward Machines, automata-based representations that capture high-level task structure and are in some cases autoformalizable from natural language. Critically, we show that Reward Machines can be grounded using limited data by exploiting compositionality. Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications, including behaviours that never appear in pretraining, while non-compositional approaches fail.
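For readers unfamiliar with Reward Machines, the following is a minimal sketch of the general idea: a finite automaton whose transitions fire on high-level propositions and emit rewards. It is not the paper's framework. It assumes the true propositions at each step are already given, whereas the abstract describes learning that grounding from limited data. The class layout, the proposition names `holding_block` and `block_on_shelf`, and the example task are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# (state, proposition) -> (next_state, reward)
Transition = Dict[Tuple[str, str], Tuple[str, float]]


class RewardMachine:
    """Finite automaton that emits rewards on proposition-labelled edges."""

    def __init__(self, initial: str, transitions: Transition) -> None:
        self.initial = initial
        self.transitions = transitions

    def step(self, state: str, true_props: Set[str]) -> Tuple[str, float]:
        # Take the first matching edge; otherwise stay put with zero reward.
        for prop in sorted(true_props):
            if (state, prop) in self.transitions:
                return self.transitions[(state, prop)]
        return state, 0.0

    def run(self, trace: Iterable[Set[str]]) -> Tuple[str, float]:
        # Replay a sequence of proposition sets and accumulate reward.
        state, total = self.initial, 0.0
        for true_props in trace:
            state, reward = self.step(state, true_props)
            total += reward
        return state, total


# Hypothetical task: pick up the block, then place it on the shelf.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "holding_block"): ("u1", 0.0),
        ("u1", "block_on_shelf"): ("done", 1.0),
    },
)
```

Calling `rm.run(...)` on a trace in which `holding_block` holds before `block_on_shelf` reaches the `done` state with reward 1.0. A trace with the events in the opposite order earns no reward, which is how the automaton encodes task structure rather than a single goal condition.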

large language model, machine learning, reinforcement learning, (19 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.67)

Industry:

Education (0.46)
Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)


VMDT: Decoding the Trustworthiness of Video Foundation Models

Neural Information Processing Systems | Jun-23-2026, 01:31:42 GMT

As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve, though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.45)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.65)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(5 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)


Improved Representation Steering for Language Models

Neural Information Processing Systems | Jun-23-2026, 01:31:05 GMT

Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AXBENCH, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting, while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: