AITopics | Large Language Model

Large Language Model

On Learning-Curve Monotonicity for Maximum Likelihood Estimators

arXiv.org Machine Learning, Dec-12-2025

The property of learning-curve monotonicity, highlighted in a recent series of work by Loog, Mey and Viering, describes algorithms which only improve in average performance given more data, for any underlying data distribution within a given family. We establish the first nontrivial monotonicity guarantees for the maximum likelihood estimator in a variety of well-specified parametric settings. For sequential prediction with log loss, we show monotonicity (in fact complete monotonicity) of the forward KL divergence for Gaussian vectors with unknown covariance and either known or unknown mean, as well as for Gamma variables with unknown scale parameter. The Gaussian setting was explicitly highlighted as open in the aforementioned works, even in dimension 1. Finally we observe that for reverse KL divergence, a folklore trick yields monotonicity for very general exponential families. All results in this paper were derived by variants of GPT-5.2 Pro. Humans did not provide any proof strategies or intermediate arguments, but only prompted the model to continue developing additional results, and verified and transcribed its proofs.
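As a concrete illustration of what a monotone learning curve looks like, the simplest case in the abstract (a one-dimensional Gaussian with known mean and unknown variance) admits a closed form. This is a minimal sketch, not the paper's proof: it assumes "forward KL" means KL(true || fitted), and it uses the standard facts that n * s_hat^2 / s^2 follows a chi-square distribution with n degrees of freedom, that E[ln chi2_n] = psi(n/2) + ln 2, and that E[1/chi2_n] = 1/(n-2) for n > 2. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_half_integer(n):
    """Return psi(n/2) for a positive integer n, using psi(x+1) = psi(x) + 1/x."""
    if n % 2 == 0:
        x, value = 1.0, -EULER_GAMMA                          # psi(1)
    else:
        x, value = 0.5, -EULER_GAMMA - 2.0 * math.log(2.0)    # psi(1/2)
    target = n / 2.0
    while x < target - 1e-9:
        value += 1.0 / x
        x += 1.0
    return value


def expected_forward_kl(n):
    """E[KL(N(0, s^2) || N(0, s_hat^2))] for the variance MLE on n samples.

    Assumes known mean 0; the result does not depend on the true variance.
    """
    if n <= 2:
        raise ValueError("the expectation is infinite for n <= 2")
    return 0.5 * (digamma_half_integer(n) - math.log(n / 2.0) + 2.0 / (n - 2))


# Expected loss as a function of sample size, n = 3..200.
curve = [expected_forward_kl(n) for n in range(3, 201)]
is_monotone = all(a > b for a, b in zip(curve, curve[1:]))
```

Under these assumptions the curve decreases at every step over the range checked and behaves like 1/(2n) for large n. A numerical check over a finite range is of course much weaker than the guarantees the abstract describes.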

monotonicity, proposition 1, theorem 1, (16 more...)

arXiv.org Machine Learning

2512.1022

Country:

North America > United States (0.14)
Asia > Japan > Honshū > Chūbu > Aichi Prefecture > Nagoya (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)
(3 more...)

Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs

Wu, Skyler, Echarghaoui, Aymen

arXiv.org Machine Learning, Dec-12-2025

Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and an online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all 4 of our strategies outperform the current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More thought-provokingly, however, we find that single-reference DPO, using any of 6 out of 7 references, consistently outperforms all tested multiple-reference approaches -- calling into question the practical appeal of multiple-reference approaches.
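The bandit view of reference weighting can be sketched with a generic Beta-Bernoulli Thompson Sampling loop. This is an illustrative toy, not the authors' implementation: the abstract does not say how rewards are defined, so the binary reward here (for example, whether a training step with that reference improved a preference-accuracy signal) and the success rates in `simulate` are assumptions, as are all class and function names.

```python
import random


class ThompsonReferenceSelector:
    """Beta-Bernoulli Thompson Sampling over K candidate reference models."""

    def __init__(self, num_references, seed=0):
        # Beta(1, 1) prior on each arm's (hypothetical) success probability.
        self.successes = [1.0] * num_references
        self.failures = [1.0] * num_references
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample per arm from its posterior and pick the largest.
        draws = [self.rng.betavariate(s, f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        # Binary reward: True counts as a success, False as a failure.
        if reward:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0


def simulate(success_rates, rounds, seed=0):
    """Run the selector against made-up Bernoulli arms; return pull counts."""
    selector = ThompsonReferenceSelector(len(success_rates), seed=seed)
    env = random.Random(seed + 1)
    counts = [0] * len(success_rates)
    for _ in range(rounds):
        arm = selector.select()
        reward = env.random() < success_rates[arm]
        selector.update(arm, reward)
        counts[arm] += 1
    return counts


counts = simulate([0.3, 0.5, 0.8], rounds=2000)
```

In this toy setting the selector concentrates its pulls on the arm with the highest success rate, which is the behaviour a bandit-based weighting scheme relies on.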

accuracy, preference optimization, reference model, (15 more...)

arXiv.org Machine Learning

2512.1004

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

Xu, Zhiyu, Liu, Jia, Wang, Yixin, Gu, Yuqi

arXiv.org Machine Learning, Dec-12-2025

The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to guide downstream applications and actionable future improvements. Item Response Theory (IRT) has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose Latency-Response Theory (LaRT) to jointly model the response accuracy and CoT length by introducing the latent ability, latent speed, and a key correlation parameter between them. We derive an efficient estimation algorithm and establish rigorous identifiability results for the population parameters to ensure the statistical validity of estimation. Theoretical asymptotic analyses and simulation studies demonstrate LaRT's advantages over IRT in terms of higher estimation accuracy and shorter confidence intervals for latent traits. A key finding is that the asymptotic estimation precision of the latent ability under LaRT exceeds that of IRT whenever the latent ability and latent speed are correlated. We collect real responses from diverse LLMs on popular benchmark datasets. The application of LaRT reveals a strong negative correlation between the latent ability and latent speed in all benchmarks, with stronger correlation for more difficult benchmarks. This finding supports the intuition that higher reasoning ability correlates with slower speed and longer response latency. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.
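For readers unfamiliar with IRT, the two ingredients mentioned above can be sketched in a few lines. This is a hedged illustration, not the LaRT model itself: the abstract does not state which IRT variant is used, so the two-parameter logistic (2PL) response function below is an assumption, and the bivariate-normal calculation only illustrates why a correlated second trait can carry information about ability. All function names are illustrative.

```python
import math


def prob_correct_2pl(ability, discrimination, difficulty):
    """2PL item response function: P(correct) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def ability_given_speed(speed, rho):
    """Conditional mean and variance of ability given speed.

    Assumes (ability, speed) is standard bivariate normal with correlation
    rho; this is a textbook identity, not a formula taken from the paper.
    """
    return rho * speed, 1.0 - rho ** 2


# A model whose ability equals the item difficulty answers correctly half the time.
p_even = prob_correct_2pl(ability=0.0, discrimination=1.0, difficulty=0.0)

# With correlation -0.6, observing speed shrinks the variance of ability to 0.64.
cond_mean, cond_var = ability_given_speed(speed=1.0, rho=-0.6)
```

The conditional variance 1 - rho^2 is strictly below 1 whenever rho is nonzero, which is consistent in spirit with the abstract's statement that precision improves whenever ability and speed are correlated.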

lart, latent ability, llm, (16 more...)

arXiv.org Machine Learning

2512.07019

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Michigan (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)

Genre: Research Report (1.00)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Linear socio-demographic representations emerge in Large Language Models from indirect cues

Bouchaud, Paul, Ramaciotti, Pedro

arXiv.org Artificial Intelligence, Dec-12-2025

We investigate how LLMs encode sociodemographic attributes of human conversational partners inferred from indirect cues such as names and occupations. We show that LLMs develop linear representations of user demographics within activation space, wherein stereotypically associated attributes are encoded along interpretable geometric directions. We first probe residual streams across layers of four open transformer-based LLMs (Magistral 24B, Qwen3 14B, GPT-OSS 20B, OLMo2-1B) prompted with explicit demographic disclosure. We show that the same probes predict demographics from implicit cues: names activate census-aligned gender and race representations, while occupations trigger representations correlated with real-world workforce statistics. These linear representations allow us to explain demographic inferences implicitly formed by LLMs during conversation. We demonstrate that these implicit demographic representations actively shape downstream behavior, such as career recommendations. Our study further highlights that models that pass bias benchmark tests may still harbor and leverage implicit biases, with implications for fairness when applied at scale.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2512.10065

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability

Lee, Yu-Ting, Chang, Fu-Chieh, Shih, Hui-Ying, Wu, Pei-Yuan

arXiv.org Artificial Intelligence, Dec-12-2025

Despite its empirical success, the mechanism of intrinsic self-correction remains unclear. Prior work has attributed it to reduced model uncertainty and argues that performance gains stem from activating task-relevant latent concepts, as shown by probing [3]. Complementarily, Liu et al. [4] probe morality in attention and MLP activations, contending that intrinsic moral self-correction may merely exploit a shortcut to produce more moral outputs. Along a related axis, Li et al. [8] identify model confidence as a crucial factor for intrinsic self-correction, and argue that ignoring it can cause over-criticism and unreliable assessments of self-correction efficacy. Theoretically, Wang et al. [9] view self-correction through in-context learning: self-examinations act as reward signals that let LLMs iteratively refine responses without parameter updates. What is missing is a mechanistic analysis of how self-correction prompts steer a model's internal representations. Specifically, existing works only reveal what is encoded in activations, but not how prompting causally displaces representations during generation. We directly analyze the displacement in representation space induced by prompting, leading us to ask: Does intrinsic self-correction prompting act as representation steering along interpretable latent directions? We approach this research question via mechanistic interpretability, with a methodology consisting of the following steps: (1) We define a prompt-induced shift from a self-correction prompt as the round-wise difference in activations.
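The notion of a prompt-induced shift in step (1) can be sketched on toy vectors. This is an assumption-laden illustration: the excerpt does not specify which layer or token position the activations come from, nor how alignment with a latent direction is measured, so the cosine-similarity comparison and the example numbers below are hypothetical.

```python
import math


def prompt_induced_shift(acts_before, acts_after):
    """Round-wise difference: activations after the prompt minus before it."""
    return [after - before for before, after in zip(acts_before, acts_after)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional activations for two consecutive rounds (made-up values).
round_0 = [1.0, 2.0, 3.0]
round_1 = [2.0, 2.0, 5.0]
shift = prompt_induced_shift(round_0, round_1)

# Compare the shift with two hypothetical latent directions.
aligned = cosine_similarity(shift, [1.0, 0.0, 2.0])
orthogonal = cosine_similarity(shift, [0.0, 1.0, 0.0])
```

A cosine near 1 would indicate that the prompt moves the representation along the candidate direction, while a cosine near 0 would indicate that the shift is unrelated to it.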

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.11924

Country: Asia > Taiwan (0.16)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

Gai, Jingchu, Zeng, Guanning, Zhang, Huaqing, Raghunathan, Aditi

arXiv.org Artificial Intelligence, Dec-12-2025

It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method -- differential smoothing -- that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on the AIME24 dataset.
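Pass@1 and Pass@k are commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below shows that estimator; whether the paper uses exactly this estimator is an assumption, and the sample counts are made up.

```python
import math


def pass_at_k(num_samples, num_correct, k):
    """Estimate the probability that at least one of k samples is correct.

    Uses 1 - C(n - c, k) / C(n, k), which requires k <= n.
    """
    if not 0 < k <= num_samples:
        raise ValueError("k must satisfy 0 < k <= num_samples")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - math.comb(num_samples - num_correct, k) / math.comb(num_samples, k)


# Made-up example: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
p10 = pass_at_k(10, 3, 10)
```

Averaged over problems, Pass@k for k > 1 rises when correct answers are spread across more problems, which is one reason it is often read alongside Pass@1 when diversity is at stake.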

large language model, machine learning, trajectory, (18 more...)

arXiv.org Artificial Intelligence

2511.19942

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models

Zhao, Xiangyu, Shen, Yaling, Jiang, Yiwen, Wang, Zimu, Liu, Jiahe, Cheng, Maxmartwell H, Oliveira, Guilherme C, Desimone, Robert, Dwyer, Dominic, Ge, Zongyuan

arXiv.org Artificial Intelligence, Dec-12-2025

Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health. Depression has emerged as a critical concern in the field of mental health, affecting a broad population across various age groups. Particularly, the incidence of depression among adolescents has surged over the past decade, raising significant social and public health concerns (Thapar et al., 2022).

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.19877

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

HunyuanOCR Technical Report

Hunyuan Vision Team, Lyu, Pengyuan, Wan, Xingyu, Li, Gengluo, Peng, Shangpin, Wang, Weinong, Wu, Liang, Shen, Huawen, Zhou, Yu, Tang, Canhui, Yang, Qi, Peng, Qiming, Luo, Bin, Yang, Hower, Zhang, Xinsong, Zhang, Jinnian, Peng, Houwen, Yang, Hongming, Xie, Senhao, Zhou, Longsha, Pei, Ge, Wu, Binghong, Yan, Rui, Wu, Kan, Yang, Jieneng, Wang, Bochao, Liu, Kai, Zhu, Jianchen, Jiang, Jie, Linus, Hu, Han, Zhang, Chengquan

arXiv.org Artificial Intelligence, Dec-12-2025

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.19575

Country: Asia (0.28)

Genre:

Workflow (1.00)
Research Report (1.00)

Industry:

Education (1.00)
Media (0.67)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Thinking Ahead: Foresight Intelligence in MLLMs and World Models

Gong, Zhantao, Fan, Liaoyuan, Guo, Qing, Xu, Xun, Yang, Xulei, Li, Shijie

arXiv.org Artificial Intelligence, Dec-12-2025

In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events -- an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.18735

Country: Asia (0.46)

Genre: Research Report (0.82)

Industry: Information Technology (0.49)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models

Han, Tianyang, Su, Junhao, Hu, Junjie, Yang, Peizhen, Shi, Hengyu, Luo, Junfeng, Gao, Jialin

arXiv.org Artificial Intelligence, Dec-12-2025

Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.18271

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
