Goto

Collaborating Authors

 Ranchi


DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization

Huang, Chengyu, Goyal, Tanya

arXiv.org Artificial Intelligence

Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response $y^+$ and dispreferred response $y^-$ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models' performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.


Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity

Bi, Zhen, Hu, Zhenlin, Yang, Jinnan, Chen, Mingyang, Deng, Cheng, Xue, Yida, Yang, Zeyu, Shen, Qing, Liu, Zhenfang, Zhao, Kang, Zhang, Ningyu, Lou, Jungang

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) highlight the importance of training data structure and quality in shaping reasoning behavior. However, most existing approaches focus on transforming data formats while neglecting the internal reasoning complexity of training samples, leaving the reasoning potential of data under-explored and underutilized. In this work, we posit that LLM logical reasoning performance is jointly constrained by the potential of the training data and the cognitive capacity of the model. To make this relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples by decomposing and aggregating their logical structures. This allows us to analyze how well current LLMs utilize logical reasoning signals and identify performance gaps relative to data potential. Based on this insight, we introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data. Rather than increasing data volume, our method re-optimizes existing samples to better align with the LLM's logical reasoning boundary. Extensive experiments show that our approach significantly improves performance and generalization over data-centric strategies. We further validate our method under a reinforcement learning framework. Our results indicate that prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs' full cognitive potential.


Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization

Won, Yunjae, Lee, Hyunji, Hwang, Hyeonbin, Seo, Minjoon

arXiv.org Artificial Intelligence

Direct Preference Optimization (DPO) has been widely used for aligning language models with human preferences in a supervised manner. However, several key questions remain unresolved: the rationale behind its log-ratio reward, how the statistical structure of preference datasets shapes its training dynamics, and how those dynamics impact downstream capabilities. We approach these questions from a Bayesian perspective, interpreting the goal of preference optimization as learning the differential information required to update a reference policy into a target policy. To formalize this view, we introduce the Differential Information Distribution (DID), defined as the distribution over samples that carry the Bayesian evidence required to update policies. We introduce three complementary insights by viewing preference optimization through the DID. First, we find that DPO's log-ratio reward is uniquely justified when preferences encode the Differential Information needed to update a reference policy into the target policy. Second, we discuss how commonly observed training dynamics in DPO, including changes in log-likelihood and policy exploration, stem from a power-law DID relationship. Finally, we analyze how training dynamics influence downstream performance using the entropy of DID, a principled measure of uncertainty in the learned information. We observe that learning high-entropy DID improves open-ended instruction-following, while low-entropy DID benefits knowledge-intensive QA. Taken together, our results show that DPO's reward design, training dynamics, and downstream capabilities all emerge as natural consequences of learning Differential Information, offering both a principled theoretical foundation and practical guidance for preference-based alignment.


Through the Valley: Path to Effective Long CoT Training for Small Language Models

Luo, Renjie, Li, Jiaxi, Huang, Chen, Lu, Wei

arXiv.org Artificial Intelligence

Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impacts downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.


Generative Propaganda

Daepp, Madeleine I. G., Cuevas, Alejandro, Ness, Robert Osazuwa, Wang, Vickie Yu-Ping, Nayak, Bharat Kumar, Mishra, Dibyendu, Cheng, Ti-Chung, Desai, Shaily, Pal, Joyojeet

arXiv.org Artificial Intelligence

Generative propaganda is the use of generative artificial intelligence (AI) to shape public opinion. To characterize its use in real-world settings, we conducted interviews with defenders (e.g., factcheckers, journalists, officials) in Taiwan and creators (e.g., influencers, political consultants, advertisers) as well as defenders in India, centering two places characterized by high levels of online propaganda. The term "deepfakes", we find, exerts outsized discursive power in shaping defenders' expectations of misuse and, in turn, the interventions that are prioritized. To better characterize the space of generative propaganda, we develop a taxonomy that distinguishes between obvious versus hidden and promotional versus derogatory use. Deception was neither the main driver nor the main impact vector of AI's use; instead, Indian creators sought to persuade rather than to deceive, often making AI's use obvious in order to reduce legal and reputational risks, while Taiwan's defenders saw deception as a subset of broader efforts to distort the prevalence of strategic narratives online. AI was useful and used, however, in producing efficiency gains in communicating across languages and modes, and in evading human and algorithmic detection. Security researchers should reconsider threat models to clearly differentiate deepfakes from promotional and obvious uses, to complement and bolster the social factors that constrain misuse by internal actors, and to counter efficiency gains globally.


CAR-BRAINet: Sub-6GHz Aided Spatial Adaptive Beam Prediction with Multi Head Attention for Heterogeneous Vehicular Networks

Menon, Aathira G, Krishnan, Prabu, Lal, Shyam

arXiv.org Artificial Intelligence

Heterogeneous Vehicular Networks (HetVNets) play a key role by stacking different communication technologies such as sub-6GHz, mm-wave and DSRC to meet diverse connectivity needs of 5G/B5G vehicular networks. HetVNet helps address the humongous user demands-but maintaining a steady connection in a highly mobile, real-world conditions remain a challenge. Though there has been ample of studies on beam prediction models a dedicated solution for HetVNets is sparsely explored. Hence, it is the need of the hour to develop a reliable beam prediction solution, specifically for HetVNets. This paper introduces a lightweight deep learning-based solution termed-"CAR-BRAINet" which consists of convolutional neural networks with a powerful multi-head attention (MHA) mechanism. Existing literature on beam prediction is largely studied under a limited, idealised vehicular scenario, often overlooking the real-time complexities and intricacies of vehicular networks. Therefore, this study aims to mimic the complexities of a real-time driving scenario by incorporating key factors such as prominent MAC protocols-3GPP-C-V2X and IEEE 802.11BD, the effect of Doppler shifts under high velocity and varying distance and SNR levels into three high-quality dynamic datasets pertaining to urban, rural and highway vehicular networks. CAR-BRAINet performs effectively across all the vehicular scenarios, demonstrating precise beam prediction with minimal beam overhead and a steady improvement of 17.9422% on the spectral efficiency over the existing methods. Thus, this study justifies the effectiveness of CAR-BRAINet in complex HetVNets, offering promising performance without relying on the location angle and antenna dimensions of the mobile users, and thereby reducing the redundant sensor-latency.


AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings

Vishesh, Om, Khadilkar, Harshad, Akkil, Deepak

arXiv.org Artificial Intelligence

Keeping pace with the rapid growth of academia literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time-consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, 'Agent-E', can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near perfect accuracy of 99.4%. This demonstration highlights the potential of task-oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community.


Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation

Xie, Hongyan, Yao, Yitong, Ban, Yikun, Huang, Zixuan, Wang, Deqing, Wu, Zhenhe, Su, Haoxiang, Wang, Chao, Song, Shuangyong

arXiv.org Artificial Intelligence

Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs' abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.


Spoken in Jest, Detected in Earnest: A Systematic Review of Sarcasm Recognition -- Multimodal Fusion, Challenges, and Future Prospects

Gao, Xiyuan, Nayak, Shekhar, Coler, Matt

arXiv.org Artificial Intelligence

Sarcasm, a common feature of human communication, poses challenges in interpersonal interactions and human-machine interactions. Linguistic research has highlighted the importance of prosodic cues, such as variations in pitch, speaking rate, and intonation, in conveying sarcastic intent. Although previous work has focused on text-based sarcasm detection, the role of speech data in recognizing sarcasm has been underexplored. Recent advancements in speech technology emphasize the growing importance of leveraging speech data for automatic sarcasm recognition, which can enhance social interactions for individuals with neurodegenerative conditions and improve machine understanding of complex human language use, leading to more nuanced interactions. This systematic review is the first to focus on speech-based sarcasm recognition, charting the evolution from unimodal to multimodal approaches. It covers datasets, feature extraction, and classification methods, and aims to bridge gaps across diverse research domains. The findings include limitations in datasets for sarcasm recognition in speech, the evolution of feature extraction techniques from traditional acoustic features to deep learning-based representations, and the progression of classification methods from unimodal approaches to multimodal fusion techniques. In so doing, we identify the need for greater emphasis on cross-cultural and multilingual sarcasm recognition, as well as the importance of addressing sarcasm as a multimodal phenomenon, rather than a text-based challenge.


LLMs and their Limited Theory of Mind: Evaluating Mental State Annotations in Situated Dialogue

Kowalyshyn, Katharine, Scheutz, Matthias

arXiv.org Artificial Intelligence

What if large language models could not only infer human mindsets but also expose every blind spot in team dialogue such as discrepancies in the team members' joint understanding? We present a novel, two-step framework that leverages large language models (LLMs) both as human-style annotators of team dialogues to track the team's shared mental models (SMMs) and as automated discrepancy detectors among individuals' mental states. In the first step, an LLM generates annotations by identifying SMM elements within task-oriented dialogues from the Cooperative Remote Search Task (CReST) corpus. Then, a secondary LLM compares these LLM-derived annotations and human annotations against gold-standard labels to detect and characterize divergences. We define an SMM coherence evaluation framework for this use case and apply it to six CReST dialogues, ultimately producing: (1) a dataset of human and LLM annotations; (2) a reproducible evaluation framework for SMM coherence; and (3) an empirical assessment of LLM-based discrepancy detection. Our results reveal that, although LLMs exhibit apparent coherence on straightforward natural-language annotation tasks, they systematically err in scenarios requiring spatial reasoning or disambiguation of prosodic cues.