Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

Singla, Pratham, Garg, Shivank, Singh, Ayush, Garg, Ishan, Saichandran, Ketan Suhaas

arXiv.org Artificial Intelligence

Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.


Can Large Language Models Express Uncertainty Like Human?

Tao, Linwei, Yeh, Yi-Fan, Kai, Bo, Dong, Minjing, Huang, Tao, Lamb, Tom A., Yu, Jialin, Torr, Philip H. S., Xu, Chang

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we 1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and 2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we 3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we 4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction. The code and dataset are anonymously available at https://anonymous.

Large language models (LLMs) are increasingly deployed in real-world applications, from education and healthcare to law and scientific discovery. While their capabilities make them powerful assistants, LLMs are also prone to hallucinations and factual errors, and human overreliance on their outputs can lead to serious consequences. For instance, a U.S. lawyer once submitted fabricated cases generated by ChatGPT, resulting in professional sanctions (ABC News, 2023).
Recent social experiments demonstrate that people adjust their reliance on AI depending on how confident the model appears: reliable expressions of uncertainty can enhance trust, satisfaction, and task accuracy (Kim et al., 2024; Xu et al., 2025). These findings highlight the importance of associating reliable uncertainty estimates with LLM responses to support human decision-making. Ultimately, the conveyance of confidence plays a central role in shaping trust and guiding human-AI interaction. A growing body of work explores the extraction and representation of confidence in LLM outputs. Logit-based methods are simple and inexpensive but require access to model logits, which are typically unavailable in commercial LLM APIs. Verbalized numerical scores avoid this requirement, but such scores rarely align with common user behavior or natural communication, as users do not typically phrase queries with explicit instructions like "Please output your confidence along with the answer."
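The hedge-to-confidence mapper described in the abstract can be illustrated with a minimal sketch. The hedge words and scores below are hypothetical stand-ins, not the paper's released dataset or trained mapper; the point is only to show the shape of the idea: scan a response for hedging language and map it to a numeric confidence.

```python
import re

# Hypothetical hedge-to-confidence table; the paper's dataset uses
# human-annotated scores, which these illustrative values are not.
HEDGE_SCORES = {
    "definitely": 0.95,
    "probably": 0.75,
    "likely": 0.70,
    "possibly": 0.45,
    "might": 0.40,
    "unlikely": 0.20,
}

def linguistic_confidence(answer: str, default: float = 0.5) -> float:
    """Return the confidence implied by the strongest hedge found.

    Word-boundary matching keeps "likely" from firing inside
    "unlikely"; max() is a simple tie-break when several hedges appear.
    """
    text = answer.lower()
    found = [score for hedge, score in HEDGE_SCORES.items()
             if re.search(rf"\b{hedge}\b", text)]
    return max(found) if found else default

print(linguistic_confidence("It is probably Paris."))  # 0.75
print(linguistic_confidence("The answer is 42."))      # 0.5 (no hedge)
```

Unlike logit-based scoring, a lookup of this kind needs only the response text, which is why the paper frames linguistic confidence as viable for closed commercial APIs.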


We Politely Insist: Your LLM Must Learn the Persian Art of Taarof

Sadr, Nikta Gohari, Heidariasl, Sahar, Megerdoomian, Karine, Seyyed-Kalantari, Laleh, Emami, Ali

arXiv.org Artificial Intelligence

Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies between interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated "polite" by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve improvements of 21.8% and 42.3% in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) establishes baselines across varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.


Exploring Effective Strategies for Building a Customised GPT Agent for Coding Classroom Dialogues

Bai, Luwei, Han, Dongkeun, Hennessy, Sara

arXiv.org Artificial Intelligence

This study investigates effective strategies for developing a customised GPT agent to code classroom dialogue. While classroom dialogue is widely recognised as a crucial element of education, its analysis remains challenging due to the need for a nuanced understanding of dialogic functions and the labour-intensive nature of manual transcript coding. Recent advancements in large language models offer promising avenues for automating this process. However, existing studies predominantly focus on training large-scale models or evaluating pre-trained models with fixed codebooks, which are often not applicable or replicable for dialogue researchers working with small datasets or customised coding schemes. Using GPT-4's MyGPT agent as a case, this study evaluates its baseline performance in coding classroom dialogue with a human codebook and examines how performance varies with different example inputs through a variable control method. Through a design-based research approach, it identifies a set of practical strategies, based on MyGPT's unique features, for configuring effective agents with limited data. The findings suggest that, despite some limitations, a MyGPT agent developed with these strategies can serve as a useful coding assistant by generating coding suggestions.


Donald Trump Held Another Million-Dollar 'Candlelight' Dinner--With Elon Musk in Tow

WIRED

An invitation to a "candlelight" dinner held this past Saturday at President Donald Trump's Mar-a-Lago club asked prospective guests to spend $1 million per seat. Trump attended the dinner along with Elon Musk, according to multiple photographs and videos of the event viewed by WIRED. Elon Musk, wearing his standard uniform of a black sport coat over a black T-shirt, was seen shaking hands and waving to other attendees. He was with a woman wearing a floor-length gown who appeared to be Shivon Zilis, according to Instagram Reels posted by multiple guests. Zilis, a Neuralink executive who previously sat on the board of OpenAI, is the mother of four of Musk's 14 known children.


Enhanced Classroom Dialogue Sequences Analysis with a Hybrid AI Agent: Merging Expert Rule-Base with Large Language Models

Long, Yun, Zhang, Yu

arXiv.org Artificial Intelligence

Classroom dialogue plays a crucial role in fostering student engagement and deeper learning. However, analysing dialogue sequences has traditionally relied on either theoretical frameworks or empirical descriptions of practice, with limited integration between the two. This study addresses this gap by developing a comprehensive rule base of dialogue sequences and an Artificial Intelligence (AI) agent that combines expert-informed rule-based systems with a large language model (LLM). The agent applies expert knowledge while adapting to the complexities of natural language, enabling accurate and flexible categorisation of classroom dialogue sequences. By synthesising findings from over 30 studies, we established a comprehensive framework for dialogue analysis. The agent was validated against human expert coding, achieving high levels of precision and reliability. The results demonstrate that the agent provides theory-grounded and adaptive functions, tremendously enhancing the efficiency and scalability of classroom dialogue analysis, offering significant potential in improving classroom teaching practices and supporting teacher professional development.
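The hybrid architecture the abstract describes, an expert rule base backed by an LLM for utterances the rules cannot handle, can be sketched as follows. The cue phrases, category labels, and the `llm_classify` callable are all hypothetical illustrations, not the authors' actual rule base or implementation.

```python
# Hypothetical cue-phrase rules; the paper's rule base is synthesised
# from over 30 studies and is far richer than this toy table.
RULES = {
    "why": "elaboration_invitation",
    "what do you think": "opinion_invitation",
    "agree": "agreement",
}

def rule_based_code(utterance: str):
    """Return a dialogue-sequence code via simple substring cues, or None."""
    text = utterance.lower()
    for cue, code in RULES.items():
        if cue in text:
            return code
    return None

def hybrid_code(utterance: str, llm_classify=None) -> str:
    """Try deterministic rules first; fall back to an LLM classifier.

    `llm_classify` stands in for any callable that maps an utterance
    to a code (e.g. a prompted LLM call); it is stubbed out here.
    """
    code = rule_based_code(utterance)
    if code is not None:
        return code
    return llm_classify(utterance) if llm_classify else "uncoded"

print(hybrid_code("Why do you say that?"))  # elaboration_invitation
```

The design choice mirrors the paper's rationale: rules keep the frequent, clear-cut cases cheap and theory-grounded, while the LLM absorbs the long tail of natural-language variation the rules cannot enumerate.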


The price of love: how much does dating cost – and who pays the bill?

The Guardian

Putting yourself out there always comes at a cost: you have to be vulnerable, open yourself up and risk rejection. These days it can also come with a hefty price tag. It's not just the cost of drinks or dinner to consider. Before you've even got to the awkward, age-old dance of who is going to foot the bill, you might have already forked out hundreds of pounds on a dating site to be in with a shot at a date. While some dating services are free, many now include tempting extra features that they claim will help you find more compatible connections, get noticed sooner and go on more dates.


Zelenskyy says 'victory plan' to end Russia war includes NATO membership

Al Jazeera

Ukrainian President Volodymyr Zelenskyy says his "victory plan" to end the war with Russia includes requests for specific weapons and an "unconditional" invitation to join NATO now. "If we start moving according to this victory plan now, it may be possible to end the war no later than next year," Zelenskyy said on Wednesday in a speech to the Verkhovna Rada, Ukraine's parliament. The first point, he told lawmakers, was receiving an "unconditional invitation" to join the military alliance, which would show "how our partners truly see Ukraine's place in the security architecture". The Ukrainian leader recently concluded a whirlwind tour of several European capitals, trying to win approval for the five-point plan from Western partners, which have so far stopped short of publicly voicing their support for it. "Regardless of what [Russian President Vladimir] Putin wants, we must all change the circumstances so that Russia is forced to peace," he told parliament of the proposal, which also includes military, political and economic elements. Ukraine's defences must be bolstered against Russian missile and drone attacks, he said, reiterating a call for his country's allies to lift restrictions on Ukraine's use of long-range arms against military targets in Russia.


iPhone 16 release date is LEAKED online - and it suggests there's not long to wait to see Apple's next flagship

Daily Mail - Science & tech

Apple fans might not have to wait much longer to see the company's new flagship smartphone, the iPhone 16. The California tech giant will unveil the latest generation of iPhones at an in-person event on September 10, according to an alleged online leak. A serial Apple leaker known as Majin Bu shared a screenshot on X, formerly Twitter, which claims to show the invite to Apple's September special event. The colour of the Apple logo in the invitation also nods to the possibility that fans might be getting a new 'bronze' colour for the titanium smartphone. However, social media commenters have been sceptical of the leak's authenticity, and even Majin Bu himself says: 'I have no way of verifying that this information is real, but it all seems very plausible considering the latest news.'


Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level

Wang, Chenxu, Dai, Bin, Liu, Huaping, Wang, Baoyuan

arXiv.org Artificial Intelligence

Prominent large language models have exhibited human-level performance in many domains, even enabling the derived agents to simulate human and social interactions. While practical works have substantiated the practicability of grounding language agents in sandbox simulation or embodied simulators, current social intelligence benchmarks either stay at the language level or use subjective metrics. In pursuit of a more realistic and objective evaluation, we introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which assesses language agents \textbf{objectively} at the \textbf{action level} by scrutinizing the goal achievements within the multi-agent simulation. Additionally, we sample conversation scenarios to build a language-level benchmark to provide an economically prudent preliminary evaluation and align with prevailing benchmarks. To gauge the significance of agent architecture, we implement a target-driven planning (TDP) module as an adjunct to the existing agent. Our evaluative findings highlight that the STSS benchmark is challenging for state-of-the-art language agents. Furthermore, it effectively discriminates between distinct language agents, suggesting its usefulness as a benchmark for evaluating both language models and agent architectures.