AITopics | honesty

Collaborating Authors

honesty

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Alignment for Honesty

Neural Information Processing SystemsMar-21-2026, 04:11:09 GMT

Recent research has made significant strides in aligning large language models (LLMs) with helpfulness and harmlessness. In this paper, we argue for the importance of alignment for \emph{honesty}, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning an LLM's knowledge boundaries, which demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining ``honesty'' inspired by the Analects of Confucius. This serves as a cornerstone for developing metrics that effectively measure an LLM's honesty by quantifying its progress post-alignment. Furthermore, we introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics.

artificial intelligence, large language model, natural language, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

7428e6db752171d6b832c53b2ed297ab-Paper-Conference.pdf

Neural Information Processing SystemsFeb-15-2026, 21:25:03 GMT

First, we formalize the problem definition.Weintroducetheconceptof"Idon'tknow (idk) responses" and in this context, honesty necessitates that an aligned LLM provides idk responses for unknown questions and correct responses for known questions.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Mississippi (0.04)
North America > Canada > Ontario > Toronto (0.04)
Africa > Rwanda > Kigali > Kigali (0.04)
(8 more...)

Genre: Personal (0.93)

Industry:

Education (0.93)
Media (0.68)
Law (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Incentivizing Honesty among Competitors in Collaborative Learning and Optimization

Neural Information Processing SystemsFeb-8-2026, 09:17:18 GMT

However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations.

artificial intelligence, machine learning, social media, (20 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > California (0.04)
(2 more...)

Industry:

Leisure & Entertainment > Games (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Communications > Social Media (0.67)
(2 more...)

Add feedback

HonestLLM: TowardanHonestandHelpfulLarge LanguageModel

Neural Information Processing SystemsFeb-7-2026, 21:03:20 GMT

To mitigate this gap, we first refine and extend the definition of honesty inLLMs based onthedefinition proposed byAskelletal.[14],astheability

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: North America > United States (0.04)

Genre: Research Report (1.00)

Industry:

Media (1.00)
Transportation > Air (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Persuading Farsighted Receivers in MDPs: the Power of Honesty

Neural Information Processing SystemsDec-24-2025, 10:52:52 GMT

Bayesian persuasion studies the problem faced by an informed sender who strategically discloses information to influence the behavior of an uninformed receiver. Recently, a growing attention has been devoted to settings where the sender and the receiver interact sequentially, in which the receiver's decision-making problem is usually modeled as a Markov decision process (MDP). However, the literature focuses on computing optimal information-revelation policies (a.k.a.

farsighted receiver, markovian, name change, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.55)

Add feedback

Honesty Is the Best Policy: Defining and Mitigating AI Deception

Neural Information Processing SystemsDec-23-2025, 19:22:03 GMT

Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful).There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formaldefinition of deception in structural causal games, grounded in the philosophyliterature, and applicable to real-world machine learning systems.Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception.Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

best policy, deception, name change, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)

Add feedback

WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

Agarwal, Mrinal, Rana, Saad, Sundoro, Theo, Berhe, Hermela, Kim, Spencer, Sharma, Vasu, O'Brien, Sean, Zhu, Kevin

arXiv.org Artificial IntelligenceDec-11-2025

Deception is a fundamental challenge for multi-agent reasoning: effective systems must strategically conceal information while detecting misleading behavior in others. Yet most evaluations reduce deception to static classification, ignoring the interactive, adversarial, and longitudinal nature of real deceptive dynamics. Large language models (LLMs) can deceive convincingly but remain weak at detecting deception in peers. We present WOLF, a multi-agent social deduction benchmark based on Werewolf that enables separable measurement of deception production and detection. WOLF embeds role-grounded agents (Villager, Werewolf, Seer, Doctor) in a programmable LangGraph state machine with strict night-day cycles, debate turns, and majority voting. Every statement is a distinct analysis unit, with self-assessed honesty from speakers and peer-rated deceptiveness from others. Deception is categorized via a standardized taxonomy (omission, distortion, fabrication, misdirection), while suspicion scores are longitudinally smoothed to capture both immediate judgments and evolving trust dynamics. Structured logs preserve prompts, outputs, and state transitions for full reproducibility. Across 7,320 statements and 100 runs, Werewolves produce deceptive statements in 31% of turns, while peer detection achieves 71-73% precision with ~52% overall accuracy. Precision is higher for identifying Werewolves, though false positives occur against Villagers. Suspicion toward Werewolves rises from ~52% to over 60% across rounds, while suspicion toward Villagers and the Doctor stabilizes near 44-46%. This divergence shows that extended interaction improves recall against liars without compounding errors against truthful roles. WOLF moves deception evaluation beyond static datasets, offering a dynamic, controlled testbed for measuring deceptive and detective capacity in adversarial multi-agent interaction.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2512.09187

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Add feedback

Depth-Wise Activation Steering for Honest Language Models

Góral, Gracjan, Winkels, Marysia, Basart, Steven

arXiv.org Artificial IntelligenceDec-9-2025

Large language models sometimes assert falsehoods despite internally representing the correct answer, failures of honesty rather than accuracy, which undermines auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2512.07667

Country:

North America > United States (0.93)
Asia (0.93)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Xia, Wei, Deng, Zhi-Hong

arXiv.org Artificial IntelligenceNov-21-2025

With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.16324

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (0.46)

Technology: