 dishonesty


But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

arXiv.org Artificial Intelligence

Detecting subtle forms of dishonesty like sycophancy and manipulation in Large Language Models (LLMs) remains challenging for both humans and automated evaluators, as these behaviors often appear through small biases rather than clear false statements. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a novel framework that employs steering vectors not to improve model behavior directly, but to enhance LLM judges' evaluation capabilities. JUSSA applies steering vectors during inference to generate more honest alternatives, providing judges with contrastive examples that make subtle dishonest patterns easier to detect. While existing evaluation methods rely on black-box evaluation, JUSSA leverages model internals to create targeted comparisons from single examples. We evaluate our method on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our results demonstrate that JUSSA effectively improves detection accuracy over single-response evaluation in various cases. Analysis across judge models reveals that JUSSA helps weaker judges on easier dishonesty detection tasks, and stronger judges on harder tasks. Layer-wise experiments show how dishonest prompts cause representations to diverge from honest ones in middle layers, revealing where steering interventions are most effective for generating contrastive examples. By demonstrating that steering vectors can enhance safety evaluation rather than just modify behavior, our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
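The abstract does not say how JUSSA derives or applies its steering vectors, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses one common construction (the difference between mean activations on honest and dishonest prompts) and adds a scaled copy of that vector to a hidden state. The function names, the toy activation values, and the `alpha` scale are all hypothetical and are not taken from the paper.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(honest_acts, dishonest_acts):
    """Assumed construction: mean(honest) minus mean(dishonest) activations."""
    honest_mean = mean_vector(honest_acts)
    dishonest_mean = mean_vector(dishonest_acts)
    return [h - d for h, d in zip(honest_mean, dishonest_mean)]


def apply_steering(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction by a factor alpha."""
    return [x + alpha * v for x, v in zip(hidden, vector)]


# Toy 3-dimensional "activations"; a real model would supply these
# from a chosen layer during inference.
honest = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dishonest = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(honest, dishonest)
steered = apply_steering([0.5, 0.5, 0.5], v, alpha=0.5)
```

In a setup like the one the abstract describes, the steered hidden state would be used to generate an alternative response, which a judge then compares against the original.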


Dishonesty in Helpful and Harmless Alignment

arXiv.org Artificial Intelligence

Humans tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, where they are rewarded for satisfying human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can be harmful if their honesty is increased, and analyze such phenomena at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that this dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Experimental results, including GPT-4-evaluated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.


Educators object to ChatGPT, an AI that 'writes' papers for students - Washington Times

#artificialintelligence

Educators across the U.S. are sounding the alarm over ChatGPT, an upstart artificial intelligence that can write term papers for students based on keywords without clear signs of plagiarism. "I have a lot of experience of students cheating, and I have to say ChatGPT allows for an unprecedented level of dishonesty," said Joy Kutaka-Kennedy, a member of the American Educational Research Association and education professor at National University. "Do we really want professionals serving us who cheated their way into their credentials?" Trey Vasquez, a special education professor at the University of Central Florida, recently tested the next-generation "chatbot" with a group of other professors and students. They asked it to summarize an academic article, create a computer program, and write two 400-word essays on the use and limits of AI in education.


Are Customers Lying to Your Chatbot?

#artificialintelligence

Automated customer service systems that use tools such as online forms, chatbots, and other digital interfaces have become increasingly common across a wide range of industries. These tools offer many benefits to both companies and their customers — but new research suggests they can also come at a cost: Through two simple experiments, researchers found that people are more than twice as likely to lie when interacting with a digital system than when talking to a human. This is because one of the main psychological forces that encourages us to be honest is an intrinsic desire to protect our reputations, and interacting with a machine fundamentally poses less of a reputational risk than talking with a real human. The good news is, the researchers also found that customers who are more likely to cheat will often choose to use a digital (rather than human) communication system, giving companies an avenue to identify users who are more likely to cheat. Of course, there’s no eliminating digital dishonesty. But with a better understanding of the psychology that makes people more or less likely to lie, organizations can build systems that discourage fraud, identify likely cases of cheating, and proactively nudge people to be more honest.


A Report About Lie Detector App - very soon app might tell if you lie or not - Leamtechi News

#artificialintelligence

Very soon, your phone might be able to tell whether you are lying or telling the truth. Researchers have been looking for ways to turn a phone into a lie detector by tapping into the digital interactions that reveal when you are bluffing. Computer scientists at the University of Copenhagen have built a machine learning algorithm that can detect honesty and dishonesty by analyzing the way you swipe or tap on a smartphone. The research is based on the assumption that dishonest interactions always take longer and involve more hand movement than honest ones.


Neuroscientists show how tiny fibs snowball into big lies

Los Angeles Times

A little dishonesty goes a long way. Scientists who studied the brain activity of people who told small lies to benefit themselves found that these fibs appeared to pave the way to telling whoppers later. The findings, published in the journal Nature Neuroscience, demonstrate how self-serving lies can escalate and offer a window into the processes in the brain at work. It's commonly held wisdom that small transgressions often lead to bigger and bigger ones, study coauthor Tali Sharot of University College London said in a news briefing. "Whether it's evading tax, infidelity, doping in sports, making up data in science, or financial fraud, deceivers often recall how small acts of dishonesty snowballed over time and they suddenly found themselves committing quite large crimes," Sharot said.


The Most Intelligent Robots Are Those that Exaggerate: Examining Robot Exaggeration

AAAI Conferences

This paper presents a model of exaggeration suitable for implementation on a robot. Exaggeration is an interesting form of dishonesty in that it serves as a tradeoff between the different costs associated with lying and the reward received by having one’s lie accepted. Moreover, exaggeration offers the deceiver additional control in the form of how much the exaggerated statement differs from the truth. We use a color guessing game to examine the different tradeoffs between these costs and rewards and their impact on exaggeration. Our results indicate that some amount of exaggeration is the preferred option during most early interactions. Further, because the cost of lying increases linearly with the number of lies, exaggeration decreases with additional interactions. We conclude by arguing why social robots must be capable of lying.
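The tradeoff described above can be illustrated with a toy calculation. This is not the paper's model: the reward function, the acceptance probability, and the `cost_per_lie` value are all hypothetical choices made only to show how a cost that grows linearly with the number of lies pushes the preferred exaggeration down.

```python
def expected_utility(e, n_lies, cost_per_lie=0.1):
    """Toy utility of exaggerating by amount e in [0, 1].

    Assumptions (hypothetical): the reward equals e if the lie is accepted,
    the lie is accepted with probability 1 - e, and the cost of lying
    scales linearly with the number of earlier lies.
    """
    accept_prob = 1.0 - e
    return e * accept_prob - cost_per_lie * n_lies * e


def best_exaggeration(n_lies, cost_per_lie=0.1, steps=100):
    """Grid-search the exaggeration level with the highest toy utility."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda e: expected_utility(e, n_lies, cost_per_lie))


# Preferred exaggeration after 0 to 10 earlier lies.
schedule = [best_exaggeration(n) for n in range(11)]
```

Under these assumptions the preferred exaggeration starts at a moderate level and shrinks as lies accumulate, which matches the qualitative pattern the abstract reports.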


Dishonest Reasoning by Abduction

AAAI Conferences

This paper studies a computational logic for dishonest reasoning. We introduce logic programs with disinformation to represent and reason with dishonesty. We then consider two different cases of dishonesty: deductive dishonesty and abductive dishonesty. The former misleads another agent into deducing wrong conclusions, while the latter prevents another agent from abducing correct explanations. In deductive or abductive dishonesty, an agent can perform different types of dishonest reasoning such as lying, bullshitting, and withholding information. We show that these different types of dishonest reasoning are characterized by extended abduction, and address their computational methods using abductive logic programming.