Concept Incongruence: An Exploration of Time and Death in Role Playing

arXiv.org Artificial Intelligence

Consider this prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics--abstention rate, conditional accuracy, and answer rate--to quantify model behavior under incongruence due to the role's death. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the "death" state across different years, leading to unsatisfactory abstention behavior, and (ii) role playing causes shifts in the model's temporal representations, resulting in accuracy drops. We leverage these insights to improve consistency in the model's abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.
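The three behavioral metrics named above can be illustrated with a minimal sketch. The abstract does not give their formal definitions, so the ones used here are assumptions: abstention rate is the fraction of responses that abstain, answer rate is the fraction that give an answer, and conditional accuracy is the accuracy among responses that answered. The `Response` record and `behavioral_metrics` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Response:
    """One labeled model response (hypothetical record, not from the paper)."""
    abstained: bool  # the model declined, e.g. by noting the role has died
    answered: bool   # the model gave a concrete answer (may co-occur with abstaining)
    correct: bool    # only meaningful when answered is True


def behavioral_metrics(responses: List[Response]) -> Dict[str, Optional[float]]:
    """Compute the three rates under the assumed definitions described above."""
    if not responses:
        raise ValueError("need at least one response")
    total = len(responses)
    answered = [r for r in responses if r.answered]
    n_abstained = sum(1 for r in responses if r.abstained)
    n_correct = sum(1 for r in answered if r.correct)
    return {
        "abstention_rate": n_abstained / total,
        "answer_rate": len(answered) / total,
        # Accuracy is conditioned on answering; undefined if nothing was answered.
        "conditional_accuracy": n_correct / len(answered) if answered else None,
    }
```

Abstaining and answering are kept as separate flags so that a response that does both is counted in both rates; whether the paper treats them this way is not stated in the abstract.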


WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

arXiv.org Artificial Intelligence

Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
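The final step described above, mapping an aggregated score to a percentile rank against human-authored works, can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: it assumes each work has already been reduced to a single scalar score (the PCA aggregation is not shown), and `percentile_rank` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_rank(score: float, reference_scores: Sequence[float]) -> float:
    """Percentage of reference scores that are less than or equal to `score`.

    `reference_scores` stands in for the aggregated scores of human-authored
    works (an assumption made for this sketch).
    """
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    # bisect_right counts how many reference scores are <= score.
    return 100.0 * bisect_right(ordered, score) / len(ordered)
```

For example, `percentile_rank(0.5, [0.1, 0.2, 0.6, 0.9])` counts two of four reference scores at or below 0.5. Ties are counted as "at or below" here; other tie conventions are equally plausible.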


Washington Post urges Congress act to prevent another cover-up of president's health amid Biden revelations

FOX News

CNN host Jake Tapper told Joe Scarborough during a Wednesday conversation on "Morning Joe" that former President Biden made an effort to convince the MSNBC host that he was fit to run for re-election. The Washington Post editorial board called for more oversight of the Oval Office on Wednesday to ensure a cover-up of the president's health doesn't happen again following revelations in a bombshell book alleging the White House hid former President Joe Biden's decline from the public. "It now seems that, for a considerable time, Biden might have lacked the stamina and cognitive capacity the job demands -- and that his family and closest aides concealed this from the public," the paper's editorial board wrote. "Their apparent decision to put personal loyalties ahead of their duty to the country must be reckoned with. A legal mechanism should be considered to ensure that this doesn't happen again," the board proposed.


Cannes Is Rolling Out the Red Carpet for One of This Century's Most Controversial Figures

Slate

Although the Cannes Film Festival is the world's most prestigious movie showcase, its spotlight rarely falls on nonfiction film. Years go by without a single documentary competing for its biggest honor, the Palme d'Or, and there is no separate documentary prize. Juliette Binoche, the president of this year's jury, devoted part of her opening-night remarks to Fatma Hassona, the Palestinian photojournalist who was killed in an Israeli airstrike the day after it was announced that her documentary Put Your Soul on Your Hand and Walk would be premiering at Cannes. But the film itself was slotted into a low-profile sidebar devoted to independent productions. The festival did, however, roll out the red carpet for The Six Billion Dollar Man, Eugene Jarecki's portrait of WikiLeaks founder Julian Assange, which premiered out of competition on Wednesday evening.


I Talked to the Writer Who Got Caught Publishing ChatGPT-Written Slop. I Get Why He Did It.

Slate

Over the past week, at least two venerable American newspapers--the Chicago Sun-Times and the Philadelphia Inquirer--published a 56-page insert of summer content that was in large part produced by A.I. The most glaring evidence was a now-notorious "summer reading list," which recommended 15 books, five of them real, 10 of them imaginary, with summaries of fake titles like Isabel Allende's Tidewater Dreams, Min Jin Lee's Nightshade Market, Rebecca Makkai's Boiling Point, and Percival Everett's The Rainmakers. The authors exist; the books do not. The rest of the section, which included anodyne listicles about summer activities, barbecuing, and photography, soon attracted additional scrutiny.


Source framing triggers systematic evaluation bias in Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used not only to generate text but also to evaluate it, raising urgent questions about whether their judgments are consistent, unbiased, and robust to framing effects. In this study, we systematically examine inter- and intra-model agreement across four state-of-the-art LLMs (OpenAI o3-mini, Deepseek Reasoner, xAI Grok 2, and Mistral) tasked with evaluating 4,800 narrative statements on 24 different topics of social, political, and public health relevance, for a total of 192,000 assessments. We manipulate the disclosed source of each statement to assess how attribution to either another LLM or a human author of specified nationality affects evaluation outcomes. We find that, in the blind condition, different LLMs display a remarkably high degree of inter- and intra-model agreement across topics. However, this alignment breaks down when source framing is introduced. Here we show that attributing statements to Chinese individuals systematically lowers agreement scores across all models, and in particular for Deepseek Reasoner. Our findings reveal that framing effects can deeply affect text evaluation, with significant implications for the integrity, neutrality, and fairness of LLM-mediated information systems.


HyPerAlign: Interpretable Personalized LLM Alignment via Hypothesis Generation

arXiv.org Artificial Intelligence

Alignment algorithms are widely used to align large language models (LLMs) to human users based on preference annotations. Typically these (often divergent) preferences are aggregated over a diverse set of users, resulting in fine-tuned models that are aligned to the "average-user" preference. Nevertheless, current models are used by individual users in very specific contexts and situations, emphasizing the need for user-dependent preference control. In this work we address the problem of personalizing LLM outputs to their users. We aim to generate customized responses tailored to specific individuals instead of generic outputs that emulate the collective voices of diverse populations. We propose HyPerAlign, an interpretable and sample-efficient hypothesis-driven personalization approach for LLM models. Given few-shot examples written by a particular user, we first infer hypotheses about their communication strategies, personality, and writing style, then prompt LLM models with these hypotheses and user-specific attributes to generate customized outputs. We conduct experiments on two different personalization tasks, namely authorship attribution and deliberative alignment, with datasets from diverse domains (news articles, blog posts, emails, jailbreaking benchmarks). Results demonstrate the superiority of hypothesis-driven LLM personalization compared to preference-based fine-tuning methods. For authorship attribution, HyPerAlign generations have consistently high win-rates (commonly > 90%) against state-of-the-art preference fine-tuning approaches across diverse user profiles and LLM models. For deliberative alignment, the helpfulness of LLM models is improved by up to 70% on average. Overall, HyPerAlign represents an interpretable and sample-efficient strategy for the personalization of LLM models to individual users.


Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

arXiv.org Artificial Intelligence

While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply it in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time to enhance a model's ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.


PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs

arXiv.org Artificial Intelligence

Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.


Unraveling Interwoven Roles of Large Language Models in Authorship Privacy: Obfuscation, Mimicking, and Verification

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have been fueled by large-scale training corpora drawn from diverse sources such as websites, news articles, and books. These datasets often contain explicit user information, such as person names and addresses, that LLMs may unintentionally reproduce in their generated outputs. Beyond such explicit content, LLMs can also leak identity-revealing cues through implicit signals such as distinctive writing styles, raising significant concerns about authorship privacy. There are three major automated tasks in authorship privacy, namely authorship obfuscation (AO), authorship mimicking (AM), and authorship verification (AV). Prior research has studied AO, AM, and AV independently. However, their interplay remains underexplored, which leaves a major research gap, especially in the era of LLMs, where they are profoundly shaping how we curate and share user-generated content, and the distinction between machine-generated and human-authored text is increasingly blurred. This work presents the first unified framework for analyzing the dynamic relationships among LLM-enabled AO, AM, and AV in the context of authorship privacy. We quantify how they interact with each other to transform human-authored text, examining effects at a single point in time and iteratively over time. We also examine the role of demographic metadata, such as gender and academic background, in modulating their performance, inter-task dynamics, and privacy risks. All source code will be publicly available.