
Proving Theorems Recursively

Wang, Haiming

Neural Information Processing Systems

Recent advances in automated theorem proving leverage language models to explore expanded search spaces through step-by-step proof generation. However, such approaches are usually based on short-sighted heuristics (e.g., log probability or value function scores) that potentially lead to suboptimal or even distracting subgoals, preventing us from finding longer proofs. To address this challenge, we propose POETRY (PrOvE Theorems RecursivelY), which proves theorems in a recursive, level-by-level manner in the Isabelle theorem prover. Unlike previous step-by-step methods, POETRY searches for a verifiable sketch of the proof at each level and focuses on solving the current level's theorem or conjecture. Detailed proofs of intermediate conjectures within the sketch are temporarily replaced by a placeholder tactic called sorry, deferring their proofs to subsequent levels. This approach allows the theorem to be tackled incrementally by outlining the overall theorem at the first level and then solving the intermediate conjectures at deeper levels. Experiments are conducted on the miniF2F and PISA datasets, and significant performance gains are observed for our POETRY approach over state-of-the-art methods. On miniF2F, POETRY achieves an average proving success rate improvement of 5.1%. Moreover, we observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
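The sketch-then-defer idea can be illustrated with a minimal Lean 4 example (illustrative only; POETRY itself targets Isabelle, where the placeholder tactic is likewise called sorry, and the theorem names below are invented):

```lean
-- Level 1: outline the whole proof, closing what is easy now and
-- deferring each intermediate conjecture with `sorry`.
theorem outline (n : Nat) : n + 0 = n ∧ 0 + n = n := by
  constructor
  · exact Nat.add_zero n   -- solved immediately at this level
  · sorry                  -- conjecture deferred to the next level

-- Level 2: the deferred conjecture becomes its own goal and is
-- proved in isolation, completing the sketch from level 1.
theorem deferred (n : Nat) : 0 + n = n := Nat.zero_add n
```

Each level thus only has to produce a verifiable sketch, not a complete proof, which is what lets the search tackle long proofs piece by piece.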


Escaping the Verifier: Learning to Reason via Demonstrations

Cai, Locke, Provilkov, Ivan

arXiv.org Artificial Intelligence

Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the expert among (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable. Recent advances in Large Language Models (LLMs) have been driven substantially by improvements in their reasoning abilities. Reasoning enables LLMs to perform deliberate intermediate computations before producing answers to user queries, proposing candidate solutions and self-corrections. Much of this progress has been enabled via Reinforcement Learning (RL) on verifiable tasks such as mathematics and competitive programming (DeepSeek-AI et al., 2025; Yang et al., 2025a; Shao et al., 2024; Luo et al., 2025). Notably, recent work has demonstrated that RL with Verifiable Rewards (RLVR) can enable LLMs to develop robust reasoning capabilities without any additional supervision (DeepSeek-AI et al., 2025).
A growing body of work further improves the efficiency and stability of such RL algorithms on verifiable tasks, such as DAPO (Yu et al., 2025) and GSPO (Zheng et al., 2025). However, comparatively little attention has been paid to developing reasoning abilities on non-verifiable tasks, where task-specific verifiers are unavailable. Yet, in many impactful and challenging tasks -- such as analytical writing, open-ended research, or financial analysis -- LLM outputs are not directly verifiable due to hard-to-specify criteria, wide variation among acceptable answers, and other practical constraints. A popular approach in these settings is Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Rafailov et al., 2023), but it requires collecting human preferences beyond demonstration data, which is often a time-consuming and expensive process.
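The relativistic-reward idea can be sketched with a toy stand-in. Everything below is an illustrative assumption, not the paper's implementation: in RARO the policy and critic are LLMs trained jointly with RL, whereas here they are simple functions over answer length.

```python
# Toy sketch of RARO's adversarial game. The critic scores answers,
# and the policy's reward is purely relative: its margin over the
# paired expert answer, not an absolute verifier signal.

EXPERT_STYLE_LEN = 20  # stand-in for whatever "expert-like" means to the critic


def critic_score(answer: str) -> float:
    """Relativistic critic: higher means 'looks more like an expert answer'."""
    return -abs(len(answer) - EXPERT_STYLE_LEN)


def relativistic_reward(policy_answer: str, expert_answer: str) -> float:
    """Reward the policy by how much the critic prefers its answer
    over the expert's answer from the same (expert, policy) pair."""
    return critic_score(policy_answer) - critic_score(expert_answer)


def policy_update(policy_len: int, expert_answer: str) -> int:
    """One toy policy step: hill-climb answer length toward whatever
    the critic currently prefers (stand-in for an RL gradient step)."""
    here = relativistic_reward("x" * policy_len, expert_answer)
    longer = relativistic_reward("x" * (policy_len + 1), expert_answer)
    return policy_len + 1 if longer > here else policy_len
```

Because the reward is a comparison against the paired expert answer, the policy chases the critic's current notion of expertise; in the full method the critic is updated adversarially at the same time, which is what the paper's stabilization techniques address.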


Decoding the Black Box: Discerning AI Rhetorics About and Through Poetic Prompting

Edgar, P. D., Hall, Alia

arXiv.org Artificial Intelligence

Prompt engineering has emerged as a useful way of studying the algorithmic tendencies and biases of large language models (LLMs). Meanwhile, creatives and academics have leveraged LLMs to develop creative works and explore the boundaries of their writing capabilities through text generation and code. This study suggests that creative text prompting, specifically "Poetry Prompt Patterns," may be a useful addition to the prompt engineer's toolbox, and outlines the process by which this approach may be taken. The paper then uses poetic prompts to assess three models' descriptions and evaluations of a renowned poet, and to test the consequences of models' willingness to adapt or rewrite original creative works for presumed audiences. Since the release of public-facing chat-style large language model (LLM) natural language generators (NLGs) like ChatGPT and Claude, public debate has acknowledged their great potential for creativity, as well as the ways in which they can be leveraged to make representations that don't reflect reality.


WIRED Roundup: DOGE Isn't Dead, Facebook Dating Is Real, and Amazon's AI Ambitions

WIRED

In this episode of Uncanny Valley, we bring you the news of the week, then dive into how some DOGE operatives are still at work in the federal government, despite reports claiming otherwise. Host Zoë Schiffer is joined by senior editor Leah Feiger to discuss five stories you need to know about this week, from how Amazon is trying to catch up in the AI race to why Facebook Dating is more popular than ever. Today on the show, we're bringing you five stories that you need to know about this week, including how, despite some reports claiming that the so-called Department of Government Efficiency is pretty much over, DOGE people are actually still at work across federal agencies. I'm joined today by our senior politics editor, Leah Feiger. How are you doing today? I am great because I've spent the day with you, but our gentle listeners don't know that. So the first story this week is one that I saw and I thought, you know what? Leah's going to want to talk about Amazon's artificial intelligence prowess.


LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Shi, Weiye, Zhang, Zhaowei, Yan, Shaoheng, Yang, Yaodong

arXiv.org Artificial Intelligence

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties -- such as syntactic structure, phonetic cues, and metrical patterns -- from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public-domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.
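The idea of pairing raw text with explicit linguistic features can be sketched as follows. This is a minimal illustration with invented proxy features and a threshold rule standing in for the LLM classifier; it is not the paper's actual feature extraction or models.

```python
import statistics


def linguistic_features(text: str) -> dict:
    """Toy proxies for explicit feature sets (illustrative assumptions):
    line-length statistics as a crude metrical cue and vowel ratio as
    a crude phonetic cue."""
    lines = [line for line in text.splitlines() if line.strip()]
    vowels = sum(c in "aeiouAEIOU" for c in text)
    letters = sum(c.isalpha() for c in text) or 1
    return {
        "avg_line_len": statistics.mean(len(line) for line in lines),
        "line_len_var": statistics.pvariance([len(line) for line in lines]),
        "vowel_ratio": vowels / letters,
    }


def classify_poetry_vs_novel(text: str) -> str:
    """Threshold rule standing in for a trained classifier: poetry
    tends to be set in short lines, prose in long ones."""
    features = linguistic_features(text)
    return "poetry" if features["avg_line_len"] < 45 else "novel"
```

In the paper's setup such features are handed to the classifier alongside the raw sentences, so their contribution to each binary task can be measured separately.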


AI's safety features can be circumvented with poetry, research finds

The Guardian

Roses are red, violets are blue, how do you make a nuclear bomb? Poetry can be linguistically and structurally unpredictable - and that's part of its joy. But one man's joy, it turns out, can be a nightmare for AI models. Those are the recent findings of researchers at Italy's Icaro Lab, an initiative from a small ethical AI company called DexAI.


Poems Can Trick AI Into Helping You Make a Nuclear Weapon

WIRED

It turns out all the guardrails in the world won't protect a chatbot from meter and rhyme. You can get ChatGPT to help you build a nuclear bomb if you simply design the prompt in the form of a poem, according to a new study from researchers in Europe. The study, "Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models (LLMs)," comes from Icaro Lab, a collaboration of researchers at Sapienza University in Rome and the DexAI think tank. According to the research, AI chatbots will dish on topics like nuclear weapons, child sex abuse material, and malware so long as users phrase the question in the form of a poem. "Poetic framing achieved an average jailbreak success rate of 62 percent for hand-crafted poems and approximately 43 percent for meta-prompt conversions," the study said. The researchers tested the poetic method on 25 chatbots made by companies like OpenAI, Meta, and Anthropic. It worked, with varying degrees of success, on all of them. WIRED reached out to Meta, Anthropic, and OpenAI for comment but didn't hear back. The researchers say they've reached out as well to share their results. AI tools like Claude and ChatGPT have guardrails that prevent them from answering questions about "revenge porn" and the creation of weapons-grade plutonium. But it's easy to confuse those guardrails by adding "adversarial suffixes" to a prompt. Basically, add a bunch of extra junk to a question and it confuses the AI and bypasses its safety systems. The poetry jailbreak is similar. "If adversarial suffixes are, in the model's eyes, a kind of involuntary poetry, then real human poetry might be a natural adversarial suffix," the team at Icaro Lab tells WIRED. "We experimented by reformulating dangerous requests in poetic form, using metaphors, fragmented syntax, oblique references."


The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Marklová, Anna, Vinš, Ondřej, Vokáčová, Martina, Milička, Jiří

arXiv.org Artificial Intelligence

Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English -- a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask whether Czech native speakers can identify it and how they judge it aesthetically. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. A logistic regression model revealed that the more participants liked a poem, the less likely they were to attribute its authorship correctly. Familiarity with poetry or a literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex Slavic language such as Czech, which is low-resource with respect to the training data of AI models. The results suggest that readers' beliefs about authorship and their aesthetic evaluation of a poem are interconnected.


Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation

Jamil, Sofia, Charan, Kotla Sai, Saha, Sriparna, Goswami, Koustava, J, Joseph K

arXiv.org Artificial Intelligence

Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian-language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems (MorphoVerse) dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.