

The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners

Trencsenyi, Vince, Mensfelt, Agnieszka, Stathis, Kostas

arXiv.org Artificial Intelligence

The rapid rise of large language models (LLMs) has shifted artificial intelligence (AI) research toward agentic systems, motivating the use of weaker and more flexible notions of agency. However, this shift raises key questions about the extent to which LLM-based agents replicate human strategic reasoning, particularly in game-theoretic settings. In this context, we examine the role of agentic sophistication in shaping artificial reasoners' performance by evaluating three agent designs: a simple game-theoretic model, an unstructured LLM-as-agent model, and an LLM integrated into a traditional agentic framework. Using guessing games as a testbed, we benchmarked these agents against human participants across general reasoning patterns and individual role-based objectives. Furthermore, we introduced obfuscated game scenarios to assess agents' ability to generalise beyond training distributions. Our analysis, covering over 2000 reasoning samples across 25 agent configurations, shows that human-inspired cognitive structures can enhance LLM agents' alignment with human strategic behaviour. Still, the relationship between agentic design complexity and human-likeness is non-linear, highlighting a critical dependence on underlying LLM capabilities and suggesting limits to simple architectural augmentation.
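The abstract does not spell out the game-theoretic baseline, but guessing games of this kind are conventionally modeled with level-k reasoning in a p-beauty contest. The sketch below is our own illustrative rendering under that assumption; the midpoint anchor of 50 and p = 2/3 are conventional defaults, not the paper's configuration.

```python
# A sketch of level-k reasoning in a p-beauty-contest guessing game:
# level-0 anchors on the midpoint of [0, 100]; level-k best-responds to a
# population of level-(k-1) players by guessing p times their guess.
# The anchor (50) and p (2/3) are assumed defaults, not the paper's setup.

def level_k_guess(k: int, p: float = 2 / 3, anchor: float = 50.0) -> float:
    """Return the guess of a level-k reasoner."""
    guess = anchor
    for _ in range(k):
        guess *= p  # best response to a population of level-(k-1) players
    return guess

if __name__ == "__main__":
    for k in range(4):
        print(f"level-{k} guess: {level_k_guess(k):.2f}")
    # level-0: 50.00, level-1: 33.33, level-2: 22.22, level-3: 14.81
```

Comparing such fixed-depth guesses against LLM-agent and human guesses is one natural way to read the paper's human-alignment benchmarking.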


Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

Noever, David, McKee, Forrest

arXiv.org Artificial Intelligence

This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. It presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price of about $250; average $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M in total), but our framework simplifies evaluation by using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (the sum of the prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks and the true complexity of real-world freelance jobs.
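The scoring described in the abstract is straightforward to reproduce. A minimal sketch, assuming a simple per-task record of test-case results; the field names below are illustrative, not the authors' schema:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    price_usd: float  # the task's estimated price tag
    passed: int       # test cases the generated solution passed
    total: int        # test cases attached to the task

def score(results: list[TaskResult]) -> dict[str, float]:
    # A task counts as solved only when every one of its test cases passes.
    solved = [r for r in results if r.passed == r.total]
    return {
        "task_success_rate": len(solved) / len(results),
        "test_case_pass_rate": (sum(r.passed for r in results)
                                / sum(r.total for r in results)),
        "freelance_earnings_usd": sum(r.price_usd for r in solved),
    }

if __name__ == "__main__":
    demo = [TaskResult(250.0, 5, 5), TaskResult(306.0, 3, 5), TaskResult(120.0, 4, 4)]
    print(score(demo))  # 2/3 tasks solved; earnings = 250 + 120 = 370
```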


LLM Agents Are Hypersensitive to Nudges

Cherep, Manuel, Maes, Pattie, Singh, Nikhil

arXiv.org Artificial Intelligence

LLMs are being set loose in complex, real-world environments that involve sequential decision-making and tool use. Often, this means making choices on behalf of human users. However, little is known about the distribution of such choices and how susceptible they are to different choice architectures. We perform a case study with several such LLMs on a multi-attribute tabular decision-making problem, under canonical nudges such as the default option, suggestions, and information highlighting, as well as additional prompting strategies. We show that, despite superficial similarities to human choice distributions, these models differ in subtle but important ways. First, they show much higher susceptibility to the nudges. Second, they diverge in points earned, being affected by factors like the idiosyncrasy of available prizes. Third, they diverge in information acquisition strategies: e.g., incurring substantial cost to reveal too much information, or selecting without revealing any. Moreover, we show that simple prompting strategies like zero-shot chain of thought (CoT) can shift the choice distribution, and that few-shot prompting with human data can induce greater alignment. Yet none of these methods resolve the models' sensitivity to nudges. Finally, we show how nudges optimized with a human resource-rational model can similarly increase LLM performance for some models. All these findings suggest that behavioral tests are needed before deploying models as agents or assistants acting on behalf of users in complex environments.
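One simple way to operationalize "susceptibility to nudges" is to compare an agent's choice distribution with and without the nudge. The sketch below uses total variation distance and made-up frequencies; both the metric and the numbers are assumptions for exposition, not the paper's analysis.

```python
# Toy susceptibility measure: total variation distance between the choice
# distribution without a nudge and the distribution under a default-option
# nudge. Frequencies below are hypothetical.

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """TV distance between two distributions over the same option set."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

baseline = {"A": 0.40, "B": 0.35, "C": 0.25}  # choice frequencies, no nudge
nudged = {"A": 0.75, "B": 0.15, "C": 0.10}    # option A presented as default

print(f"susceptibility (TV distance): {total_variation(baseline, nudged):.2f}")
# 0.5 * (0.35 + 0.20 + 0.15) = 0.35
```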


Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements

Terragni, Silvia, Cuong, Hoang, Daiber, Joachim, Gudipati, Pallavi, Mendes, Pablo N.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.
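A common way to operationalize such cost-accuracy trade-offs is to keep only the Pareto-efficient models. A minimal sketch, with placeholder model names and numbers rather than results from the paper:

```python
# Keep the models that no other model dominates on both axes
# (higher agreement with human judgments AND lower cost).
models = [
    # (name, agreement with human judgments, cost in USD per 1K judgments)
    ("small-llm",  0.78, 0.20),
    ("small-mllm", 0.74, 0.35),  # adding a visual component may not help
    ("large-llm",  0.85, 2.00),
    ("large-mllm", 0.88, 3.50),
]

def pareto_frontier(candidates):
    """A model is kept if no other model is at least as accurate AND as cheap,
    with a strict improvement on at least one axis."""
    return [
        (name, acc, cost)
        for name, acc, cost in candidates
        if not any(a >= acc and c <= cost and (a > acc or c < cost)
                   for _, a, c in candidates)
    ]

for name, acc, cost in pareto_frontier(models):
    print(f"{name}: agreement={acc:.2f}, cost=${cost:.2f}/1K")
# small-mllm drops out: small-llm is both cheaper and more accurate.
```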


Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas

Sun, Seungjong, Lee, Eungu, Baek, Seo Yeon, Hwang, Seunghyun, Lee, Wonbyung, Nan, Dongyan, Jansen, Bernard J., Kim, Jang Hyun

arXiv.org Artificial Intelligence

This study is the first to explore whether multi-modal large language models (LLMs) can align their behaviors with visual personas, addressing a significant gap in the literature, which predominantly focuses on text-based personas. We developed a novel dataset of 5K fictional avatar images for assignment as visual personas to LLMs and analyzed their negotiation behaviors based on the visual traits depicted in these images, with a particular focus on aggressiveness. The results indicate that LLMs assess the aggressiveness of images in a manner similar to humans and output more aggressive negotiation behaviors when prompted with an aggressive visual persona. Interestingly, the LLMs exhibited more aggressive negotiation behaviors when the opponent's image appeared less aggressive than their own, and less aggressive behaviors when the opponent's image appeared more aggressive.


Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

Walsh, Melanie, Preus, Anna, Antoniak, Maria

arXiv.org Artificial Intelligence

Large language models (LLMs) can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry? We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition. We use this task to reflect on LLMs' current poetic capabilities, as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets. Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.
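As a toy illustration of why form recognition is non-trivial, the naive heuristics below check only the shallowest formal features (line count, end-word suffixes). They are our own assumptions for exposition, not the authors' evaluation code; real detection of meter and rhyme scheme is far harder.

```python
# Two deliberately naive form checks: a line-count test for the sonnet and a
# crude suffix test for end rhyme. Exposition only.

def looks_like_sonnet(poem: str) -> bool:
    """Crude check: a sonnet has exactly 14 non-empty lines."""
    return len([ln for ln in poem.splitlines() if ln.strip()]) == 14

def _last_word(line: str) -> str:
    return line.strip().split()[-1].lower().strip(".,;:!?")

def naive_rhyme(a: str, b: str) -> bool:
    """Treat two lines as rhyming if their final words share a 2-letter suffix."""
    return _last_word(a)[-2:] == _last_word(b)[-2:]

if __name__ == "__main__":
    print(naive_rhyme("Shall I compare thee to a summer's day?",
                      "And summer's lease hath all too short a date:"))  # False
    print(naive_rhyme("So long as men can breathe or eyes can see,",
                      "So long lives this, and this gives life to thee."))  # True
```

Suffix matching already misfires on eye rhymes and slant rhymes, which is part of the benchmarking challenge the paper raises.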


Anthropic says its new Claude 3 AI chatbot scores better on key benchmarks than GPT-4

Engadget

The battle between AI chatbots is more than a two-horse race. Anthropic, the company formed by several ex-OpenAI employees, claims its new Claude 3 language model outperforms ChatGPT and Google's Gemini on several key industry benchmarks. It even hit "near-human" levels on some tasks, the company wrote in a blog post. There are three new chatbots under the Claude 3 umbrella: Haiku, Sonnet, and Opus. Opus is the largest and most powerful of the three and will be available with a $20-per-month subscription via the "Claude Pro" service. It's also multi-modal, so it can work with both text and image inputs, unlike past versions.



Snapchat launches an AI chatbot powered by OpenAI's GPT technology

#artificialintelligence

Snapchat is the latest company to get in on the AI frenzy. The company announced today that it's launching "My AI," a new chatbot running the latest version of OpenAI's GPT technology that it has customized for its users. My AI is now available as an experimental feature for Snapchat+, the social network's $3.99-a-month subscription service. The new chatbot will be pinned to the top of the Chat tab. My AI can do things like help answer a trivia question or write a haiku.


Basho in the machine: Humans find attributes of beauty and discomfort in algorithmic haiku -- ScienceDaily

#artificialintelligence

The gap between human creativity and artificial intelligence seems to be narrowing. Previous studies have compared AI-generated and human-written poems and examined whether people can distinguish between them. Now a study led by Yoshiyuki Ueda at the Kyoto University Institute for the Future of Human and Society has shown AI's potential to create literary art such as haiku -- the shortest poetic form in the world -- rivaling that of humans without human help. Ueda's team compared haiku generated by AI without human intervention, known as human out of the loop (HOTL), with a contrasting method known as human in the loop (HITL). The project involved 385 participants, each of whom evaluated 40 haiku poems -- 20 each of HITL and HOTL -- plus 40 composed entirely by professional haiku writers.