vignette
Left-Leaning Models: How Does AI Evaluate Economic Policy?
Would artificial intelligence (AI) cut interest rates or adopt a conservative monetary policy? Would it deregulate or opt for a more controlled economy? As AI use by economic policymakers, academics, and market participants grows exponentially, it is becoming critical to understand AI preferences over economic policy. However, these preferences are not yet systematically evaluated and remain a black box. This paper conducts a conjoint experiment on leading large language models (LLMs) from OpenAI, Anthropic, and Google, asking them to evaluate economic policy under multi-factor constraints. The results are remarkably consistent across models: most LLMs exhibit a strong preference for high growth, low unemployment, and low inequality over traditional macroeconomic concerns such as low inflation and low public debt. Scenario-specific experiments show that LLMs are sensitive to context but still display strong preferences for low unemployment and low inequality even in monetary-policy settings. Numerical sensitivity tests reveal intuitive responses to quantitative changes but also uncover non-linear patterns such as loss aversion.
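A minimal sketch of how one forced-choice conjoint trial on an LLM might be operationalized, assuming paired policy profiles with independently randomized attribute levels; the attributes, levels, prompt wording, and the `ask_model` stub are invented for illustration and are not the paper's actual design:

```python
# Sketch of a forced-choice conjoint experiment; attributes/levels are invented.
import random
from collections import defaultdict

ATTRIBUTES = {
    "GDP growth": ["1%", "3%", "5%"],
    "unemployment": ["3%", "6%", "9%"],
    "inflation": ["2%", "5%", "8%"],
    "inequality (Gini)": ["0.25", "0.35", "0.45"],
    "public debt / GDP": ["40%", "80%", "120%"],
}

def sample_profile() -> dict:
    """Draw a policy-outcome profile with independently randomized levels."""
    return {attr: random.choice(levels) for attr, levels in ATTRIBUTES.items()}

def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; a coin flip stands in here."""
    return random.choice(["A", "B"])

wins, shown = defaultdict(int), defaultdict(int)
for _ in range(1000):
    a, b = sample_profile(), sample_profile()
    prompt = (f"Which economic outcome is preferable?\n"
              f"Option A: {a}\nOption B: {b}\nAnswer with A or B.")
    chosen = a if ask_model(prompt) == "A" else b
    for profile in (a, b):                 # count every appearance of a level
        for attr_level in profile.items():
            shown[attr_level] += 1
    for attr_level in chosen.items():      # count appearances in the chosen profile
        wins[attr_level] += 1

for attr_level in sorted(shown):           # win rate per attribute level
    print(attr_level, round(wins[attr_level] / shown[attr_level], 3))
```

With a real model behind `ask_model`, the per-level win rates approximate the marginal preferences over attribute levels that the paper reports.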
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > France (0.04)
- (2 more...)
- Research Report > Experimental Study (0.94)
- Research Report > New Finding (0.68)
Orchestrator Multi-Agent Clinical Decision Support System for Secondary Headache Diagnosis in Primary Care
Wu, Xizhi, Garduno-Rapp, Nelly Estefanie, Rousseau, Justin F, Thakkallapally, Mounika, Zhang, Hang, Ji, Yuelyu, Visweswaran, Shyam, Peng, Yifan, Wang, Yanshan
Unlike most primary headaches, secondary headaches need specialized care and can have devastating consequences if not treated promptly. Clinical guidelines highlight several 'red flag' features, such as thunderclap onset, meningismus, papilledema, focal neurologic deficits, signs of temporal arteritis, systemic illness, and the 'worst headache of their life' presentation. Despite these guidelines, determining which patients require urgent evaluation remains challenging in primary care settings. Clinicians often work with limited time, incomplete information, and diverse symptom presentations, which can lead to under-recognition and inappropriate care. We present a large language model (LLM)-based multi-agent clinical decision support system built on an orchestrator-specialist architecture, designed to perform explicit and interpretable secondary headache diagnosis from free-text clinical vignettes. The multi-agent system decomposes diagnosis into seven domain-specialized agents, each producing a structured and evidence-grounded rationale, while a central orchestrator performs task decomposition and coordinates agent routing. We evaluated the multi-agent system using 90 expert-validated secondary headache cases and compared its performance with a single-LLM baseline across two prompting strategies: question-based prompting (QPrompt) and clinical practice guideline-based prompting (GPrompt). We tested five open-source LLMs (Qwen-30B, GPT-OSS-20B, Qwen-14B, Qwen-8B, and Llama-3.1-8B), and found that the orchestrated multi-agent system with GPrompt consistently achieved the highest F1 scores, with larger gains in smaller models. These findings demonstrate that structured multi-agent reasoning improves accuracy beyond prompt engineering alone and offers a transparent, clinically aligned approach for explainable decision support in secondary headache diagnosis.
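A schematic of the orchestrator-specialist control flow the abstract describes, under the assumption of a simple fan-out-and-synthesize pattern; the domain names, prompts, and `call_llm` placeholder are hypothetical stand-ins, not the authors' implementation:

```python
# Schematic orchestrator-specialist loop; domains and prompts are hypothetical.
from dataclasses import dataclass

SPECIALIST_DOMAINS = ["vascular", "infectious", "neoplastic", "inflammatory",
                      "traumatic", "pressure-related", "systemic"]

@dataclass
class AgentReport:
    domain: str
    rationale: str

def call_llm(prompt: str) -> str:
    """Placeholder for a local model endpoint (e.g., Qwen or Llama)."""
    return "no red flags identified in this domain"

def specialist(domain: str, vignette: str) -> AgentReport:
    # Each specialist reasons over one diagnostic domain, keeping its
    # rationale structured and evidence-grounded.
    prompt = f"As a {domain} headache specialist, list red flags in: {vignette}"
    return AgentReport(domain=domain, rationale=call_llm(prompt))

def orchestrate(vignette: str) -> str:
    # The orchestrator decomposes the task, routes it to every specialist,
    # then synthesizes an interpretable final diagnosis.
    reports = [specialist(d, vignette) for d in SPECIALIST_DOMAINS]
    evidence = "\n".join(f"[{r.domain}] {r.rationale}" for r in reports)
    return call_llm(f"Synthesize a secondary-headache diagnosis from:\n{evidence}")

print(orchestrate("55-year-old, thunderclap-onset headache, neck stiffness"))
```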
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Utah (0.04)
- (2 more...)
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.46)
- Health & Medicine > Diagnostic Medicine (1.00)
- Health & Medicine > Therapeutic Area > Neurology > Headaches (0.46)
Echoes of AI Harms: A Human-LLM Synergistic Framework for Bias-Driven Harm Anticipation
Tantalaki, Nicoleta, Vei, Sophia, Vakali, Athena
The growing influence of Artificial Intelligence (AI) systems on decision-making in critical domains has exposed their potential to cause significant harms, often rooted in biases embedded across the AI lifecycle. While existing frameworks and taxonomies document bias or harms in isolation, they rarely establish systematic links between specific bias types and the harms they cause, particularly within real-world sociotechnical contexts. Technical fixes proposed to address AI biases are ill-equipped to do so and are typically applied after a system has been developed or deployed, offering limited preventive value. We propose ECHO, a novel framework for proactive AI harm anticipation through the systematic mapping of AI bias types to harm outcomes across diverse stakeholder and domain contexts. ECHO follows a modular workflow encompassing stakeholder identification, vignette-based presentation of biased AI systems, and dual (human-LLM) harm annotation, integrated within ethical matrices for structured interpretation. This human-centered approach enables early-stage detection of bias-to-harm pathways, guiding AI design and governance decisions from the outset. We validate ECHO in two high-stakes domains (disease diagnosis and hiring), revealing domain-specific bias-to-harm patterns and demonstrating ECHO's potential to support anticipatory governance of AI systems.
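One way the dual (human-LLM) annotation step feeding an ethical matrix could be rendered as a sketch; the bias types, stakeholders, harm labels, and the agreement rule are all illustrative assumptions rather than ECHO's actual taxonomy:

```python
# Toy dual-annotation step feeding an ethical matrix; labels are illustrative.
BIAS_TYPES = ["sampling bias", "label bias", "deployment bias"]
STAKEHOLDERS = ["patient", "clinician", "hospital"]

def llm_annotate(bias: str, stakeholder: str) -> str:
    """Placeholder for an LLM's judgment of the harm this bias causes."""
    return "allocative harm"

def human_annotate(bias: str, stakeholder: str) -> str:
    """Placeholder for a human annotator's judgment."""
    return "allocative harm"

# Ethical matrix: rows are bias types, columns are stakeholders. A cell is
# accepted only when the human and LLM annotations agree; disagreements are
# flagged for adjudication rather than silently resolved.
matrix = {}
for bias in BIAS_TYPES:
    for stakeholder in STAKEHOLDERS:
        h = human_annotate(bias, stakeholder)
        m = llm_annotate(bias, stakeholder)
        matrix[(bias, stakeholder)] = h if h == m else "needs adjudication"

for cell, harm in matrix.items():
    print(cell, "->", harm)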
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
- Europe > Switzerland (0.04)
- (22 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (0.93)
- Health & Medicine > Therapeutic Area (0.92)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Retrieval-Augmented Generation of Pediatric Speech-Language Pathology Vignettes: A Proof-of-Concept Study
Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped that integrates curated domain knowledge with engineered prompt templates and supports five LLMs: three commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and two open-source (Llama 3.2, Qwen 2.5-7B). Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.
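A minimal sketch of the RAG pipeline the abstract outlines, assuming naive keyword-overlap retrieval over a toy knowledge base; the KB entries, prompt template, and `call_llm` stub are invented for illustration:

```python
# Minimal RAG sketch with keyword-overlap retrieval; KB entries are invented.
KNOWLEDGE_BASE = [
    "Phonological disorders in early grades often involve cluster reduction.",
    "Childhood apraxia of speech features inconsistent sound errors.",
    "IEP goals should be measurable and tied to classroom participation.",
]

def call_llm(prompt: str) -> str:
    """Stand-in for any backend (GPT-4o, Claude, Gemini, Llama, Qwen)."""
    return "[generated vignette]"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank knowledge-base passages by naive word overlap with the query."""
    qwords = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(qwords & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate_vignette(scenario: str) -> str:
    # Grounding the prompt in retrieved guidance is what curbs hallucination.
    context = "\n".join(retrieve(scenario))
    return call_llm(f"Using only this clinical guidance:\n{context}\n"
                    f"Write a pediatric SLP case vignette for: {scenario}")

print(generate_vignette("grade 1 student with a phonological disorder"))
```

In a production system the overlap ranking would be replaced by embedding-based retrieval, but the grounding pattern is the same.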
- North America > United States > Nebraska (0.04)
- North America > United States > Iowa (0.04)
- Health & Medicine > Diagnostic Medicine (1.00)
- Education > Educational Setting > K-12 Education (1.00)
- Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.80)
Making LLMs Reliable When It Matters Most: A Five-Layer Architecture for High-Stakes Decisions
Current large language models (LLMs) excel in verifiable domains where outputs can be checked before action but prove less reliable for high-stakes strategic decisions with uncertain outcomes. This gap, driven by mutually reinforcing cognitive biases in both humans and artificial intelligence (AI) systems, threatens the defensibility of valuations and sustainability of investments in the sector. This report describes a framework emerging from systematic qualitative assessment across 7 frontier-grade LLMs and 3 market-facing venture vignettes under time pressure. Detailed prompting specifying decision partnership and explicitly instructing avoidance of sycophancy, confabulation, solution drift, and nihilism achieved initial partnership state but failed to maintain it under operational pressure. Sustaining protective partnership state required an emergent 7-stage calibration sequence, built upon a 4-stage initialization process, within a 5-layer protection architecture enabling bias self-monitoring, human-AI adversarial challenge, partnership state verification, performance degradation detection, and stakeholder protection. Three discoveries resulted: partnership state is achievable through ordered calibration but requires emergent maintenance protocols; reliability degrades when architectural drift and context exhaustion align; and dissolution discipline prevents costly pursuit of fundamentally wrong directions. Cross-model validation revealed systematic performance differences across LLM architectures. This approach demonstrates that human-AI teams can achieve cognitive partnership capable of preventing avoidable regret in high-stakes decisions, addressing return-on-investment expectations that depend on AI systems supporting consequential decision-making without introducing preventable cognitive traps when verification arrives too late.
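Reading the five layers as sequential gates suggests a sketch like the following; the pass/fail predicates and thresholds are invented, since the report does not specify them:

```python
# Hypothetical gate-style rendering of the five protection layers; the
# pass/fail predicates and thresholds below are invented, not the report's.
def bias_self_monitoring(state: dict) -> bool:
    return not state.get("sycophancy_detected", False)

def adversarial_challenge(state: dict) -> bool:
    return state.get("human_challenge_passed", False)

def partnership_verification(state: dict) -> bool:
    return state.get("calibration_complete", False)

def degradation_detection(state: dict) -> bool:
    return state.get("context_tokens", 0) < 100_000  # arbitrary budget

def stakeholder_protection(state: dict) -> bool:
    return state.get("dissolution_criteria_defined", False)

LAYERS = [
    ("bias self-monitoring", bias_self_monitoring),
    ("human-AI adversarial challenge", adversarial_challenge),
    ("partnership state verification", partnership_verification),
    ("performance degradation detection", degradation_detection),
    ("stakeholder protection", stakeholder_protection),
]

def decision_ready(state: dict) -> bool:
    # Every layer must pass before a recommendation is acted on.
    for name, check in LAYERS:
        if not check(state):
            print("blocked at layer:", name)
            return False
    return True

print(decision_ready({"human_challenge_passed": True,
                      "calibration_complete": True,
                      "context_tokens": 40_000,
                      "dissolution_criteria_defined": True}))
```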
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Switzerland (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > United Kingdom > England (0.04)
- Health & Medicine (1.00)
- Banking & Finance > Trading (0.68)
Choreographing Trash Cans: On Speculative Futures of Weak Robots in Public Spaces
Axelsson, Minja, Sikau, Lea Luka
Michio Okada first conceptualised "weak robots", which have limited capabilities themselves, and are framed as objects or "social others" which people are invited to assist and take care of. In Okada's work, such robots are used to invite pro-social behaviour from people, such as encouraging them to pick up trash to assist a trash can robot (Okada, 2022). We conceptualise human-robot interaction (HRI) as a stage where weak robots--designed to be "cute" and vulnerable--play the role of incidental actors that subvert the person engaging with them. Caudwell and Lacey (2020) argue that cuteness as a design choice for robots can encourage users to trust and form relationships with those robots, which introduces ambivalent power dynamics through the production of intimacy. In fact, cuteness can also be seen as a deceptive or "dark" pattern, due to the utilisation of cuteness to prompt affective responses which can be used to collect emotional data, as well as some degree of reduction of user agency (Lacey and Caudwell, 2019). The ability and affordances of cute and weak robots to influence user behaviour merits the discussion of their ethicality, which we do in this paper through design fiction. Unlike traditional HRI research, often confined to laboratory settings, our focus is on spontaneous, real-world interactions that transform everyday environments into sites of performative potential. We argue that the theatricality of these encounters is central to understanding their impact: the presence of a weak and/or cute robot, such as the trash can robot, developed by Okada and the Interaction and Communication Design Lab of the Toyohashi University of Technology, acts as a disruptive interloper that introduces an observer's effect and, thus, affects the human interlocutors. First, we examine the concept of weak robots through the lens of performativity theory as well as concepts of machine (dys)function.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Minnesota (0.04)
- North America > United States > Indiana (0.04)
Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models
Chiu, Christopher, Pitis, Silviu, van der Schaar, Mihaela
Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failing to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.
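A schematic of the multi-turn probing loop VivaBench implies, where findings are revealed only when the agent asks for them; the case content, fixed probing policy, and scoring are toy stand-ins for the benchmark's actual harness:

```python
# Toy multi-turn viva loop: findings are revealed only when the agent probes.
CASE = {
    "presenting": "45-year-old with sudden severe headache",
    "history": "onset during exertion, no prior migraines",
    "examination": "neck stiffness, photophobia",
    "investigations": "CT head: subarachnoid blood",
    "diagnosis": "subarachnoid hemorrhage",
}

def call_llm(prompt: str) -> str:
    """Stand-in for the evaluated LLM agent's final answer."""
    return "subarachnoid hemorrhage"

def agent_policy(revealed: dict) -> str:
    """Toy agent: probes in a fixed order, then commits to a diagnosis."""
    for step in ("history", "examination", "investigations"):
        if step not in revealed:
            return step
    return "diagnose"

revealed = {"presenting": CASE["presenting"]}
for turn in range(6):
    action = agent_policy(revealed)
    if action == "diagnose":
        guess = call_llm(f"Diagnose given: {revealed}")
        print("turns used:", turn, "| correct:", guess == CASE["diagnosis"])
        break
    revealed[action] = CASE[action]  # the examiner reveals only what was asked
```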
- North America > Canada > Ontario > Toronto (0.14)
- Oceania > Australia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.92)
Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs
Kilov, Daniel, Hendy, Caroline, Guyot, Secil Yanik, Snoswell, Aaron J., Lazar, Seth
Moral competence is the ability to act in accordance with moral principles. As large language models (LLMs) are increasingly deployed in situations demanding moral competence, there is increasing interest in evaluating this ability empirically. We review existing literature and identify three significant shortcomings: (i) Over-reliance on prepackaged moral scenarios with explicitly highlighted moral features; (ii) Focus on verdict prediction rather than moral reasoning; and (iii) Inadequate testing of models' (in)ability to recognize when additional information is needed. Grounded in philosophical research on moral skill, we then introduce a novel method for assessing moral competence in LLMs. Our approach moves beyond simple verdict comparisons to evaluate five dimensions of moral competence: identifying morally relevant features, weighting their importance, assigning moral reasons to these features, synthesizing coherent moral judgments, and recognizing information gaps. We conduct two experiments comparing six leading LLMs against non-expert humans and professional philosophers. In our first experiment using ethical vignettes standard to existing work, LLMs generally outperformed non-expert humans across multiple dimensions of moral reasoning. However, our second experiment, featuring novel scenarios designed to test moral sensitivity by embedding relevant features among irrelevant details, revealed a striking reversal: several LLMs performed significantly worse than humans. Our findings suggest that current evaluations may substantially overestimate LLMs' moral reasoning capabilities by eliminating the task of discerning moral relevance from noisy information, which we take to be a prerequisite for genuine moral skill. This work provides a more nuanced framework for assessing AI moral competence and highlights important directions for improving moral competence in advanced AI systems.
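As a sketch of per-dimension scoring against expert annotations, assuming set-valued outputs compared by Jaccard overlap (the paper's actual metric and feature sets may differ):

```python
# Toy per-dimension scorer; feature sets and the overlap metric are invented.
DIMENSIONS = [
    "identify morally relevant features", "weight their importance",
    "assign moral reasons", "synthesize a judgment",
    "recognize information gaps",
]

def overlap(pred: set, ref: set) -> float:
    """Jaccard overlap; dimensions with no annotations score 1.0 vacuously."""
    return len(pred & ref) / len(pred | ref) if pred | ref else 1.0

expert = {
    "identify morally relevant features": {"consent", "harm to bystander"},
    "recognize information gaps": {"was consent informed?"},
}
model = {
    "identify morally relevant features": {"consent", "cost", "harm to bystander"},
    "recognize information gaps": set(),  # the model missed the gap entirely
}

for dim in DIMENSIONS:
    score = overlap(model.get(dim, set()), expert.get(dim, set()))
    print(f"{dim}: {score:.2f}")
```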
- Europe > United Kingdom (0.04)
- North America > Canada (0.04)
- Asia > India (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area (0.93)
- Education > Educational Setting > Higher Education (0.46)