appropriateness
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Law (0.93)
Beyond Awareness: Investigating How AI and Psychological Factors Shape Human Self-Confidence Calibration
Cau, Federico Maria, Spano, Lucio Davide
Human-AI collaboration outcomes depend strongly on human self-confidence calibration, which drives reliance on or resistance to the AI's suggestions. This work presents two studies examining whether calibration of self-confidence before decision tasks and low versus high levels of Need for Cognition (NFC) and Actively Open-Minded Thinking (AOT) lead to differences in decision accuracy, self-confidence appropriateness during the tasks, and metacognitive perceptions (global and affective). The first study presents strategies to identify well-calibrated users and compares decision accuracy and the appropriateness of self-confidence across NFC and AOT levels. The second study investigates the effects of calibrated self-confidence in AI-assisted decision-making (no AI, two-stage AI, and personalized AI), again considering different NFC and AOT levels. Our results show the importance of human self-confidence calibration and psychological traits when designing AI-assisted decision systems. We further propose design recommendations to address the challenge of calibrating self-confidence and supporting tailored, user-centric AI that accounts for individual traits.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.46)
AURA: Development and Validation of an Augmented Unplanned Removal Alert System using Synthetic ICU Videos
Seo, Junhyuk, Moon, Hyeyoon, Jung, Kyu-Hwan, Oh, Namkee, Kim, Taerim
Unplanned extubation (UE), the unintended removal of an airway tube, remains a critical patient safety concern in intensive care units (ICUs), often leading to severe complications or death. Real-time UE detection has been limited, largely due to the ethical and privacy challenges of obtaining annotated ICU video data. We propose Augmented Unplanned Removal Alert (AURA), a vision-based risk detection system developed and validated entirely on a fully synthetic video dataset. By leveraging text-to-video diffusion, we generated diverse and clinically realistic ICU scenarios capturing a range of patient behaviors and care contexts. The system applies pose estimation to identify two high-risk movement patterns: collision, defined as hand entry into spatial zones near airway tubes, and agitation, quantified by the velocity of tracked anatomical keypoints. Expert assessments confirmed the realism of the synthetic data, and performance evaluations showed high accuracy for collision detection and moderate performance for agitation recognition. This work demonstrates a novel pathway for developing privacy-preserving, reproducible patient safety monitoring systems with potential for deployment in intensive care settings.
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
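The two movement patterns named in the AURA abstract above (collision as hand entry into a zone near the airway tube, agitation as keypoint velocity) can be illustrated with a small sketch. This is not the authors' implementation: the rectangular zone, the function names, the pixel-per-second units, and the speed threshold are all assumptions made for illustration, and the keypoints are assumed to come from some pose estimator that is not shown.

```python
from math import hypot

def hand_in_zone(hand, zone):
    """Return True if the hand keypoint (x, y) lies inside a rectangular zone.

    `zone` is (x_min, y_min, x_max, y_max); a rectangle is an assumption,
    the abstract only says "spatial zones near airway tubes".
    """
    x, y = hand
    x_min, y_min, x_max, y_max = zone
    return x_min <= x <= x_max and y_min <= y <= y_max

def mean_keypoint_speed(prev, curr, dt):
    """Mean speed of matched keypoints between two frames (pixels/second)."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    if not prev or len(prev) != len(curr):
        raise ValueError("prev and curr must be non-empty and the same length")
    speeds = [
        hypot(cx - px, cy - py) / dt
        for (px, py), (cx, cy) in zip(prev, curr)
    ]
    return sum(speeds) / len(speeds)

def assess_frame(prev, curr, hand_index, zone, dt, speed_threshold):
    """Flag both risk patterns for the current frame.

    `speed_threshold` is a hypothetical tuning parameter, not a value
    reported in the abstract.
    """
    return {
        "collision": hand_in_zone(curr[hand_index], zone),
        "agitation": mean_keypoint_speed(prev, curr, dt) > speed_threshold,
    }
```

For example, with keypoints `[(0, 0), (10, 10)]` in one frame and `[(3, 4), (10, 10)]` in the next, one second apart, the first keypoint moves 5 pixels and the second does not move, so the mean speed is 2.5 pixels per second.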
Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study
Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.
- North America > United States > Nebraska (0.04)
- North America > United States > Iowa (0.04)
- Health & Medicine > Diagnostic Medicine (1.00)
- Education > Educational Setting > K-12 Education (1.00)
- Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.80)
Reading Between the Lines: The One-Sided Conversation Problem
Ebert, Victoria, Singh, Rishabh, Chen, Tuochao, Smith, Noah A., Gollakota, Shyamnath
Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Virginia (0.04)
- Personal > Interview (0.67)
- Research Report > New Finding (0.45)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play
Shi, Jiatong, Han, Jionghao, Lu, Yichen, Pascual, Santiago, Wu, Pengfei, Cui, Chenye, Watanabe, Shinji, Weng, Chao, Zhou, Cong
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Leisure & Entertainment (1.00)
- Media (0.92)
- Information Technology (0.67)
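The Speech-DRAME abstract above reports agreement between automatic judges and human ratings as a Pearson correlation. As a reminder of what that number measures, here is a minimal sketch of the coefficient in plain Python. The function name and the toy scores are assumptions for illustration; they are not data or code from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical judge scores and human ratings for three samples.
judge_scores = [1.0, 2.0, 3.0]
human_ratings = [1.0, 3.0, 2.0]
agreement = pearson(judge_scores, human_ratings)
```

A value near 1 means the judge ranks and spaces samples much as the human raters do, and a value near 0 means little linear relationship. Python 3.10 and later also provide `statistics.correlation` for the same computation.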
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
Shen, Xinjie, Li, Mufei, Li, Pan
The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically-aware alignment. Code and datasets will be available at https://github.com/Graph-COM/EAPrivacy.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.68)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Law (0.93)
Integrated Framework for LLM Evaluation with Answer Generation
Lee, Sujeong, Lee, Hayoung, Heo, Seongsoo, Choi, Wonik
Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED), which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > South Korea > Incheon > Incheon (0.04)
- Asia > China > Hong Kong (0.04)
- Law (0.68)
- Education > Educational Setting > K-12 Education (0.46)
Multilingual Dialogue Generation and Localization with Dialogue Act Scripting
Vasselli, Justin, Kardinata, Eunike Andriani, Sakai, Yusuke, Watanabe, Taro
Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)