chat history
Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Shen, Shannon Zejiang, Chen, Valerie, Gu, Ken, Ross, Alexis, Ma, Zixian, Ross, Jillian, Gu, Alex, Si, Chenglei, Chi, Wayne, Peng, Andi, Shen, Jocelyn J, Talwalkar, Ameet, Wu, Tongshuang, Sontag, David
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
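The scaling idea in this abstract can be sketched numerically: compare how output quality grows as the user invests more turns of effort. The quality scores below are invented for illustration; the paper's actual metrics and agents are not reproduced here.

```python
# Hypothetical sketch of "collaborative effort scaling": track the marginal
# quality gain an agent delivers per additional turn of user involvement.
# All numbers are illustrative assumptions, not results from the paper.

def effort_scaling(qualities):
    """Return per-turn marginal gains for a sequence of quality scores,
    where qualities[i] is output quality after i+1 turns of user effort."""
    return [round(b - a, 2) for a, b in zip(qualities, qualities[1:])]

# A one-shot "completion" agent plateaus quickly; a "collaborative" agent
# keeps converting additional user effort into quality.
completion_agent = [0.70, 0.72, 0.72, 0.73]
collaborative_agent = [0.55, 0.70, 0.80, 0.88]

print(effort_scaling(completion_agent))     # small, flat gains
print(effort_scaling(collaborative_agent))  # sustained gains per turn
```

Under this framing, a flat marginal-gain curve diagnoses an agent that cannot use extra human engagement, even if its first-turn output looks strong.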
HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems in the Legal Domain
Anaokar, Spandan, Ganatra, Shrey, Kashid, Harshvivek, Bhattacharyya, Swapnil, Nair, Shruti, Sekhar, Reshma, Manohar, Siddharth, Hemrajani, Rahul, Bhattacharyya, Pushpak
Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications. This work addresses hallucination reduction in consumer grievance chatbots built on LLaMA 3.1 8B Instruct, a compact model frequently used in industry. We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 68.92%, outperforming baseline detectors by 22.47%. Benchmarking five hallucination mitigation architectures, we find that AgentBot minimizes hallucinations to 0.4159 per turn while maintaining the highest token accuracy (96.13%), making it the most effective mitigation strategy. Our findings provide a scalable framework for hallucination mitigation, demonstrating that optimized inference strategies can significantly improve factual accuracy.
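The two headline metrics in this abstract, hallucinations per turn and token accuracy, can be sketched as simple aggregates over detector output. The detector itself is an LLM in the paper; here it is stubbed with precomputed per-turn counts, and the data layout is an illustrative assumption.

```python
# Minimal sketch of the per-turn hallucination metrics reported above.
# Each turn is (hallucinated_spans, correct_tokens, total_tokens); the
# tuples below are invented for illustration.

def hallucination_metrics(turns):
    """Return (hallucinations per turn, overall token accuracy)."""
    per_turn = sum(h for h, _, _ in turns) / len(turns)
    token_acc = sum(c for _, c, _ in turns) / sum(t for _, _, t in turns)
    return per_turn, token_acc

turns = [(1, 95, 100), (0, 50, 50), (1, 47, 50)]
rate, acc = hallucination_metrics(turns)
print(rate, acc)
```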
Do Students Rely on AI? Analysis of Student-ChatGPT Conversations from a Field Study
Zheng, Jiayu, Hao, Lingxin, Lu, Kelun, Garg, Ashi, Reese, Mike, Yap, Melo-Jean, Wang, I-Jeng, Wu, Xingyun, Huang, Wenrui, Hoffman, Jenna, Kelly, Ariane, Le, My, Zhang, Ryan, Lin, Yanyu, Faayez, Muhammad, Liu, Anqi
This study explores how college students interact with generative AI (ChatGPT-4) during educational quizzes, focusing on reliance and predictors of AI adoption. Conducted at the early stages of ChatGPT implementation, when students had limited familiarity with the tool, this field study analyzed 315 student-AI conversations during a brief, quiz-based scenario across various STEM courses. A novel four-stage reliance taxonomy was introduced to capture students' reliance patterns, distinguishing AI competence, relevance, adoption, and students' final answer correctness. Three findings emerged. First, students exhibited overall low reliance on AI, and many could not effectively use AI for learning. Second, negative reliance patterns often persisted across interactions, highlighting students' difficulty in shifting strategies after unsuccessful initial experiences. Third, certain behavioral metrics strongly predicted AI reliance, pointing to behavioral mechanisms that may explain AI adoption. These findings carry critical implications for ethical AI integration in education and the broader field. They emphasize the need for enhanced onboarding processes to improve students' familiarity with and effective use of AI tools. Furthermore, AI interfaces should be designed with reliance-calibration mechanisms to encourage appropriate reliance. Ultimately, this research advances understanding of AI reliance dynamics, providing foundational insights for ethically sound and cognitively enriching AI practices.
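The four-stage taxonomy described above (AI competence, relevance, adoption, final correctness) can be sketched as labels on a single exchange. The field names and the over-/under-reliance definitions are illustrative assumptions, not the paper's coding scheme.

```python
# Hypothetical sketch of the four-stage reliance taxonomy: label one
# student-AI exchange, then derive common reliance failure modes.

def reliance_pattern(ai_correct, ai_relevant, student_adopted, final_correct):
    """Label an exchange along the four taxonomy stages (all booleans)."""
    return {
        "ai_competent": ai_correct,
        "ai_relevant": ai_relevant,
        "adopted": student_adopted,
        "final_correct": final_correct,
        # Over-reliance: adopting an incorrect AI answer and getting it wrong.
        "over_reliance": student_adopted and not ai_correct and not final_correct,
        # Under-reliance: ignoring a correct, relevant AI answer and failing.
        "under_reliance": (not student_adopted) and ai_correct
                          and ai_relevant and not final_correct,
    }

p = reliance_pattern(ai_correct=True, ai_relevant=True,
                     student_adopted=False, final_correct=False)
print(p["under_reliance"])  # the "could not effectively use AI" case
```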
OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Wang, Weixuan, Han, Dongge, Diaz, Daniel Madrigal, Xu, Jin, Rühle, Victor, Rajmohan, Saravan
Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agents to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts than existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
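The distinction the abstract draws between atomic and long-horizon tasks can be sketched as a simple property: whether the instruction alone carries every fact the agent needs, or whether some facts live only in earlier dialogue. The task schema and example fields below are assumptions for illustration, not OdysseyBench's actual format.

```python
# Sketch of atomic vs. long-horizon tasks: a long-horizon task references
# facts ("the report we discussed") that appear only in interaction history.

def solvable_without_history(task):
    """True if the instruction itself contains all required facts."""
    return all(fact in task["instruction"] for fact in task["required_facts"])

atomic = {
    "instruction": "Email the Q3 report to alice@example.com",
    "required_facts": ["Q3 report", "alice@example.com"],
}
long_horizon = {
    "instruction": "Email the report we discussed to the client",
    "required_facts": ["Q3 report", "alice@example.com"],  # only in history
}

print(solvable_without_history(atomic))        # instruction is self-contained
print(solvable_without_history(long_horizon))  # needs dialogue history
```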
Dangers of oversharing with AI tools
Have you ever stopped to think about how much your chatbot knows about you? Over the years, tools like ChatGPT have become incredibly adept at learning your preferences, habits and even some of your deepest secrets. But while this can make them seem more helpful and personalized, it also raises some serious privacy concerns. As much as you learn from these AI tools, they learn just as much about you.
Detecting Ambiguities to Guide Query Rewrite for Robust Conversations in Enterprise AI Assistants
Tanjim, Md Mehrab, Chen, Xiang, Bursztyn, Victor S., Bhattacharya, Uttaran, Mai, Tung, Muppala, Vaishnavi, Maharaj, Akash, Mitra, Saayan, Koh, Eunyee, Li, Yunyao, Russell, Ken
Multi-turn conversations with an Enterprise AI Assistant can be challenging due to conversational dependencies in questions, leading to ambiguities and errors. To address this, we propose an NLU-NLG framework for ambiguity detection and resolution through automatic query reformulation, and introduce a new task called "Ambiguity-guided Query Rewrite." To detect ambiguities, we develop a taxonomy based on real user conversation logs and draw on its insights to design rules and extract features for a classifier, which outperforms LLM-based baselines at detecting ambiguous queries. Furthermore, coupling the query rewrite module with our ambiguity-detecting classifier shows that this end-to-end framework can effectively mitigate ambiguities without risking unnecessary insertion of unwanted phrases into clear queries, improving the overall performance of the AI Assistant. Given its significance, the framework has been deployed in a real-world application, the Adobe Experience Platform AI Assistant.
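The gating behavior described above, rewriting only queries the classifier flags as ambiguous, can be sketched with a rule-style check. The cue list and the string-replacement rewrite stub are illustrative assumptions; the deployed system uses learned features and an NLG rewriter.

```python
# Hypothetical sketch of "Ambiguity-guided Query Rewrite": a lightweight
# classifier flags context-dependent queries, and only those are rewritten,
# so clear queries pass through untouched.

PRONOUN_CUES = ("it", "that", "those", "this")

def is_ambiguous(query):
    """Rule-style check: unresolved references suggest context dependence."""
    words = query.lower().split()
    return any(cue in words for cue in PRONOUN_CUES) or "the same" in query.lower()

def rewrite(query, last_topic):
    # Stub: a real system would call an NLG model with the dialogue history.
    return query.replace("it", last_topic) if is_ambiguous(query) else query

print(rewrite("How do I filter it by date?", "the audience segment"))
print(rewrite("How do I create a new segment?", "the audience segment"))
```

The second call illustrates the abstract's point: gating the rewriter avoids inserting unwanted phrases into queries that were never ambiguous.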
Few-shot Policy (de)composition in Conversational Question Answering
Erwin, Kyle, Axelrod, Guy, Chang, Maria, Fokoue, Achille, Crouse, Maxwell, Dan, Soham, Gao, Tian, Uceda-Sosa, Rosario, Makondo, Ndivhuwo, Khan, Naweed, Gray, Alexander
The task of policy compliance detection (PCD) is to determine whether a scenario complies with a set of written policies. In a conversational setting, the results of PCD can indicate whether clarifying questions must be asked to determine compliance status. Existing approaches usually claim reasoning capabilities that are latent, or require a large amount of annotated data. In this work, we propose logical decomposition for policy compliance (LDPC): a neuro-symbolic framework to detect policy compliance using large language models (LLMs) in a few-shot setting. By selecting only a few exemplars alongside recently developed prompting techniques, we demonstrate that our approach soundly reasons about policy compliance conversations by extracting sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. The formulation of explicit logic graphs can in turn help answer PCD-related questions with increased transparency and explainability. We apply this approach to the popular PCD and conversational machine reading benchmark, ShARC, and show competitive performance with no task-specific finetuning. We also leverage the inherently interpretable architecture of LDPC to understand where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges involved with reasoning for conversational question answering.
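The decomposition step described above, sub-questions with assigned truth values composed through explicit logic, can be sketched with a three-valued evaluation: an unknown sub-answer means a clarifying question is needed. The expression encoding and three-valued semantics here are illustrative assumptions, not LDPC's actual representation.

```python
# Sketch of logical composition over sub-question answers for policy
# compliance: True/False decide compliance, None signals "ask the user".

def eval_policy(expr, answers):
    """Three-valued AND/OR over sub-question answers (True/False/None)."""
    op, subqs = expr
    vals = [answers.get(q) for q in subqs]
    if op == "AND":
        if False in vals:
            return False
        return None if None in vals else True
    if op == "OR":
        if True in vals:
            return True
        return None if None in vals else False

# Policy: compliant if (is_resident AND has_receipt).
policy = ("AND", ["is_resident", "has_receipt"])
print(eval_policy(policy, {"is_resident": True, "has_receipt": None}))  # None: ask
print(eval_policy(policy, {"is_resident": False}))                      # non-compliant
```

Note that a single False sub-answer settles an AND policy even when other sub-answers are unknown, which is what lets the framework avoid unnecessary clarifying questions.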
Everything you can do with Microsoft's Copilot AI assistant on Windows
It's impossible to ignore the rapid rise in the capabilities of artificial intelligence tools in recent months. Microsoft hasn't been shy in stuffing Windows full of AI features: Windows computers now come with a dedicated key for launching Copilot, Microsoft's AI assistant, which has been integrated into the operating system. We'll guide you through everything you can use Copilot for on your Windows laptop or desktop, and how you can get it up and running. We'll also explain the difference between Copilot and a Copilot PC, which is a label you might have spotted if you've been shopping for a Windows computer lately. When it comes to the Copilot assistant inside Windows, it's very similar to the Copilot app on the web.