Mozannar, Hussein
Navigating Rifts in Human-LLM Grounding: Study and Benchmark
Shaikh, Omar, Mozannar, Hussein, Bansal, Gagan, Fourney, Adam, Horvitz, Eric
Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding -- the process by which conversation participants establish mutual understanding -- can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, early grounding failures predicted later interaction breakdowns. Building on these insights, we introduce RIFTS: a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on RIFTS, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention that mitigates grounding failures.
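As a rough illustration of the kind of analysis described above, the sketch below computes how often each speaker initiates a given grounding act from dialogue turns that have already been annotated. The act labels, the Turn structure, and the toy data are hypothetical placeholders, not the paper's taxonomy, annotation models, or datasets.

# Minimal sketch: per-speaker rates of grounding acts from annotated turns.
# Act labels and data are illustrative placeholders, not the paper's taxonomy or logs.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "human" or "assistant"
    act: str       # e.g. "clarification", "follow_up", "answer"

def act_rate(turns: list[Turn], speaker: str, act: str) -> float:
    """Fraction of a speaker's turns labeled with the given grounding act."""
    own = [t for t in turns if t.speaker == speaker]
    if not own:
        return 0.0
    counts = Counter(t.act for t in own)
    return counts[act] / len(own)

if __name__ == "__main__":
    toy_log = [
        Turn("human", "clarification"),
        Turn("assistant", "answer"),
        Turn("human", "follow_up"),
        Turn("assistant", "answer"),
        Turn("assistant", "clarification"),
        Turn("human", "answer"),
    ]
    for spk in ("human", "assistant"):
        print(spk, round(act_rate(toy_log, spk, "clarification"), 2))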
Challenges in Human-Agent Communication
Bansal, Gagan, Vaughan, Jennifer Wortman, Amershi, Saleema, Horvitz, Eric, Fourney, Adam, Mozannar, Hussein, Dibia, Victor, Weld, Daniel S.
Remarkable advancements in modern generative foundation models have enabled the development of sophisticated and highly capable autonomous agents that can observe their environment, invoke tools, and communicate with other agents to solve problems. Although such agents can communicate with users through natural language, their complexity and wide-ranging failure modes present novel challenges for human-AI interaction. Building on prior research and informed by a communication grounding perspective, we contribute to the study of human-agent communication by identifying and analyzing twelve key communication challenges that these systems pose. These include challenges in conveying information from the agent to the user, challenges in enabling the user to convey information to the agent, and overarching challenges that need to be considered across all human-agent communication. We illustrate each challenge through concrete examples and identify open directions of research. Our findings provide insights into critical gaps in human-agent communication research and serve as an urgent call for new design patterns, principles, and guidelines to support transparency and control in these systems.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Fourney, Adam, Bansal, Gagan, Mozannar, Hussein, Tan, Cheng, Salinas, Eduardo, Zhu, Erkang, Niedtner, Friederike, Proebsting, Grace, Bassman, Griffin, Gerrits, Jack, Alber, Jacob, Chang, Peter, Loynd, Ricky, West, Robert, Dibia, Victor, Awadallah, Ahmed, Kamar, Ece, Hosn, Rafah, Amershi, Saleema
Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves performance statistically competitive with the state of the art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner -- which is important when agents' actions have side-effects. Magentic-One, AutoGenBench, and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis, are available at https://aka.ms/magentic-one
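To make the orchestration pattern concrete, here is a minimal, hypothetical sketch of a lead agent that plans, delegates steps to specialized workers, and re-plans on failure. It is not the Magentic-One or AutoGen implementation; the agent names, interfaces, and planner are placeholders.

# Hypothetical orchestrator loop: plan, delegate, re-plan on error.
# Not the Magentic-One/AutoGen API; all names here are illustrative.
from typing import Callable

class WorkerAgent:
    def __init__(self, name: str, run: Callable[[str], str]):
        self.name = name
        self.run = run  # executes one instruction, returns an observation

class Orchestrator:
    def __init__(self, workers: dict[str, WorkerAgent], max_rounds: int = 5):
        self.workers = workers
        self.max_rounds = max_rounds

    def plan(self, task: str) -> list[tuple[str, str]]:
        # Placeholder planner: in practice an LLM would produce these steps.
        return [("coder", f"write code for: {task}"),
                ("web", f"verify result of: {task}")]

    def solve(self, task: str) -> list[str]:
        observations = []
        steps = self.plan(task)
        for _ in range(self.max_rounds):
            try:
                for worker_name, instruction in steps:
                    observations.append(self.workers[worker_name].run(instruction))
                return observations          # all steps succeeded
            except RuntimeError as err:      # a worker failed: re-plan and retry
                observations.append(f"error: {err}")
                steps = self.plan(task + " (recover from error)")
        return observations

if __name__ == "__main__":
    workers = {
        "coder": WorkerAgent("coder", lambda msg: f"[coder] did: {msg}"),
        "web": WorkerAgent("web", lambda msg: f"[web] checked: {msg}"),
    }
    print(Orchestrator(workers).solve("summarize a CSV file"))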
Recent Advances, Applications, and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2023 Symposium
Jeong, Hyewon, Jabbour, Sarah, Yang, Yuzhe, Thapta, Rahul, Mozannar, Hussein, Han, William Jongwon, Mehandru, Nikita, Wornow, Michael, Lialin, Vladislav, Liu, Xin, Lozano, Alejandro, Zhu, Jiacheng, Kocielnik, Rafal Dariusz, Harrigian, Keith, Zhang, Haoran, Lee, Edward, Vukadinovic, Milos, Balagopalan, Aparna, Jeanselme, Vincent, Matton, Katherine, Demirel, Ilker, Fries, Jason, Rashidi, Parisa, Beaulieu-Jones, Brett, Xu, Xuhai Orson, McDermott, Matthew, Naumann, Tristan, Agrawal, Monica, Zitnik, Marinka, Ustun, Berk, Choi, Edward, Yeom, Kristen, Gursoy, Gamze, Ghassemi, Marzyeh, Pierson, Emma, Chen, George, Kanjilal, Sanjat, Oberst, Michael, Zhang, Linying, Singh, Harvineet, Hartvigsen, Tom, Zhou, Helen, Okolo, Chinasa T.
The third ML4H symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. Encouraged by the successful virtual roundtables in the previous year, we organized eleven in-person roundtables and four virtual roundtables at ML4H 2023. The organization of the research roundtables at the conference involved 17 Senior Chairs and 19 Junior Chairs across 11 tables. Each roundtable session included invited senior chairs (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with interest in the session's topic. Herein we detail the organization process and compile takeaways from these roundtable discussions, including recent advances, applications, and open challenges for each topic. We conclude with a summary and lessons learned across all roundtables. This document serves as a comprehensive review paper, summarizing the recent advancements in machine learning for healthcare as contributed by foremost researchers in the field.
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers
Mozannar, Hussein, Chen, Valerie, Alsobay, Mohammed, Das, Subhro, Zhao, Sebastian, Wei, Dennis, Nagireddy, Manish, Sattigeri, Prasanna, Talwalkar, Ameet, Sontag, David
Evaluation of large language models (LLMs) for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), which measure the ability of LLMs to generate complete code that passes unit tests. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks translate to gains in programmer productivity when coding with LLMs, including time spent coding. In addition to static benchmarks, we investigate the utility of preference metrics that might be used as proxies to measure LLM helpfulness, such as code acceptance or copy rates. To do so, we introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers through either autocomplete or chat support. We conducted a user study (N=213) using RealHumanEval in which users interacted with six LLMs of varying base model performance. Although static benchmarks do not incorporate humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity; however, gaps in benchmark versus human performance are not proportional -- a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better, human-centric proxy signals. We also open-source RealHumanEval to enable human-centric evaluation of new models, and we release the study data to facilitate efforts to improve code models.
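The sketch below illustrates, on made-up interaction logs, the kind of proxy signals mentioned above (acceptance and copy rates) alongside a productivity measure such as time spent per task; the log schema and numbers are invented for illustration and are not the RealHumanEval data format.

# Illustrative computation of proxy preference metrics vs. a productivity
# measure from coding-assistant interaction logs. Schema and data are made up.
from statistics import mean

logs = [  # one record per suggestion event
    {"accepted": True,  "copied": False, "task_seconds": 310},
    {"accepted": False, "copied": True,  "task_seconds": 545},
    {"accepted": True,  "copied": True,  "task_seconds": 290},
    {"accepted": False, "copied": False, "task_seconds": 620},
]

acceptance_rate = mean(1.0 if r["accepted"] else 0.0 for r in logs)
copy_rate = mean(1.0 if r["copied"] else 0.0 for r in logs)
avg_task_time = mean(r["task_seconds"] for r in logs)

print(f"acceptance rate: {acceptance_rate:.2f}")
print(f"copy rate:       {copy_rate:.2f}")
print(f"avg task time:   {avg_task_time:.0f}s")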
Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study
Mannhardt, Niklas, Bondi-Kelly, Elizabeth, Lam, Barbara, O'Connell, Chloe, Asiedu, Mercy, Mozannar, Hussein, Agrawal, Monica, Buendia, Alejandro, Urman, Tatiana, Riaz, Irbaz B., Ricciardi, Catherine E., Ghassemi, Marzyeh, Sontag, David
Patients derive numerous benefits from reading their clinical notes, including an increased sense of control over their health and improved understanding of their care plan. However, complex medical concepts and jargon within clinical notes hinder patient comprehension and may lead to anxiety. We developed a patient-facing tool to make clinical notes more readable, leveraging large language models (LLMs) to simplify, extract information from, and add context to notes. We prompt-engineered GPT-4 to perform these augmentation tasks on real clinical notes donated by breast cancer survivors and synthetic notes generated by a clinician, a total of 12 notes with 3868 words. In June 2023, 200 female-identifying US-based participants were randomly assigned three clinical notes with varying levels of augmentation from our tool. Participants answered questions about each note, evaluating their understanding of follow-up actions and self-reported confidence. We found that augmentations were associated with a significant increase in action understanding score (0.63 $\pm$ 0.04 for select augmentations, compared to 0.54 $\pm$ 0.02 for the control; p=0.002). In-depth interviews with self-identifying breast cancer patients (N=7) were also conducted via video conferencing. Augmentations, especially definitions, elicited positive responses among the seven participants, with some concerns about relying on LLMs. Augmentations were evaluated for errors by clinicians, and we found that misleading errors do occur, with errors more common in real donated notes than in synthetic notes, illustrating the importance of carefully written clinical notes. Augmentations improve some but not all readability metrics. This work demonstrates the potential of LLMs to improve patients' experience with clinical notes at a lower burden to clinicians. However, having a human in the loop is important to correct potential model errors.
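As a sketch of the augmentation idea (not the study's actual prompts or tool), the snippet below asks a chat model to simplify a clinical note and define jargon inline. It assumes the openai Python SDK with an API key in the environment; the prompt wording and model choice are hypothetical.

# Hypothetical note-augmentation call: simplify a clinical note and define
# jargon. Prompt and model are illustrative, not the study's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_note(note_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("Rewrite the clinical note in plain language a patient "
                         "can understand. Keep all medical facts, add a short "
                         "parenthetical definition after each jargon term, and "
                         "end with a bulleted list of follow-up actions.")},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content

# Example (synthetic) usage:
# print(augment_note("Pt s/p lumpectomy, plan adjuvant RT; f/u with onc in 2 wks."))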
Effective Human-AI Teams via Learned Natural Language Rules and Onboarding
Mozannar, Hussein, Lee, Jimin J, Wei, Dennis, Sattigeri, Prasanna, Das, Subhro, Sontag, David
People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules, grounded in data regions and described in natural language, that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space where prior human behavior should be corrected. Each region is then described using a large language model in an iterative and contrastive procedure. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
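A rough sketch of the region-discovery idea, under stated assumptions: given example embeddings and a record of whether the human's reliance decision was correct, flag embedding-space neighborhoods where mistakes concentrate. This uses a plain k-nearest-neighbors pass, not the paper's algorithm; data, thresholds, and names are placeholders.

# Illustrative region discovery: flag embedding neighborhoods where the
# human's past reliance decisions were mostly wrong. Not the paper's
# algorithm; thresholds and data are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))              # example embeddings (placeholder)
human_correct = rng.random(200) > 0.3       # was the human's reliance decision right?

k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X)
_, idx = nn.kneighbors(X)                   # idx[i] = indices of i's k nearest points

# A point anchors a candidate "region" if its neighborhood error rate is high.
neighborhood_error = 1.0 - human_correct[idx].mean(axis=1)
region_anchors = np.where(neighborhood_error > 0.5)[0]

print(f"{len(region_anchors)} candidate regions where behavior should be corrected")
# In the paper, each such region would then be described in natural language
# (e.g., via an iterative, contrastive LLM procedure) and taught during onboarding.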
In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer
Cao, Yuzhou, Mozannar, Hussein, Feng, Lei, Wei, Hongxin, An, Bo
Enabling machine learning classifiers to defer their decision to a downstream expert when the expert is more accurate will ensure improved safety and performance. This objective can be achieved with the learning-to-defer framework, which aims to jointly learn how to classify and how to defer to the expert. In recent studies, it has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring, which makes them uncalibrated. However, it remains unknown whether this is due to the widely used softmax parameterization and whether we can find a softmax-based estimator that is both statistically consistent and yields valid probability estimates. In this work, we first show that the miscalibrated and unbounded estimator in prior literature stems from the symmetric nature of the surrogate losses used, not from softmax. We then propose a novel statistically consistent, asymmetric softmax-based surrogate loss that produces valid estimates without the issue of unboundedness. We further analyze the non-asymptotic properties of our method and empirically validate its performance and calibration on benchmark datasets.
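For context (written from memory as a reference point, not the asymmetric loss this paper proposes), a widely used symmetric softmax-parameterized surrogate in the learning-to-defer literature scores class logits $g_1,\dots,g_K$ together with a deferral logit $g_{K+1}$ as

\ell_{\mathrm{CE}}(g; x, y, m) = -\log \frac{e^{g_y(x)}}{\sum_{j=1}^{K+1} e^{g_j(x)}} \;-\; \mathbb{1}[m = y]\, \log \frac{e^{g_{K+1}(x)}}{\sum_{j=1}^{K+1} e^{g_j(x)}},

where $m$ is the expert's prediction. Because the deferral logit enters this loss symmetrically with the class logits, the probability-of-deferral estimate it induces is not guaranteed to lie in $[0,1]$; the asymmetric surrogate proposed in the paper instead treats the deferral output differently from the class outputs so that the induced estimate remains bounded while the loss stays statistically consistent (see the paper for the exact form).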
Closing the Gap in High-Risk Pregnancy Care Using Machine Learning and Human-AI Collaboration
Mozannar, Hussein, Utsumi, Yuria, Chen, Irene Y., Gervasi, Stephanie S., Ewing, Michele, Smith-McLallen, Aaron, Sontag, David
High-risk pregnancy (HRP) is a pregnancy complicated by factors that can adversely affect outcomes of the mother or the infant. Health insurers use algorithms to identify members who would benefit from additional clinical support. We aimed to build machine learning algorithms to identify pregnant patients and triage them by risk of complication to assist care management. In this retrospective study, we trained a hybrid Lasso-regularized classifier to predict whether a patient is currently pregnant using claims data from 36,735 insured members of Independence Blue Cross (IBC), a health insurer in Philadelphia. We then trained a linear classifier on a subset of 12,243 members to predict whether a patient will develop gestational diabetes or gestational hypertension. These algorithms were developed in cooperation with the care management team at IBC and integrated into their dashboard. In small user studies with the nurses, we evaluated the impact of integrating our algorithms into their workflow. We find that the proposed model predicts an earlier pregnancy start date for 3.54% (95% CI 3.05-4.00) of patients with complications, and never a later one, compared to using only a set of pre-defined codes that indicate the start of pregnancy, at the expense of a 5.58% (95% CI 4.05-6.40) false positive rate. The classifier for predicting complications has an AUC of 0.754 (95% CI 0.764-0.788) using data up to the patient's first trimester. Nurses from the care management program expressed a preference for the proposed models over existing approaches. The proposed model outperformed commonly used claim codes for the identification of pregnant patients at the expense of a manageable false positive rate. Our complication-risk classifier shows that we can accurately triage patients by risk of complication.
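As a hedged sketch of the general modeling setup (synthetic data and hypothetical indicator features, not the study's actual features or the IBC cohort), an L1-regularized ("Lasso") logistic classifier over claims-derived features might look like:

# Illustrative L1-regularized ("Lasso") classifier over claims-style features.
# Features, labels, and data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_members, n_codes = 5000, 300          # members x claim-code indicator features
X = (rng.random((n_members, n_codes)) < 0.05).astype(float)
y = (X[:, :5].sum(axis=1) + rng.normal(0, 0.5, n_members) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"held-out AUC on synthetic data: {auc:.3f}")
print(f"nonzero coefficients: {(clf.coef_ != 0).sum()} of {n_codes}")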
When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming
Mozannar, Hussein, Bansal, Gagan, Fourney, Adam, Horvitz, Eric
AI-powered code-recommendation systems, such as Copilot and CodeWhisperer, provide code suggestions inside a programmer's environment (e.g., an IDE) with the aim of improving their productivity. Since programmers accept and reject suggestions in these scenarios, such a system should ideally use this feedback in furtherance of that goal. In this work, we leverage prior data of programmers interacting with GitHub Copilot, a system used by millions of programmers, to develop interventions that can save programmer time. We propose a utility-theory framework that models this interaction with programmers and decides which suggestions to display. Our framework, Conditional suggestion Display from Human Feedback (CDHF), relies on a cascade of models that predict suggestion acceptance to selectively hide suggestions, reducing both latency and programmer verification time. Using data from 535 programmers, we perform a retrospective evaluation of CDHF and show that we can avoid displaying a significant fraction of suggestions that would have been rejected, doing so without total knowledge of the suggestions themselves. Through ablations on user-study data, we further demonstrate the importance of incorporating the programmer's latent, unobserved state when deciding whether to display suggestions. Finally, we show that using suggestion acceptance as a reward signal for deciding which suggestions to display leads to reduced-quality suggestions, indicating an unexpected pitfall.
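The snippet below sketches the general flavor of such a display policy, not the CDHF implementation: a cheap model scores the context first, a second model scores the generated suggestion, and the suggestion is shown only if the predicted acceptance probability clears a threshold. The models, features, and thresholds are placeholders.

# Illustrative cascaded display policy for code suggestions: hide suggestions
# predicted to be rejected. Not CDHF; models and thresholds are placeholders.
from typing import Callable

def make_display_policy(
    context_scorer: Callable[[str], float],          # cheap: uses only the context
    suggestion_scorer: Callable[[str, str], float],  # uses context + generated code
    low: float = 0.2,
    high: float = 0.6,
) -> Callable[[str, str], bool]:
    def should_display(context: str, suggestion: str) -> bool:
        p_context = context_scorer(context)
        if p_context < low:            # confident rejection from context alone
            return False
        p_full = suggestion_scorer(context, suggestion)
        return p_full >= high          # show only likely-to-be-accepted suggestions
    return should_display

if __name__ == "__main__":
    # Toy scorers standing in for learned acceptance-prediction models.
    policy = make_display_policy(
        context_scorer=lambda ctx: 0.8 if "def " in ctx else 0.1,
        suggestion_scorer=lambda ctx, s: 0.9 if len(s) < 200 else 0.3,
    )
    print(policy("def add(a, b):", "    return a + b"))   # True
    print(policy("# random comment", "x = 1"))            # False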