AITopics | Hardy, Amelia

Collaborating Authors

Hardy, Amelia

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

More than Marketing? On the Information Value of AI Benchmarks for Practitioners

Hardy, Amelia, Reuel, Anka, Meimandi, Kiana Jafari, Soder, Lisa, Griffith, Allie, Asmar, Dylan M., Koyejo, Sanmi, Bernstein, Michael S., Kochenderfer, Mykel J.

arXiv.org Artificial IntelligenceDec-6-2024

Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.

benchmark, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.0552

Country: North America > United States > California (0.29)

Genre:

Questionnaire & Opinion Survey (1.00)
Personal > Interview (1.00)
Research Report > New Finding (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

Add feedback

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Reuel, Anka, Hardy, Amelia, Smith, Chandler, Lamparth, Max, Hardy, Malcolm, Kochenderfer, Mykel J.

arXiv.org Artificial IntelligenceNov-19-2024

AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at betterbench.stanford.edu.

benchmark, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2411.1299

Country: North America > United States > California > Santa Clara County > Palo Alto (0.24)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Law (1.00)
Government (1.00)
Information Technology > Security & Privacy (0.92)
(2 more...)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Security & Privacy (0.92)
(2 more...)

Add feedback

Evaluating Human-Language Model Interaction

Lee, Mina, Srivastava, Megha, Hardy, Amelia, Thickstun, John, Durmus, Esin, Paranjape, Ashwin, Gerard-Ursin, Ines, Li, Xiang Lisa, Ladhak, Faisal, Rong, Frieda, Wang, Rose E., Kwon, Minae, Park, Joon Sung, Cao, Hancheng, Lee, Tony, Bommasani, Rishi, Bernstein, Michael, Liang, Percy

arXiv.org Artificial IntelligenceJan-5-2024

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2212.09746

Country:

North America > United States (1.00)
Europe > United Kingdom > England (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Leisure & Entertainment > Games (1.00)
Law Enforcement & Public Safety (1.00)
Education (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.93)
(2 more...)

Add feedback

Neural Generation Meets Real People: Towards Emotionally Engaging Mixed-Initiative Conversations

Paranjape, Ashwin, See, Abigail, Kenealy, Kathleen, Li, Haojun, Hardy, Amelia, Qi, Peng, Sadagopan, Kaushik Ram, Phu, Nguyet Minh, Soylu, Dilara, Manning, Christopher D.

arXiv.org Artificial IntelligenceSep-5-2020

Building an open-domain socialbot that talks to real people is challenging - such a system must meet multiple user expectations such as broad world knowledge, conversational style, and emotional connection. Our socialbot engages users on their terms - prioritizing their interests, feelings and autonomy. As a result, our socialbot provides a responsive, personalized user experience, capable of talking knowledgeably about a wide variety of topics, as well as chatting empathetically about ordinary life. Neural generation plays a key role in achieving these goals, providing the backbone for our conversational and emotional tone. At the end of the competition, Chirpy Cardinal progressed to the finals with an average rating of 3.6/5.0,

human computer interaction, immunology, utterance, (22 more...)

arXiv.org Artificial Intelligence

2008.12348

Country: North America > United States (0.67)

Genre:

Personal > Interview (1.00)
Research Report > New Finding (0.93)

Industry:

Leisure & Entertainment (1.00)
Health & Medicine > Therapeutic Area (0.68)
Media > Film (0.68)
Media > Music (0.46)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback