AITopics | Klein, Dan

Collaborating Authors

Klein, Dan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Why Do Multi-Agent LLM Systems Fail?

Cemri, Mert, Pan, Melissa Z., Yang, Shuyi, Agrawal, Lakshya A., Chopra, Bhavya, Tiwari, Rishabh, Keutzer, Kurt, Parameswaran, Aditya, Klein, Dan, Ramchandran, Kannan, Zaharia, Matei, Gonzalez, Joseph E., Stoica, Ion

arXiv.org Artificial IntelligenceMar-17-2025

Despite growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM agents collaborate to accomplish tasks, their performance gains across popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges hindering MAS effectiveness. In this paper, we present the first comprehensive study of MAS challenges. We analyze five popular MAS frameworks across over 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy applicable to various MAS frameworks. This taxonomy emerges iteratively from agreements among three expert annotators per study, achieving a Cohen's Kappa score of 0.88. These fine-grained failure modes are organized into 3 categories, (i) specification and system design failures, (ii) inter-agent misalignment, and (iii) task verification and termination. To support scalable evaluation, we integrate MASFT with LLM-as-a-Judge. We also explore if identified failures could be easily prevented by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings reveal that identified failures require more complex solutions, highlighting a clear roadmap for future research. We open-source our dataset and LLM annotator.

artificial intelligence, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2503.13657

Country: North America (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.67)
Information Technology (0.67)
Leisure & Entertainment > Games (0.46)
Media > Music (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Enough Coin Flips Can Make LLMs Act Bayesian

Gupta, Ritwik, Corona, Rodolfo, Ge, Jiaxin, Wang, Eric, Klein, Dan, Darrell, Trevor, Chan, David M.

arXiv.org Artificial IntelligenceMar-6-2025

Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2503.04722

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Government > Regional Government (0.46)

Add feedback

LangProBe: a Language Programs Benchmark

Tan, Shangyin, Agrawal, Lakshya A, Singhvi, Arnav, Lai, Liheng, Ryan, Michael J, Klein, Dan, Khattab, Omar, Sen, Koushik, Zaharia, Matei

arXiv.org Artificial IntelligenceFeb-27-2025

Composing language models (LMs) into multi-step language programs and automatically optimizing their modular prompts is now a mainstream paradigm for building AI systems, but the tradeoffs in this space have only scarcely been studied before. We introduce LangProBe, the first large-scale benchmark for evaluating the architectures and optimization strategies for language programs, with over 2000 combinations of tasks, architectures, optimizers, and choices of LMs. Using LangProBe, we are the first to study the impact of program architectures and optimizers (and their compositions together and with different models) on tradeoffs of quality and cost. We find that optimized language programs offer strong cost--quality Pareto improvement over raw calls to models, but simultaneously demonstrate that human judgment (or empirical decisions) about which compositions to pursue is still necessary for best performance. We will open source the code and evaluation data for LangProBe.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2502.20315

Country: North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Predicting Emergent Capabilities by Finetuning

Snell, Charlie, Wallace, Eric, Klein, Dan, Levine, Sergey

arXiv.org Artificial IntelligenceNov-24-2024

A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable -- sometimes even exhibiting emergent jumps -- which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., "emergence laws"). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2411.16035

Country:

Europe (0.28)
North America > United States > Wisconsin (0.14)
North America > United States > California (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Efficacy of Language Model Self-Play in Non-Zero-Sum Games

Liao, Austen, Tomlin, Nicholas, Klein, Dan

arXiv.org Artificial IntelligenceJun-26-2024

Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives. Contrary to expectations, we find that language model self-play leads to significant performance gains in both cooperation and competition with humans, suggesting that self-play and related techniques have promise despite a lack of theoretical guarantees.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2406.18872

Country:

Europe (0.68)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Leisure & Entertainment > Games > Go (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Fleisig, Eve, Smith, Genevieve, Bossi, Madeline, Rustagi, Ishita, Yin, Xavier, Klein, Dan

arXiv.org Artificial IntelligenceJun-13-2024

We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: lack of comprehension (10% worse compared to "standard" varieties), stereotyping (16% worse), demeaning content (22% worse), and condescending responses (12% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but it also results in a marked increase in stereotyping (+17%). The results suggest that GPT-3.5 Turbo and GPT-4 exhibit linguistic discrimination in ways that can exacerbate harms for speakers of non-"standard" varieties.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2406.08818

Country:

North America > United States (1.00)
Asia (0.93)
Africa (0.93)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.68)

Industry:

Law (0.67)
Government (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

American Sign Language Handshapes Reflect Pressures for Communicative Efficiency

Yin, Kayo, Regier, Terry, Klein, Dan

arXiv.org Artificial IntelligenceJun-10-2024

Communicative efficiency is a key topic in linguistics and cognitive psychology, with many studies demonstrating how the pressure to communicate with minimal effort guides the form of natural language. However, this phenomenon is rarely explored in signed languages. This paper shows how handshapes in American Sign Language (ASL) reflect these efficiency pressures and provides new evidence of communicative efficiency in the visual-gestural modality. We focus on hand configurations in native ASL signs and signs borrowed from English to compare efficiency pressures from both ASL and English usage. First, we develop new methodologies to quantify the articulatory effort needed to produce handshapes and the perceptual effort required to recognize them. Then, we analyze correlations between communicative effort and usage statistics in ASL or English. Our findings reveal that frequent ASL handshapes are easier to produce and that pressures for communicative efficiency mostly come from ASL usage, rather than from English lexical borrowing.

handshape, natural language, simulation of human behavior, (19 more...)

arXiv.org Artificial Intelligence

2406.04024

Country:

North America > United States (0.14)
Asia (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Curriculum > Subject-Specific Education (0.64)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.88)
Information Technology > Artificial Intelligence > Cognitive Science > Simulation of Human Behavior (0.34)

Add feedback

The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels

Fleisig, Eve, Blodgett, Su Lin, Klein, Dan, Talat, Zeerak

arXiv.org Artificial IntelligenceMay-9-2024

Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement--some challenged by perspectivist approaches, and some that remain to be addressed--as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2405.0586

Country:

Asia (0.93)
Europe > United Kingdom > England (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Industry: Education (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Pose Priors from Language Models

Subramanian, Sanjay, Ng, Evonne, Müller, Lea, Klein, Dan, Ginosar, Shiry, Darrell, Trevor

arXiv.org Artificial IntelligenceMay-6-2024

We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.

constraint, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2405.03689

Country:

Europe (0.28)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.56)

Add feedback

CLARINET: Augmenting Language Models to Ask Clarification Questions for Retrieval

Chi, Yizhou, Lin, Jessy, Lin, Kevin, Klein, Dan

arXiv.org Artificial IntelligenceApr-28-2024

Users often make ambiguous requests that require clarification. We study the problem of asking clarification questions in an information retrieval setting, where systems often face ambiguous search queries and it is challenging to turn the uncertainty in the retrieval model into a natural language question. We present CLARINET, a system that asks informative clarification questions by choosing questions whose answers would maximize certainty in the correct candidate. Our approach works by augmenting a large language model (LLM) to condition on a retrieval distribution, finetuning end-to-end to generate the question that would have maximized the rank of the true candidate at each turn. When evaluated on a real-world retrieval dataset of users searching for books, our system outperforms traditional heuristics such as information gain on retrieval success by 17% and vanilla-prompted LLMs by 39% relative.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2405.15784

Country: North America (0.14)

Genre: Research Report (0.64)

Industry: Media (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Add feedback