AITopics | Schlangen, David

Collaborating Authors

Schlangen, David

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Can Visual Dialogue Models Do Scorekeeping? Exploring How Dialogue Representations Incrementally Encode Shared Knowledge

Madureira, Brielen, Schlangen, David

arXiv.org Artificial IntelligenceFeb-25-2025

Cognitively plausible visual dialogue models should keep a mental scoreboard of shared established facts in the dialogue context. We propose a theory-based evaluation method for investigating to what degree models pretrained on the VisDial dataset incrementally build representations that appropriately do scorekeeping. Our conclusion is that the ability to make the distinction between shared and privately known statements along the dialogue is moderately present in the analysed models, but not always incrementally consistent, which may partially be due to the limited need for grounding interactions in the original task.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2204.0697

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests

Momentè, Filippo, Suglia, Alessandro, Giulianelli, Mario, Ferrari, Ambra, Koller, Alexander, Lemon, Oliver, Schlangen, David, Fernández, Raquel, Bernardi, Raffaella

arXiv.org Artificial IntelligenceFeb-20-2025

We examine three evaluation paradigms: large question-answering benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two-benchmarks or games-is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.

benchmark, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2502.14359

Country:

Asia (0.93)
North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Add feedback

Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models

Hakimov, Sherzod, Pfennigschmidt, Lara, Schlangen, David

arXiv.org Artificial IntelligenceFeb-17-2025

This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete words, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to find out factors affecting their performance. The evaluation reveals details about their strategies, challenging cases, and limitations of LLMs.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2502.11707

Country:

North America > United States (0.46)
Europe > Austria (0.29)
Europe > Germany (0.28)
North America > Mexico (0.28)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Plant in Cupboard, Orange on Table, Book on Shelf. Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment

Jordan, Jonathan, Hakimov, Sherzod, Schlangen, David

arXiv.org Artificial IntelligenceFeb-17-2025

Large language models (LLMs) have risen to prominence as 'chatbots' for users to interact via natural language. However, their abilities to capture common-sense knowledge make them seem promising as language-based planners of situated or embodied action as well. We have implemented a simple text-based environment -- similar to others that have before been used for reinforcement-learning of agents -- that simulates, very abstractly, a household setting. We use this environment and the detailed error-tracking capabilities we implemented for targeted benchmarking of LLMs on the problem of practical reasoning: Going from goals and observations to actions. Our findings show that environmental complexity and game restrictions hamper performance, and concise action planning is demanding for current LLMs.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.11733

Country: Europe > Germany (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Incremental Dialogue Management: Survey, Discussion, and Implications for HRI

Kennington, Casey, Lison, Pierre, Schlangen, David

arXiv.org Artificial IntelligenceJan-1-2025

Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems, survey incremental modeling of important aspects of dialogue like speech recognition and language generation. Primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and the implications of incremental dialogue for embodied, robotic platforms.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.00953

Country:

Asia (1.00)
North America > United States (0.46)
Europe > Germany (0.28)

Genre: Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.93)

Add feedback

LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation

Schlangen, David

arXiv.org Artificial IntelligenceJul-18-2024

Natural Language Processing has moved rather quickly from modelling specific tasks to taking more general pre-trained models and fine-tuning them for specific tasks, to a point where we now have what appear to be inherently generalist models. This paper argues that the resultant loss of clarity on what these models model leads to metaphors like "artificial general intelligences" that are not helpful for evaluating their strengths and weaknesses. The proposal is to see their generality, and their potential value, in their ability to approximate specialist function, based on a natural language specification. This framing brings to the fore questions of the quality of the approximation, but beyond that, also questions of discoverability, stability, and protectability of these functions. As the paper will show, this framing hence brings together in one conceptual framework various aspects of evaluation, both from a practical and a theoretical perspective, as well as questions often relegated to a secondary status (such as "prompt injection" and "jailbreaking").

large language model, natural language, source document, (18 more...)

arXiv.org Artificial Intelligence

2407.13744

Country: North America > United States > Massachusetts (0.14)

Genre:

Research Report (0.70)
Overview (0.46)

Industry: Banking & Finance > Economy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)

Add feedback

A Dialogue Game for Eliciting Balanced Collaboration

Jeknić, Isidora, Schlangen, David, Koller, Alexander

arXiv.org Artificial IntelligenceJul-11-2024

Collaboration is an integral part of human dialogue. Typical task-oriented dialogue games assign asymmetric roles to the participants, which limits their ability to elicit naturalistic role-taking in collaboration and its negotiation. We present a novel and simple online setup that favors balanced collaboration: a two-player 2D object placement game in which the players must negotiate the goal state themselves. We show empirically that human players exhibit a variety of role distributions, and that balanced collaboration improves task performance. We also present an LLM-based baseline agent which demonstrates that automatic playing of our game is an interesting challenge for artificial systems.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.08202

Country:

Europe > Germany > Saarland (0.15)
Europe > United Kingdom > England (0.14)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)

Add feedback

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Bavaresco, Anna, Bernardi, Raffaella, Bertolazzi, Leonardo, Elliott, Desmond, Fernández, Raquel, Gatt, Albert, Ghaleb, Esam, Giulianelli, Mario, Hanna, Michael, Koller, Alexander, Martins, André F. T., Mondorf, Philipp, Neplenbroek, Vera, Pezzelle, Sandro, Plank, Barbara, Schlangen, David, Suglia, Alessandro, Surikuchi, Aditya K, Takmaz, Ece, Testoni, Alberto

arXiv.org Artificial IntelligenceJun-26-2024

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2406.18403

Country:

Asia (0.68)
Europe > Middle East > Malta (0.14)
North America > United States > Minnesota (0.14)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft

Kranti, Chalamalasetti, Hakimov, Sherzod, Schlangen, David

arXiv.org Artificial IntelligenceJun-25-2024

In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs' in-context learning abilities, we use few-shot prompting techniques, that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.17553

Country:

Europe (1.00)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre:

Research Report (0.50)
Overview (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

Bhavsar, Nidhir, Jordan, Jonathan, Hakimov, Sherzod, Schlangen, David

arXiv.org Artificial IntelligenceJun-20-2024

What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possible due to unexposed sampling parameters, and a, very welcome, performance stability against at least moderate weight quantisation during inference.

huggingface, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2406.14051

Country: Europe > Germany (0.28)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.78)

Add feedback