Goto

Collaborating Authors

 Kennington, Casey


Incremental Dialogue Management: Survey, Discussion, and Implications for HRI

arXiv.org Artificial Intelligence

Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems, survey incremental modeling of important aspects of dialogue like speech recognition and language generation. Primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and the implications of incremental dialogue for embodied, robotic platforms.


Renaissance: Investigating the Pretraining of Vision-Language Encoders

arXiv.org Artificial Intelligence

In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.


Unsupervised, Bottom-up Category Discovery for Symbol Grounding with a Curious Robot

arXiv.org Artificial Intelligence

Towards addressing the Symbol Grounding Problem and motivated by early childhood language development, we leverage a robot which has been equipped with an approximate model of curiosity with particular focus on bottom-up building of unsupervised categories grounded in the physical world. That is, rather than starting with a top-down symbol (e.g., a word referring to an object) and providing meaning through the application of predetermined samples, the robot autonomously and gradually breaks up its exploration space into a series of increasingly specific unlabeled categories at which point an external expert may optionally provide a symbol association. We extend prior work by using a robot that can observe the visual world, introducing a higher dimensional sensory space, and using a more generalizable method of category building. Our experiments show that the robot learns categories based on actions and what it visually observes, and that those categories can be symbolically grounded into.https://info.arxiv.org/help/prep#comments


Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community

arXiv.org Artificial Intelligence

The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first focused on education, the second on benchmarks, and the third on the modeling of language when it comes to spoken interaction with robots. The three proposals should act as white papers for any researcher to take and build upon.


Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning

arXiv.org Artificial Intelligence

As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms; pre-trained language models' fine-tuning and zero-shot/few-shot classifications using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task.


On the Computational Modeling of Meaning: Embodied Cognition Intertwined with Emotion

arXiv.org Artificial Intelligence

How can machines understand language? is a question that many have asked, and represents an important facet of artificial intelligence. Large language models like ChatGPT seem to understand language, but as has been pointed out (Bender and Koller, 2020; Bisk et al., 2020), even large, powerful language models trained on huge amounts of data are likely missing key information to allow them to reach the depth of understanding that humans have. What information are they missing, and, perhaps more importantly, what information do they have that enables them to understand, to the degree that they do? Current computational models of semantic meaning can be broken down into three paradigms: distributional paradigms where meaning is derived from how words are used in text (i.e., the notion that the meaning of a word depends on the "company it keeps," following Firth (1957)) meaningfulness of language lies in the fact that it is about the world (Dahlgren, 1976) and grounded paradigms are where aspects of the physical world are linked to language (i.e., the symbol grounding problem following Harnad (1990)) formal paradigms where meaning is a logical form (e.g., first order logic as in L.T.F.


Vision Language Transformers: A Survey

arXiv.org Artificial Intelligence

Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in \citet{vaswani2017attention} to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining models on a large generic datasets and transferring their learning to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks which require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations and some open questions that remain.


Who's in Charge? Roles and Responsibilities of Decision-Making Components in Conversational Robots

arXiv.org Artificial Intelligence

Software architectures for conversational robots typically consist of multiple modules, each designed for a particular processing task or functionality. Some of these modules are developed for the purpose of making decisions about the next action that the robot ought to perform in the current context. Those actions may relate to physical movements, such as driving forward or grasping an object, but may also correspond to communicative acts, such as asking a question to the human user. In this position paper, we reflect on the organization of those decision modules in human-robot interaction platforms. We discuss the relative benefits and limitations of modular vs. end-to-end architectures, and argue that, despite the increasing popularity of end-to-end approaches, modular architectures remain preferable when developing conversational robots designed to execute complex tasks in collaboration with human users. We also show that most practical HRI architectures tend to be either robot-centric or dialogue-centric, depending on where developers wish to place the ``command center'' of their system. While those design choices may be justified in some application domains, they also limit the robot's ability to flexibly interleave physical movements and conversational behaviours. We contend that architectures placing ``action managers'' and ``interaction managers'' on an equal footing may provide the best path forward for future human-robot interaction systems.


Evaluating Automatic Speech Recognition in an Incremental Setting

arXiv.org Artificial Intelligence

The increasing reliability of automatic speech recognition has proliferated its everyday use. However, for research purposes, it is often unclear which model one should choose for a task, particularly if there is a requirement for speed as well as accuracy. In this paper, we systematically evaluate six speech recognizers using metrics including word error rate, latency, and the number of updates to already recognized words on English test data, as well as propose and compare two methods for streaming audio into recognizers for incremental recognition. We further propose Revokes per Second as a new metric for evaluating incremental recognition and demonstrate that it provides insights into overall model performance. We find that, generally, local recognizers are faster and require fewer updates than cloud-based recognizers. Finally, we find Meta's Wav2Vec model to be the fastest, and find Mozilla's DeepSpeech model to be the most stable in its predictions.


Conversational Agents and Children: Let Children Learn

arXiv.org Artificial Intelligence

Using online information discovery as a case study, in this position paper we discuss the need to design, develop, and deploy (conversational) agents that can -- non-intrusively -- guide children in their quest for online resources rather than simply finding resources for them. We argue that agents should "let children learn" and should be built to take on a teacher-facilitator function, allowing children to develop their technical and critical thinking abilities as they interact with varied technology in a broad range of use cases.