

Speaker-Follower Models for Vision-and-Language Navigation

Neural Information Processing Systems

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
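The pragmatic reasoning step described above can be sketched as a rescoring procedure: the follower proposes candidate action sequences, and the speaker model scores how well each candidate explains the instruction. The sketch below is illustrative, not the paper's implementation; the function names, the stub scores, and the interpolation weight `lam` are all assumptions.

```python
import math

def pragmatic_rescore(candidates, speaker_logprob, lam=0.5):
    """Pick the candidate route maximizing a weighted combination of
    follower and speaker log-probabilities.

    candidates: list of (route, follower_logprob) pairs
    speaker_logprob: function route -> log P(instruction | route)
    lam: interpolation weight on the speaker score (assumed form)
    """
    best_route, best_score = None, -math.inf
    for route, follower_lp in candidates:
        # Combine the follower's confidence in the route with the
        # speaker's judgment of how well the route explains the instruction.
        score = (1 - lam) * follower_lp + lam * speaker_logprob(route)
        if score > best_score:
            best_route, best_score = route, score
    return best_route

# Toy usage: two candidate routes with stub log-probabilities.
cands = [(["forward", "left"], -1.2), (["forward", "right"], -1.0)]
stub_speaker = lambda r: -0.5 if "left" in r else -3.0
print(pragmatic_rescore(cands, stub_speaker))  # ['forward', 'left']
```

The key design choice is that the speaker score can overrule the follower's own preference when a slightly less likely route explains the instruction much better.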









Success and Cost Elicit Convention Formation for Efficient Communication

Vaduguru, Saujas, Hua, Yilun, Artzi, Yoav, Fried, Daniel

arXiv.org Artificial Intelligence

Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient: both are necessary to elicit convention formation.
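The finding that success or cost alone is insufficient suggests a training signal with two terms: reward communicative success, penalize message length. The sketch below is a hypothetical illustration of such a combined signal; the linear form, the per-token cost, and the weight `cost_weight` are assumptions, not the paper's objective.

```python
def reward(success: bool, message_len: int, cost_weight: float = 0.1) -> float:
    """Combine task success with a per-token message cost.

    A success-only signal (cost_weight=0) never pressures the speaker to
    shorten messages; a cost-only signal rewards silence regardless of
    whether the listener understood. Both terms together favor short
    messages that still succeed.
    """
    return (1.0 if success else 0.0) - cost_weight * message_len

# A short successful message beats a long successful one...
print(reward(True, 2), reward(True, 8))
# ...but a failed message is worse than either.
print(reward(False, 2))
```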



Speaker effects in spoken language comprehension

Wu, Hanlin, Cai, Zhenguang G.

arXiv.org Artificial Intelligence

The identity of a speaker significantly influences spoken language comprehension by affecting both perception and expectation. This review explores speaker effects, focusing on how speaker information impacts language processing. We propose an integrative model featuring the interplay between bottom-up perception-based processes driven by acoustic details and top-down expectation-based processes driven by a speaker model. The acoustic details influence lower-level perception, while the speaker model modulates both lower-level and higher-level processes such as meaning interpretation and pragmatic inferences. We define speaker-idiosyncrasy and speaker-demographics effects and demonstrate how bottom-up and top-down processes interact at various levels in different scenarios. This framework contributes to psycholinguistic theory by offering a comprehensive account of how speaker information interacts with linguistic content to shape message construction. We suggest that speaker effects can serve as indices of a language learner's proficiency and an individual's characteristics of social cognition. We encourage future research to extend these findings to AI speakers, probing the universality of speaker effects across humans and artificial agents.


Reviews: Speaker-Follower Models for Vision-and-Language Navigation

Neural Information Processing Systems

This paper builds upon the indoor vision- and language-grounded navigation task and sequence-to-sequence model described in (Anderson et al., 2017), by introducing three improvements: 1) An encoder-decoder-like architecture, dubbed the "speaker-follower" model, that not only decodes natural language instructions into a sequence of navigation actions using seq2seq, but also decodes a sequence of navigation actions and image features into a sequence of natural language instructions using a symmetric seq2seq. That speaker model can then be used to score candidate routes (i.e., candidate sequences of images and actions) with respect to the likelihood of the natural language instruction under the speaker model. This enables a form of planning for the seq2seq-based agent. The panoramic image and motion space are decomposed into 12 yaw and 3 pitch angles. The authors achieve state-of-the-art performance on the task and do a good ablation analysis of the impacts of their three improvements, although I would have liked to see navigation attention maps in the appendix as well.
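The discretization the review mentions (12 yaw by 3 pitch angles) can be enumerated in a few lines. The function below is a minimal sketch of that decomposition; the specific pitch values and degree-based naming are assumptions for illustration.

```python
def panoramic_views(n_yaw=12, pitches=(-30, 0, 30)):
    """Enumerate (yaw_degrees, pitch_degrees) view directions.

    With 12 yaw headings (30 degrees apart) and 3 pitch angles, a panorama
    is covered by 36 discrete view directions, each a candidate target for
    a single navigation action.
    """
    return [(yaw_idx * 360 // n_yaw, pitch)
            for pitch in pitches
            for yaw_idx in range(n_yaw)]

views = panoramic_views()
print(len(views))  # 36
```

Acting over these coarse view directions, rather than low-level turn/step commands, matches the granularity at which human instructions refer to the scene ("turn left toward the stairs").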


Grounding Language in Multi-Perspective Referential Communication

Tang, Zineng, Mao, Lingjun, Suhr, Alane

arXiv.org Artificial Intelligence

We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9 to 69.3% in communicative success and even outperforming the strongest proprietary model.

Figure 1: Example scene from our environment and dataset. The center image shows the speaker on the left and the listener on the right with their respective fields of view (FOV). The speaker refers to the target object,