

Speaker-Follower Models for Vision-and-Language Navigation

Neural Information Processing Systems

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
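The pragmatic reasoning step described above can be sketched as a rescoring procedure: the follower proposes candidate action sequences, and the speaker model scores how well each candidate explains the instruction. The sketch below is illustrative, not the paper's implementation; the function names, the stub scores, and the interpolation weight `lam` are all assumptions.

```python
import math

def pragmatic_rescore(candidates, speaker_logprob, lam=0.5):
    """Pick the candidate route maximizing a weighted combination of
    follower and speaker log-probabilities.

    candidates: list of (route, follower_logprob) pairs
    speaker_logprob: function route -> log P(instruction | route)
    lam: interpolation weight on the speaker score (assumed form)
    """
    best_route, best_score = None, -math.inf
    for route, follower_lp in candidates:
        # Combine the follower's confidence in the route with the
        # speaker's judgment of how well the route explains the instruction.
        score = (1 - lam) * follower_lp + lam * speaker_logprob(route)
        if score > best_score:
            best_route, best_score = route, score
    return best_route

# Toy usage: two candidate routes with stub log-probabilities.
cands = [(["forward", "left"], -1.2), (["forward", "right"], -1.0)]
stub_speaker = lambda r: -0.5 if "left" in r else -3.0
print(pragmatic_rescore(cands, stub_speaker))  # ['forward', 'left']
```

The key design choice is that the speaker score can overrule the follower's own preference when a slightly less likely route explains the instruction much better.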









Success and Cost Elicit Convention Formation for Efficient Communication

Vaduguru, Saujas, Hua, Yilun, Artzi, Yoav, Fried, Daniel

arXiv.org Artificial Intelligence

Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient: both are necessary to elicit convention formation.
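The finding that success or cost alone is insufficient suggests a training signal with two terms: reward communicative success, penalize message length. The sketch below is a hypothetical illustration of such a combined signal; the linear form, the per-token cost, and the weight `cost_weight` are assumptions, not the paper's objective.

```python
def reward(success: bool, message_len: int, cost_weight: float = 0.1) -> float:
    """Combine task success with a per-token message cost.

    A success-only signal (cost_weight=0) never pressures the speaker to
    shorten messages; a cost-only signal rewards silence regardless of
    whether the listener understood. Both terms together favor short
    messages that still succeed.
    """
    return (1.0 if success else 0.0) - cost_weight * message_len

# A short successful message beats a long successful one...
print(reward(True, 2), reward(True, 8))
# ...but a failed message is worse than either.
print(reward(False, 2))
```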



Speaker effects in spoken language comprehension

Wu, Hanlin, Cai, Zhenguang G.

arXiv.org Artificial Intelligence

The identity of a speaker significantly influences spoken language comprehension by affecting both perception and expectation. This review explores speaker effects, focusing on how speaker information impacts language processing. We propose an integrative model featuring the interplay between bottom-up perception-based processes driven by acoustic details and top-down expectation-based processes driven by a speaker model. The acoustic details influence lower-level perception, while the speaker model modulates both lower-level and higher-level processes such as meaning interpretation and pragmatic inferences. We define speaker-idiosyncrasy and speaker-demographics effects and demonstrate how bottom-up and top-down processes interact at various levels in different scenarios. This framework contributes to psycholinguistic theory by offering a comprehensive account of how speaker information interacts with linguistic content to shape message construction. We suggest that speaker effects can serve as indices of a language learner's proficiency and an individual's characteristics of social cognition. We encourage future research to extend these findings to AI speakers, probing the universality of speaker effects across humans and artificial agents.


Reviews: Speaker-Follower Models for Vision-and-Language Navigation

Neural Information Processing Systems

This paper builds upon the indoor vision- and language-grounded navigation task and sequence-to-sequence model described in (Anderson et al., 2017), by introducing three improvements: 1) An encoder-decoder-like architecture, dubbed the "speaker-follower" model, that not only decodes natural language instructions into a sequence of navigation actions using seq2seq, but also decodes a sequence of navigation actions and image features into a sequence of natural language instructions using a symmetric seq2seq. That speaker model can then be used to score candidate routes (i.e., candidate sequences of images and actions) with respect to the likelihood of the natural language instruction under the speaker model. This enables a form of planning for the seq2seq-based agent. The panoramic image and motion space are decomposed into 12 yaw and 3 pitch angles. The authors achieve state-of-the-art performance on the task and do a good ablation analysis of the impacts of their three improvements, although I would have liked to see navigation attention maps in the appendix as well.
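The discretization the review mentions (12 yaw by 3 pitch angles) can be enumerated in a few lines. The function below is a minimal sketch of that decomposition; the specific pitch values and degree-based naming are assumptions for illustration.

```python
def panoramic_views(n_yaw=12, pitches=(-30, 0, 30)):
    """Enumerate (yaw_degrees, pitch_degrees) view directions.

    With 12 yaw headings (30 degrees apart) and 3 pitch angles, a panorama
    is covered by 36 discrete view directions, each a candidate target for
    a single navigation action.
    """
    return [(yaw_idx * 360 // n_yaw, pitch)
            for pitch in pitches
            for yaw_idx in range(n_yaw)]

views = panoramic_views()
print(len(views))  # 36
```

Acting over these coarse view directions, rather than low-level turn/step commands, matches the granularity at which human instructions refer to the scene ("turn left toward the stairs").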


Grounding Language in Multi-Perspective Referential Communication

Tang, Zineng, Mao, Lingjun, Suhr, Alane

arXiv.org Artificial Intelligence

We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9 to 69.3% in communicative success and even outperforming the strongest proprietary model.

Figure 1: Example scene from our environment and dataset. The center image shows the speaker on the left and the listener on the right with their respective fields of view (FOV). The speaker refers to the target object,