Chatting Makes Perfect: Chat-based Image Retrieval Supplementary Material

Neural Information Processing Systems

In Appendix A, we start by showing more qualitative results of chats and their retrieval results, and compare BLIP2 chats with a human answerer. Next, in Appendix B, we present the few-shot instructional prompts used by the different LLMs to generate follow-up questions. Another example, in Figure 2, describes two trains retrieved with the text query "A train that is parked next to another train". Figure 3 demonstrates a case where the description "a small and dirty kitchen with pots and food everywhere" is ambiguous and subjective, and may match many images in the corpus. In Figure 4 we show an example of a dialogue between ChatIR and a human.
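The retrieval loop described here can be sketched in a few lines: a questioner asks follow-up questions, the answers are appended to the dialogue, and the growing dialogue is used as the query. The sketch below is illustrative only, assuming a toy bag-of-words embedding in place of the neural text/image encoders (e.g., BLIP2) and LLM questioner used in the actual system; all function names are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding (stand-in for a neural encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(dialogue, corpus):
    """Rank corpus captions by similarity to the concatenated dialogue."""
    query = embed(" ".join(dialogue))
    return sorted(corpus, key=lambda c: cosine(query, embed(c)), reverse=True)

# Each round, a questioner (an LLM in the chat-based setting) asks a
# follow-up question; the user's answer is appended, refining the query.
corpus = ["a train parked next to another train at a station",
          "a small dirty kitchen with pots everywhere",
          "a dog running on the beach"]
dialogue = ["A train that is parked next to another train"]
print(retrieve(dialogue, corpus)[0])
```

Appending an answer such as "the trains are red" to `dialogue` and calling `retrieve` again illustrates how each chat round can sharpen the ranking.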







The Danger of Reducing America's Venezuela Invasion to a 60-Second Video

WIRED

January 3 marked the return of US military intervention in Latin America. While the events unfolded between Caracas and Brooklyn, social networks had already fabricated their own reality. (Photo caption: A fire is seen in the distance at Fort Tiuna, Venezuela's largest military complex, following a series of explosions in Caracas on January 3, 2026.) Geopolitics is being reduced to videos lasting just a few minutes. Social media has surpassed traditional media, not only in the speed with which content is created and shared, but also in its ability to frame our reality. People have the illusion of knowing what is happening and why within just a few hours, or less, of major world events. But reality is more complicated.


Robohub highlights 2025

Robohub

Over the course of the year, we've had the pleasure of working with many talented researchers from across the globe. As 2025 draws to a close, we take a look back at some of the excellent blog posts, interviews and podcasts from our contributors. We spoke to Jiahui Zhang and Jesse Zhang about their framework for learning robot manipulation tasks solely from language instructions, without per-task demonstrations. Hui Zhang writes about work presented at CoRL 2025 on RobustDexGrasp, a novel framework that tackles different grasping challenges with targeted solutions. In this podcast from AAAI, host Ella Lan asked Professor Marynel Vázquez about what inspired her research direction, how her perspective on human-robot interactions has changed over time, robots navigating the social world, and more.


SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Neural Information Processing Systems

Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they ignore the unique challenges of spoken conversation. To tackle these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g., ChatGPT. The results show that current models still have substantial room for improvement in spoken conversation, where the most advanced dialogue state tracker only achieves 25.65% joint goal accuracy and the SOTA end-to-end model only correctly completes the user request in 52.1% of dialogues.
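The joint goal accuracy figure quoted above is the standard dialogue state tracking metric: a turn counts as correct only if the entire predicted belief state exactly matches the gold state, making it much stricter than per-slot accuracy. A minimal sketch, with illustrative slot names not taken from the SpokenWOZ codebase:

```python
def joint_goal_accuracy(predictions, golds):
    """Fraction of turns whose predicted belief state exactly matches gold.

    predictions, golds: lists of {slot: value} dicts, one dict per turn.
    A single wrong, missing, or extra slot makes the whole turn incorrect.
    """
    correct = sum(pred == gold for pred, gold in zip(predictions, golds))
    return correct / len(golds) if golds else 0.0

# Hypothetical 3-turn dialogue: the last turn has one wrong slot value,
# so only 2 of 3 turns count as fully correct.
preds = [{"hotel-area": "north"},
         {"hotel-area": "north", "hotel-stars": "4"},
         {"hotel-area": "south", "hotel-stars": "4"}]
golds = [{"hotel-area": "north"},
         {"hotel-area": "north", "hotel-stars": "4"},
         {"hotel-area": "north", "hotel-stars": "4"}]
print(joint_goal_accuracy(preds, golds))
```

This all-or-nothing matching per turn is why trackers that do well on written benchmarks can drop sharply on spoken data, where a single ASR-induced slot error fails the whole turn.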