Tuning into the future of collaboration

MIT Technology Review

Intelligent audio and intuitive tools are transforming collaboration from simple connection to creativity, say Sam Sabet, chief technology officer at Shure, and Brendan Ittelson, chief ecosystem officer at Zoom. When work went remote, the sound of business changed. What began as a scramble to make home offices functional has evolved into a revolution in how people hear and are heard. From education to the enterprise, companies across industries have reimagined what clear, reliable communication can mean in a hybrid world. For major audio and communications companies like Shure and Zoom, that transformation has been powered by artificial intelligence, new acoustic technologies, and a shared mission: making connection effortless. Necessity during the pandemic compressed years of innovation into months. Audio and video that simply work are now the baseline for collaboration, says Ittelson. The expectation has shifted from connecting people to enhancing productivity and creativity across the entire ecosystem. Audio is a foundation for trust, understanding, and collaboration.


Inside the App Where Queer Gooners Run Free

WIRED

In light of Zoom crackdowns and Skype shutting down, Batemates has emerged as an alternative for "bators" who like masturbating together online. One night not long ago, Jaxon Roman sat naked in front of his laptop wearing only a pup hood as he masturbated with single-minded zeal to the attention of eight other men watching onscreen. It was a typical weekday for the 33-year-old Arlington, Virginia, program analyst. "When bros praise me and say they're enjoying [me], I get to that edge point so fast," Roman says. His favorite instances are "when they all come to what I'm doing." Sometimes, when he's feeling especially kinky, Roman, who is bisexual, likes to ask for permission before climaxing.



Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

Jiang, Zhiyuan, Xie, Shenghao, Li, Wenyi, Zu, Wenqiang, Li, Peihang, Qiu, Jiahao, Pei, Siqi, Ma, Lei, Huang, Tiejun, Wang, Mengdi, Liu, Shilong

arXiv.org Artificial Intelligence

Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
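The coarse-to-fine idea behind a zoom prior can be illustrated with a minimal sketch. The loop below is not the paper's ZoomClick algorithm; the `depth`, `shrink`, and `min_crop` parameters are illustrative stand-ins for the zoom properties the abstract names, and `predict` is a hypothetical placeholder for any grounding model that returns a click point in full-image coordinates.

```python
# Hypothetical sketch of iterative zoom for GUI grounding: crop around the
# model's current prediction, shrink the view, and re-query, so the model
# sees progressively higher-resolution context around the target element.

def zoom_ground(predict, width, height, depth=3, shrink=0.5, min_crop=200):
    """Return an (x, y) click point in full-image coordinates."""
    left, top, w, h = 0, 0, width, height
    x, y = predict(left, top, w, h)  # coarse guess on the full screenshot
    for _ in range(depth):
        new_w = max(int(w * shrink), min_crop)
        new_h = max(int(h * shrink), min_crop)
        if new_w >= w and new_h >= h:
            break  # crop can no longer shrink; stop zooming
        # Center the next crop on the current prediction, clamped to bounds.
        left = min(max(x - new_w // 2, 0), width - new_w)
        top = min(max(y - new_h // 2, 0), height - new_h)
        w, h = new_w, new_h
        x, y = predict(left, top, w, h)  # refined guess on the zoomed view
    return x, y
```

In practice the crop would be resized back up before re-querying the model; the sketch only tracks the window geometry and the coordinate mapping.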



Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

Yang, Jiashu, Han, Yifan, Xie, Yucheng, Guo, Ning, Lian, Wenzhao

arXiv.org Artificial Intelligence

In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that EyeVLA can effectively understand scenes in real-world environments and actively acquire more accurate visual information through instruction-driven rotation and zoom actions, achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision paradigm: it dynamically acquires highly informative visual data within given pixel and spatial budgets for environmental perception in multimodal autonomous systems.
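The action-token idea the abstract describes can be sketched concretely: continuous camera commands are quantized into discrete token ids so a VLM can emit them in the same autoregressive sequence as text. The bin counts and ranges below are illustrative assumptions, not the paper's values.

```python
# Hypothetical sketch of action tokenization: quantize continuous pan/tilt/zoom
# commands into discrete bins (action tokens) and decode bin centers back.

PAN_BINS, TILT_BINS, ZOOM_BINS = 64, 32, 16           # assumed vocab sizes
PAN_RANGE, TILT_RANGE, ZOOM_RANGE = (-180.0, 180.0), (-90.0, 90.0), (1.0, 8.0)

def _quantize(value, lo, hi, bins):
    value = min(max(value, lo), hi)                    # clamp to valid range
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)

def _dequantize(idx, lo, hi, bins):
    return lo + (idx + 0.5) / bins * (hi - lo)         # center of the bin

def encode_action(pan, tilt, zoom):
    """Map a continuous camera action to three discrete action tokens."""
    return (_quantize(pan, *PAN_RANGE, PAN_BINS),
            _quantize(tilt, *TILT_RANGE, TILT_BINS),
            _quantize(zoom, *ZOOM_RANGE, ZOOM_BINS))

def decode_action(tokens):
    """Recover an approximate continuous action from action tokens."""
    p, t, z = tokens
    return (_dequantize(p, *PAN_RANGE, PAN_BINS),
            _dequantize(t, *TILT_RANGE, TILT_BINS),
            _dequantize(z, *ZOOM_RANGE, ZOOM_BINS))
```

Quantization loses at most half a bin width per dimension, which is the usual trade-off when folding continuous control into a discrete token vocabulary.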



172fd0d638b3282151bd8f3d652cb640-AuthorFeedback.pdf

Neural Information Processing Systems

We first thank all reviewers for the valuable feedback. The number of parameters is calculated for the CUB dataset. As shown in Table 1, our model outperforms ResNet152 by 3.6% (71.8%. We will add more detailed analysis in the final version of the paper. Besides, we observe that more maps introduce attention redundancy, i.e., maps attend to the same region.


Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Tong, Jingqi, Mou, Yurong, Li, Hangcheng, Li, Mingzhe, Yang, Yongzhuo, Zhang, Ming, Chen, Qiguang, Liang, Tianyi, Hu, Xiaomeng, Zheng, Yining, Chen, Xinchi, Zhao, Jun, Huang, Xuanjing, Qiu, Xipeng

arXiv.org Artificial Intelligence

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision remain separate modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles) and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings suggest that video generation models are potential unified multimodal understanding and generation models, positioning "thinking with video" as a unified multimodal reasoning paradigm.
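The self-consistency technique the abstract mentions is simple enough to sketch: sample several candidate answers from the model and keep the majority vote. The `sample` callable below is a hypothetical stand-in for any stochastic model query.

```python
# Minimal sketch of self-consistency: draw n samples from a stochastic model
# and return the most frequent answer (majority vote over candidates).

from collections import Counter

def self_consistent_answer(sample, n=5):
    """Sample n candidate answers and return the most common one."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

For open-ended outputs, answers are usually normalized (e.g., extracting a final number) before voting; this sketch assumes directly comparable answers.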
