 Virtual Reality


ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

Neural Information Processing Systems

Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays (HMDs), where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip (SoC) of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, delivers up to 3.36× latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences. The paper website is at https://zmzfpc.github.io/ESCA/.
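
For readers unfamiliar with post-training quantization, the sketch below shows the generic idea with plain symmetric uniform 4-bit quantization. It is not ESCA's tailored method, which the abstract does not describe in detail; the function names and sample weights are hypothetical and chosen only for illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Quantize floats to signed integers in [-qmax, qmax] with one shared scale.

    Generic symmetric uniform quantization (an assumption for illustration,
    not the scheme proposed in the paper).
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate float values."""
    return [v * scale for v in q]


# Made-up weights standing in for a trained layer.
weights = [0.70, -0.30, 0.12, 0.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

No retraining is involved: the weights are rounded after training, which is why the rounding error (here at most half a step, `scale / 2`) has to be kept small enough not to degrade output quality.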


[Cover figure: 360 image to video, video depth estimation model, merge]

Neural Information Processing Systems

To mitigate the distortions brought by equirectangular projection, existing methods typically divide 360 images into distortion-less perspective patches. However, since these patches are processed independently, depth inconsistencies are often introduced due to scale drift among patches. Recently, video depth estimation (VDE) models have leveraged temporal consistency for stable depth predictions across frames. Inspired by this, we propose to represent a 360 image as a sequence of perspective frames, mimicking the viewpoint adjustments users make when exploring a 360 scenario in virtual reality. Thus, the spatial consistency among perspective depth patches can be enhanced by exploiting the temporal consistency inherent in VDE models. To this end, we introduce a training-free pipeline for 360 monocular depth estimation, called ST2360D.
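
To make the idea of viewing a 360 image as a sequence of perspective frames concrete, the sketch below maps a pixel of a virtual pinhole camera to its location in an equirectangular panorama, then sweeps the camera's yaw to produce an ordered list of views. It is only an illustration under assumed conventions (pinhole model, x right, y up, z forward, a fixed 90-degree field of view, a 30-degree yaw step); it is not the ST2360D pipeline, and all names are hypothetical.

```python
import math


def perspective_pixel_to_equirect(u, v, width, height, fov_deg,
                                  yaw_deg, pitch_deg, pano_w, pano_h):
    """Return (px, py) in the panorama for pixel (u, v) of a perspective view."""
    # Focal length in pixels from the horizontal field of view.
    f = 0.5 * width / math.tan(math.radians(fov_deg) / 2)
    # Ray in the camera frame: x right, y up, z forward.
    x, y, z = u - width / 2, height / 2 - v, f
    # Pitch: rotate about the x axis (positive looks up).
    p = math.radians(pitch_deg)
    y, z = y * math.cos(p) + z * math.sin(p), -y * math.sin(p) + z * math.cos(p)
    # Yaw: rotate about the y axis (positive turns right).
    t = math.radians(yaw_deg)
    x, z = x * math.cos(t) + z * math.sin(t), -x * math.sin(t) + z * math.cos(t)
    # Convert the ray direction to longitude/latitude, then to pixel coordinates.
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    px = (lon / (2 * math.pi) + 0.5) * pano_w
    py = (0.5 - lat / math.pi) * pano_h
    return px, py


def yaw_sweep(step_deg=30):
    """Yaw angles for one full horizontal turn, ordered like video frames."""
    return list(range(0, 360, step_deg))


# Panorama location of the centre pixel of each frame in the sweep.
centres = [perspective_pixel_to_equirect(128, 128, 256, 256, 90, yaw, 0, 2048, 1024)
           for yaw in yaw_sweep()]
```

Because consecutive views in the sweep overlap, they resemble adjacent frames of a video, which is the property the abstract relies on when it hands the sequence to a video depth estimation model.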


Qualcomm unveils its Snapdragon Reality Elite chip for next-gen AR headsets

Engadget

The company also debuted a new platform for brands wanting to build their own AI glasses. High-end augmented reality and mixed reality devices are set to get a boost thanks to Qualcomm's latest XR chip. During a keynote at Augmented World Expo (AWE), the company unveiled its Snapdragon Reality Elite processor, which it says will allow the next generation of AR and mixed reality headsets to be smaller and more efficient. In terms of specs, the Snapdragon Reality Elite can support up to 4.4K resolution in each eye at 90 fps, a modest upgrade from the XR2+ Gen 2, but one that Qualcomm says will enable better image quality and lower latency. It also delivers significant improvements in efficiency, with up to a 20 percent boost in battery life while running up to 12 degrees Celsius (about 22 degrees Fahrenheit) cooler than the XR2+ Gen 2. Performance-wise, Reality Elite comes with notable gains over the previous generation as well.


You Can Finally Buy Snap's New AR Specs--for $2,195

WIRED

Snap CEO Evan Spiegel lays out the company's vision for its augmented-reality smart glasses, arriving later this year. Snap--maker of the popular social app Snapchat--has a new pair of augmented-reality smart glasses called Specs. Snap CEO Evan Spiegel revealed the new glasses at an event during the Augmented World Expo (AWE) tech conference in Long Beach, California. As Snap frames it, this isn't a prototype or developer device--it's the first actual consumer version of the Specs AR glasses, unlike the previous generation exclusively sold to developers and creators. Snap says it expects the devices to ship this fall in the US, UK, and France.


DIFFSSR: Stereo Image Super-resolution Using Differential Transformer

Neural Information Processing Systems

In the field of computer vision, the task of stereo image super-resolution (StereoSR) has garnered significant attention due to its potential applications in augmented reality, virtual reality, and autonomous driving. Traditional Transformer-based models, while powerful, often suffer from attention noise, leading to suboptimal reconstruction issues in super-resolved images. This paper introduces DIFFSSR, a novel neural network architecture designed to address these challenges. We introduce the Diff Cross Attention Block (DCAB) and the Sliding Stereo Cross-Attention Module (SSCAM) to enhance feature integration and mitigate the impact of attention noise.


EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Videos Generation

Neural Information Processing Systems

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of first-person viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses over 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleansing pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.


Inside Anduril and Meta's quest to make smart glasses for warfare

MIT Technology Review

It's been a year since the duo entered the US Army's troubled augmented-reality contest. Here's what it looks like so far. The defense-tech company Anduril has shared new details about the augmented-reality headset for the military it's prototyping with Meta, including a vision for ordering drone strikes via eye-tracking and voice commands. Quay Barnett, who leads the efforts as a vice president at Anduril following a career in the Army's Special Operations Command, says his fundamental goal is to optimize "the human as a weapons system." The vision is undoubtedly cyborg-inspired: Barnett wants drones and soldiers to see together, share information seamlessly, and make decisions as one. Anduril actually has two such projects in the works.


Is VR gaming now dead in the water?

PCWorld

PCWorld examines whether VR gaming is declining, highlighting challenges from Meta's failed Metaverse push and lack of compelling new content. Rising AI-driven hardware costs are making Valve's upcoming Steam Frame headset potentially unaffordable, while Apple's Vision Pro lacks gaming presence. Only Valve remains committed to VR gaming among major companies, making the technology's future uncertain despite continued development efforts. Meta is looking a lot less meta lately, reportedly pivoting from the virtual reality Quest brand and the ghost of Oculus to double down on pervert glasses. After a decade of work, Sony's VR ambitions over on the PlayStation seem to have made little progress. And I've barely heard a mention of Samsung's Galaxy XR headset--allegedly the flagship launch device for Android XR--since it arrived six months ago. While the idea that Apple is abandoning its Vision Pro headset might be overblown--the company is still actively hiring for the division--Michael Simon over at Macworld tells me the platform has basically zero gaming presence. Hope for renewed interest in VR gaming with a big injection of Cupertino branding power has evaporated. Is virtual reality gaming, to borrow a term, cooked?


Assessor360: Multi-sequence Network for Blind Omnidirectional Image Quality Assessment

Neural Information Processing Systems

Blind Omnidirectional Image Quality Assessment (BOIQA) aims to objectively assess the human perceptual quality of omnidirectional images (ODIs) without relying on pristine-quality image information. It is becoming more significant with the increasing advancement of virtual reality (VR) technology. However, the quality assessment of ODIs is severely hampered by the fact that the existing BOIQA pipeline lacks the modeling of the observer's browsing process. To tackle this issue, we propose a novel multi-sequence network for BOIQA called Assessor360, which is derived from the realistic multi-assessor ODI quality assessment procedure. Specifically, we propose a generalized Recursive Probability Sampling (RPS) method for the BOIQA task, combining content and details information to generate multiple pseudo viewport sequences from a given starting point.