 azimuth


Deep Convolutional Inverse Graphics Network

Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, Josh Tenenbaum

Neural Information Processing Systems

This paper presents the Deep Convolution Inverse Graphics Network (DC-IGN), a model that aims to learn an interpretable representation of images, disentangled with respect to three-dimensional scene structure and viewing transformations such as depth rotations and lighting variations. The DC-IGN model is composed of multiple layers of convolution and de-convolution operators and is trained using the Stochastic Gradient Variational Bayes (SGVB) algorithm [10]. We propose a training procedure to encourage neurons in the graphics code layer to represent a specific transformation (e.g. pose or light).
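
As a rough illustration of the clamping idea behind that training procedure, the sketch below varies one scene factor per mini-batch and clamps every other latent unit to its batch mean, so only the designated unit can explain the variation. It is a minimal toy with hypothetical encoder/decoder sizes and a plain reconstruction loss; it omits the variational/KL machinery of SGVB used in the paper.

# Minimal sketch (not the authors' code) of the DC-IGN-style clamping idea:
# each mini-batch varies a single scene factor, the latent unit assigned to
# that factor keeps its per-example value, and all other latent units are
# clamped to their batch mean so they cannot absorb the varying factor.
import torch
import torch.nn as nn

latent_dim = 8
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.ReLU(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 32 * 32))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(batch, active_unit):
    """batch: images that differ only in one factor (e.g. azimuth);
    active_unit: index of the latent unit assigned to that factor."""
    z = encoder(batch)
    z_clamped = z.mean(dim=0, keepdim=True).expand_as(z).clone()  # clamp every unit to the batch mean...
    z_clamped[:, active_unit] = z[:, active_unit]                 # ...except the unit for the varying factor
    recon = decoder(z_clamped)
    loss = ((recon - batch.flatten(1)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# e.g. a batch of 16 renders of one face that differ only in azimuth:
loss = train_step(torch.rand(16, 1, 32, 32), active_unit=0)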


Generating Moving 3D Soundscapes with Latent Diffusion Models

Templin, Christian, Zhu, Yanda, Wang, Hao

arXiv.org Artificial Intelligence

Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in first-order Ambisonics (FOA). Recent FOA models extend text-to-audio generation but remain restricted to static sources. In this work, we introduce SonicMotion, the first end-to-end latent diffusion framework capable of generating FOA audio with explicit control over moving sound sources. SonicMotion is implemented in two variations: 1) a descriptive model conditioned on natural language prompts, and 2) a parametric model conditioned on both text and spatial trajectory parameters for higher precision. To support training and evaluation, we construct a new dataset of over one million simulated FOA caption pairs that include both static and dynamic sources with annotated azimuth, elevation, and motion attributes. Experiments show that SonicMotion achieves state-of-the-art semantic alignment and perceptual quality comparable to leading text-to-audio systems, while uniquely attaining low spatial localization error.
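
For intuition about the FOA targets that such a dataset pairs with azimuth/elevation trajectories, here is a minimal sketch (not SonicMotion's pipeline) of encoding a mono source moving along a per-sample trajectory into the four first-order Ambisonics channels, assuming the AmbiX convention (ACN channel order W, Y, Z, X with SN3D gains).

# Minimal sketch of encoding a moving mono source into first-order Ambisonics.
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """mono: (n,) samples; azimuth/elevation: (n,) radians, per-sample trajectory."""
    w = mono                                        # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])                   # (4, n) FOA signal

# Example: a 1 s, 440 Hz tone sweeping from -90 to +90 degrees azimuth at 0 degrees elevation.
sr = 16000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 440 * t),
                 azimuth=np.linspace(-np.pi / 2, np.pi / 2, sr),
                 elevation=np.zeros(sr))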


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Summary: the paper proposes a CNN for learning explicit image representations as an inverse graphics problem. The image representation has interpretable explicit components, in particular pose angles and lighting angles, along with implicit components (texture, appearance). This is done in an autoencoder framework with reconstruction error. To make a particular latent dimension focus on one aspect (e.g. pose), training mini-batches vary only that factor while the remaining latent dimensions are clamped. Experiments on two datasets show reconstructions of a 3D object at varying poses and illumination directions.


ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model

Heydari, Mojtaba, Souden, Mehrez, Conejo, Bruno, Atkins, Joshua

arXiv.org Artificial Intelligence

We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial audio codec that maps FOA audio to latent components; a latent diffusion model trained on various user input types, namely text prompts and spatial, temporal, and environmental acoustic parameters; and, optionally, a spatial audio and text encoder trained in a Contrastive Language and Audio Pretraining (CLAP) style. We propose metrics to evaluate the quality and spatial adherence of the generated spatial audio. Finally, we assess the model performance in terms of generation quality and spatial conformance, comparing the two proposed modes: "descriptive", which uses spatial text prompts, and "parametric", which uses non-spatial text prompts and spatial parameters. Our evaluations demonstrate promising results that are consistent with the user conditions and reflect reliable spatial fidelity.
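
The CLAP-style encoder mentioned above is trained contrastively on paired audio and text; below is a minimal sketch of the standard symmetric contrastive (InfoNCE) objective over such pairs. Embedding sizes and names are assumptions for illustration, not ImmerseDiffusion's implementation.

# Minimal sketch of a CLAP-style symmetric contrastive loss over paired audio/text embeddings.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of matching pairs."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(audio_emb.size(0))          # i-th audio matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))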


Are Doppler Velocity Measurements Useful for Spinning Radar Odometry?

Lisus, Daniil, Burnett, Keenan, Yoon, David J., Poulton, Richard, Marshall, John, Barfoot, Timothy D.

arXiv.org Artificial Intelligence

Spinning, frequency-modulated continuous-wave (FMCW) radars with 360 degree coverage have been gaining popularity for autonomous-vehicle navigation. However, unlike 'fixed' automotive radar, commercially available spinning radar systems typically do not produce radial velocities due to the lack of repeated measurements in the same direction and the fundamental hardware setup. To make these radial velocities observable, we modified the firmware of a commercial spinning radar to use triangular frequency modulation. In this paper, we develop a novel way to use this modulation to extract radial Doppler velocity measurements from single raw radar intensity scans without any required data association. We show that these noisy, error-prone measurements contain enough information to provide good ego-velocity estimates, and incorporate these estimates into different modern odometry pipelines. We extensively evaluate the pipelines on over 110 km of driving data in progressively more geometrically challenging autonomous-driving environments. We show that Doppler velocity measurements improve odometry in well-defined geometric conditions and enable it to continue functioning even in severely geometrically degenerate environments, such as long tunnels.
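
Once per-azimuth radial Doppler velocities are available, a standard way to turn them into an ego-velocity estimate is a linear least-squares fit of v_r(theta) ≈ v_x cos(theta) + v_y sin(theta) over a scan, assuming a mostly static scene. The sketch below illustrates only that baseline idea; it is not the estimator or outlier handling used in the paper.

# Minimal sketch of ego-velocity estimation from per-azimuth radial Doppler measurements.
import numpy as np

def ego_velocity_from_doppler(azimuths, radial_velocities):
    """azimuths: (n,) radians in the sensor frame; radial_velocities: (n,) m/s."""
    A = np.column_stack([np.cos(azimuths), np.sin(azimuths)])
    v, *_ = np.linalg.lstsq(A, radial_velocities, rcond=None)
    return v  # (v_x, v_y) of the sensor, up to sign convention

# Example: vehicle moving at 10 m/s forward, noisy Doppler over a full rotation.
az = np.linspace(0, 2 * np.pi, 400, endpoint=False)
vr = 10.0 * np.cos(az) + 0.3 * np.random.randn(az.size)
print(ego_velocity_from_doppler(az, vr))   # approx [10, 0]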


Improving Chinese Character Representation with Formation Tree

Hong, Yang, Li, Yinfei, Qiao, Xiaojun, Li, Rui, Zhang, Junsong

arXiv.org Artificial Intelligence

Learning effective representations for Chinese characters presents unique challenges, primarily due to the vast number of characters and their continuous growth, which requires models to handle an expanding category space. Additionally, the inherent sparsity of character usage complicates the generalization of learned representations. Prior research has explored radical-based sequences to overcome these issues, achieving progress in recognizing unseen characters. However, these approaches fail to fully exploit the inherent tree structure of such sequences. To address these limitations and leverage established data properties, we propose Formation Tree-CLIP (FT-CLIP). This model utilizes formation trees to represent characters and incorporates a dedicated tree encoder, significantly improving performance in both seen and unseen character recognition tasks. We further introduce masking for both character images and tree nodes, enabling efficient and effective training. This approach accelerates training significantly (by a factor of 2 or more) while enhancing accuracy. Extensive experiments show that processing characters through formation trees aligns better with their inherent properties than direct sequential methods, significantly enhancing the generality and usability of the representations.


Simulating Nighttime Visible Satellite Imagery of Tropical Cyclones Using Conditional Generative Adversarial Networks

Yao, Jinghuai, Du, Puyuan, Zhao, Yucheng, Wang, Yubo

arXiv.org Artificial Intelligence

Visible (VIS) satellite imagery has various important applications in meteorology, including monitoring Tropical Cyclones (TCs). However, it is unavailable at night because of the lack of sunlight. This study presents a Conditional Generative Adversarial Network (CGAN) model that generates highly accurate nighttime visible reflectance using infrared (IR) bands and sunlight-direction parameters as input. The model was trained and validated using daytime target-area observations of the Advanced Himawari Imager (AHI). This study also presents the first nighttime model validation using the Day/Night Band (DNB) of the Visible/Infrared Imager Radiometer Suite (VIIRS). The daytime results for the Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), Root Mean Square Error (RMSE), Correlation Coefficient (CC), and Bias are 0.885, 28.3, 0.0428, 0.984, and -0.0016 respectively, surpassing the performance of models in previous studies. The nighttime results for SSIM, PSNR, RMSE, and CC are 0.821, 24.4, 0.0643, and 0.969 respectively, slightly degraded by the parallax between the satellites. Full-disk validation shows that the model can also be readily applied over TC-free tropical ocean in the Northern Hemisphere. This model contributes to nighttime monitoring of meteorological phenomena by providing accurate AI-generated visible imagery with adjustable virtual sunlight directions.
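
For reference, scores of this kind can be computed with standard image-quality metrics; the sketch below shows one plausible way to evaluate a generated reflectance field against the observed one. Array names and the 0-1 reflectance range are assumptions, not the authors' evaluation code.

# Minimal sketch of the listed verification metrics (SSIM, PSNR, RMSE, CC, Bias).
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def reflectance_metrics(generated, observed, data_range=1.0):
    """generated, observed: 2D float arrays of VIS reflectance in [0, 1]."""
    diff = generated - observed
    return {
        "SSIM": structural_similarity(observed, generated, data_range=data_range),
        "PSNR": peak_signal_noise_ratio(observed, generated, data_range=data_range),
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),
        "CC": float(np.corrcoef(generated.ravel(), observed.ravel())[0, 1]),
        "Bias": float(np.mean(diff)),
    }

print(reflectance_metrics(np.random.rand(64, 64), np.random.rand(64, 64)))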


Doppler-aware Odometry from FMCW Scanning Radar

Rennie, Fraser, Williams, David, Newman, Paul, De Martini, Daniele

arXiv.org Artificial Intelligence

This work explores Doppler information from a millimetre-wave (mm-W) Frequency-Modulated Continuous-Wave (FMCW) scanning radar to make odometry estimation more robust and accurate. Firstly, Doppler information is added to the scan-masking process to enhance correlative scan matching. Secondly, we train a Neural Network (NN) to regress forward velocity directly from a single radar scan; we fuse this estimate with the correlative scan-matching estimate and show improved robustness to bad estimates caused by challenging environment geometries. We test our method on a novel custom dataset, which is released with this work at https://ori.ox.ac.uk/publications/datasets. Index Terms: radar odometry, Doppler, navigation, dataset. [Figure 1: Radar scan from the RDD dataset; the two extracted regions show the "zig-zag" pattern caused by the alternating modulation patterns in conjunction with the ego-vehicle speed.]
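
The fusion step described above combines two forward-velocity estimates; one common minimal recipe is inverse-variance weighting, sketched below with illustrative numbers. The form of the fusion and the variances are assumptions for illustration, not the paper's implementation.

# Minimal sketch of fusing two forward-velocity estimates by inverse-variance weighting.
def fuse_velocities(v_scan, var_scan, v_nn, var_nn):
    w_scan, w_nn = 1.0 / var_scan, 1.0 / var_nn
    v_fused = (w_scan * v_scan + w_nn * v_nn) / (w_scan + w_nn)
    var_fused = 1.0 / (w_scan + w_nn)
    return v_fused, var_fused

# Example: scan matching is degenerate (high variance), so the NN estimate dominates.
print(fuse_velocities(v_scan=2.0, var_scan=4.0, v_nn=9.5, var_nn=0.25))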


Decentralized shape formation and force-based interactive formation control in robot swarms

S, Akshaya C, Soma, Karthik, B, Visweswaran, Ravichander, Aditya, PM, Venkata Nagarjun

arXiv.org Artificial Intelligence

Swarm robotic systems utilize collective behaviour to achieve goals that might be too complex for a lone entity, but become attainable with localized communication and collective decision making. In this paper, a behaviour-based distributed approach to shape formation is proposed. Flocking into strategic formations is observed in migratory birds and fish to avoid predators and also for energy conservation. The formation is maintained throughout long periods without collapsing and is advantageous for communicating within the flock. Similar behaviour can be deployed in multi-agent systems to enhance coordination within the swarm. Existing methods for formation control are either dependent on the size and geometry of the formation or rely on maintaining the formation with a single reference in the swarm (the leader). These methods are not resilient to failure and involve a high degree of deformation upon obstacle encounter before the shape is recovered again. To improve performance, we introduce artificial force-based interactions amongst the entities of the swarm that maintain shape integrity while encountering obstacles.
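
As a rough sketch of what such artificial force-based interaction can look like (a generic potential-field formulation, not the controller proposed in the paper), each agent below is pulled toward its assigned formation slot and pushed away from nearby agents and obstacles.

# Minimal sketch of force-based formation keeping with obstacle repulsion.
import numpy as np

def formation_forces(positions, slots, obstacles, k_att=1.0, k_rep=0.5, radius=1.0):
    """positions, slots: (n, 2); obstacles: (m, 2). Returns per-agent force (n, 2)."""
    forces = k_att * (slots - positions)                       # spring pull toward assigned slot
    points = np.vstack([positions, obstacles])
    for i, p in enumerate(positions):
        d = p - points                                         # vectors pointing away from agents/obstacles
        dist = np.linalg.norm(d, axis=1)
        near = (dist > 1e-9) & (dist < radius)                 # ignore self and far-away points
        forces[i] += k_rep * np.sum(d[near] / dist[near, None] ** 3, axis=0)
    return forces

# One Euler step for 4 agents forming a square around the origin, with one obstacle.
pos = np.random.rand(4, 2) * 4 - 2
slots = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], float)
pos = pos + 0.1 * formation_forces(pos, slots, obstacles=np.array([[0.0, 0.0]]))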