Markham, Andrew
Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
Xu, Shitong, Yang, Yiyuan, Trigoni, Niki, Markham, Andrew
Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio examples as conditioning inputs. However, such clean audio examples are not always readily available (e.g. It is impractical to obtain a clean audio example of a stranger's voice at a cocktail party without stepping away from the noisy environment). Limited prior research has explored extracting the target speaker's characteristics from noisy audio examples, which may include overlapping speech from disturbing speakers. In this work, we focus on target speaker extraction when multiple speakers are present during the enrollment stage, through leveraging differences between audio segments where the target speakers are speaking (Positive Enrollments) and segments where they are not (Negative Enrollments). Experiments show the effectiveness of our model architecture and the dedicated pretraining method for the proposed task. Our method achieves state-of-the-art performance in the proposed application settings and demonstrates strong generalizability across challenging and realistic scenarios.
RiTTA: Modeling Event Relations in Text-to-Audio Generation
He, Yuhang, Jain, Yash, Liu, Xubo, Markham, Andrew, Vineet, Vibhav
Despite significant advancements in Text-to-Audio (TTA) generation models achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. However, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models ability to model audio events relation. Code is available at: https://github.com/yuhanghe01/RiTTA
Towards Multi-Modal Animal Pose Estimation: A Survey and In-Depth Analysis
Deng, Qianyi, Deb, Oishi, Patel, Amir, Rupprecht, Christian, Torr, Philip, Trigoni, Niki, Markham, Andrew
Animal pose estimation (APE) aims to locate the animal body parts using a diverse array of sensor and modality inputs (e.g. RGB cameras, LiDAR, infrared, IMU, acoustic and language cues), which is crucial for research across neuroscience, biomechanics, and veterinary medicine. By evaluating 176 papers since 2011, APE methods are categorised by their input sensor and modality types, output forms, learning paradigms, experimental setup, and application domains, presenting detailed analyses of current trends, challenges, and future directions in single- and multi-modality APE systems. The analysis also highlights the transition between human and animal pose estimation, and how innovations in APE can reciprocally enrich human pose estimation and the broader machine learning paradigm. Additionally, 2D and 3D APE datasets and evaluation metrics based on different sensors and modalities are provided. A regularly updated project page is provided here: https://github.com/ChennyDeng/MM-APE.
HyperspectralViTs: General Hyperspectral Models for On-board Remote Sensing
Růžička, Vít, Markham, Andrew
On-board processing of hyperspectral data with machine learning models would enable unprecedented amount of autonomy for a wide range of tasks, for example methane detection or mineral identification. This can enable early warning system and could allow new capabilities such as automated scheduling across constellations of satellites. Classical methods suffer from high false positive rates and previous deep learning models exhibit prohibitive computational requirements. We propose fast and accurate machine learning architectures which support end-to-end training with data of high spectral dimension without relying on hand-crafted products or spectral band compression preprocessing. We evaluate our models on two tasks related to hyperspectral data processing. With our proposed general architectures, we improve the F1 score of the previous methane detection state-of-the-art models by 27% on a newly created synthetic dataset and by 13% on the previously released large benchmark dataset. We also demonstrate that training models on the synthetic dataset improves performance of models finetuned on the dataset of real events by 6.9% in F1 score in contrast with training from scratch. On a newly created dataset for mineral identification, our models provide 3.5% improvement in the F1 score in contrast to the default versions of the models. With our proposed models we improve the inference speed by 85% in contrast to previous classical and deep learning approaches by removing the dependency on classically computed features. With our architecture, one capture from the EMIT sensor can be processed within 30 seconds on realistic proxy of the ION-SCV 004 satellite.
SPEAR: Receiver-to-Receiver Acoustic Neural Warping Field
He, Yuhang, Xu, Shitong, Zhong, Jia-Xing, Shin, Sangyun, Trigoni, Niki, Markham, Andrew
Unlike traditional source-to-receiver modelling methods that require prior space acoustic properties knowledge to rigorously model audio propagation from source to receiver, we propose to predict by warping the spatial acoustic effects from one reference receiver position to another target receiver position, so that the warped audio essentially accommodates all spatial acoustic effects belonging to the target position. SPEAR can be trained in a data much more readily accessible manner, in which we simply ask two robots to independently record spatial audio at different positions. We further theoretically prove the universal existence of the warping field if and only if one audio source presents. Three physical principles are incorporated to guide SPEAR network design, leading to the learned warping field physically meaningful. We demonstrate SPEAR superiority on both synthetic, photo-realistic and real-world dataset, showing the huge potential of SPEAR to various down-stream robotic tasks.
Pre-training Feature Guided Diffusion Model for Speech Enhancement
Yang, Yiyuan, Trigoni, Niki, Markham, Andrew
Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments, improving communication and listening experiences. In this paper, we introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE) and leveraging pre-trained features for guidance during the reverse process, coupled with the utilization of the deterministic discrete integration method (DDIM) to streamline sampling steps, our model improves efficiency and speech enhancement quality. Demonstrating state-of-the-art results on two public datasets with different SNRs, our model outshines other baselines in efficiency and robustness. The proposed method not only optimizes performance but also enhances practical deployment capabilities, without increasing computational demands.
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition
Yeh, Chun-Hsiao, Cheng, Ta-Ying, Hsieh, He-Yen, Lin, Chuan-En, Ma, Yi, Markham, Andrew, Trigoni, Niki, Kung, H. T., Chen, Yubei
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to reliably extend to multiple concepts -- we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there lacks a holistic metric that evaluates performance on not just the degree of resemblance of personalized concepts, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we introduce Gen4Gen, a semi-automated dataset creation pipeline utilizing generative models to combine personalized concepts into complex compositions along with text-descriptions. Using this, we create a dataset called MyCanvas, that can be used to benchmark the task of multi-concept personalization. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better quantifying the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms.
Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation
Shin, Sangyun, Zhou, Kaichen, Vankadari, Madhu, Markham, Andrew, Trigoni, Niki
Coarse-to-fine 3D instance segmentation methods show weak performances compared to recent Grouping-based, Kernel-based and Transformer-based methods. We argue that this is due to two limitations: 1) Instance size overestimation by axis-aligned bounding box(AABB) 2) False negative error accumulation from inaccurate box to the refinement phase. In this work, we introduce Spherical Mask, a novel coarse-to-fine approach based on spherical representation, overcoming those two limitations with several benefits. Specifically, our coarse detection estimates each instance with a 3D polygon using a center and radial distance predictions, which avoids excessive size estimation of AABB. To cut the error propagation in the existing coarse-to-fine approaches, we virtually migrate points based on the polygon, allowing all foreground points, including false negatives, to be refined. During inference, the proposal and point migration modules run in parallel and are assembled to form binary masks of instances. We also introduce two margin-based losses for the point migration to enforce corrections for the false positives/negatives and cohesion of foreground points, significantly improving the performance. Experimental results from three datasets, such as ScanNetV2, S3DIS, and STPLS3D, show that our proposed method outperforms existing works, demonstrating the effectiveness of the new instance representation with spherical coordinates.
Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation
Zhong, Jia-Xing, Cheng, Ta-Ying, He, Yuhang, Lu, Kai, Zhou, Kaichen, Markham, Andrew, Trigoni, Niki
A truly generalizable approach to rigid segmentation and motion estimation is fundamental to 3D understanding of articulated objects and moving scenes. In view of the closely intertwined relationship between segmentation and motion estimates, we present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our architecture is composed of two interconnected, lightweight heads. These heads predict segmentation masks using point-level invariant features and estimate motion from SE(3) equivariant features, all without the need for category information. Our training strategy is unified and can be implemented online, which jointly optimizes the predicted segmentation and motion by leveraging the interrelationships among scene flow, segmentation mask, and rigid transformations. We conduct experiments on four datasets to demonstrate the superiority of our method. The results show that our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To the best of our knowledge, this is the first work designed for category-agnostic part-level SE(3) equivariance in dynamic point clouds.
Fast model inference and training on-board of Satellites
Růžička, Vít, Mateo-García, Gonzalo, Bridges, Chris, Brunskill, Chris, Purcell, Cormac, Longépé, Nicolas, Markham, Andrew
Artificial intelligence onboard satellites has the potential to reduce data transmission requirements, enable real-time decision-making and collaboration within constellations. This study deploys a lightweight foundational model called RaVAEn on D-Orbit's ION SCV004 satellite. RaVAEn is a variational auto-encoder (VAE) that generates compressed latent vectors from small image tiles, enabling several downstream tasks. In this work we demonstrate the reliable use of RaVAEn onboard a satellite, achieving an encoding time of 0.110s for tiles of a 4.8x4.8 km$^2$ area. In addition, we showcase fast few-shot training onboard a satellite using the latent representation of data. We compare the deployment of the model on the on-board CPU and on the available Myriad vision processing unit (VPU) accelerator. To our knowledge, this work shows for the first time the deployment of a multi-task model on-board a CubeSat and the on-board training of a machine learning model.