 Banff


DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model

arXiv.org Artificial Intelligence

Text-to-audio (TTA), which generates audio signals from textual descriptions, has received huge attention in recent years. However, recent works have focused on text-to-monaural audio only. Spatial audio provides a more immersive auditory experience than monaural audio, e.g., in virtual reality. To address this issue, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) for extracting the latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model from the latent acoustic representations and text features for the spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. In particular, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features. One is the Mel spectrogram, which is good for improving the synthesis quality, and the other is the short-time Fourier transform spectrogram, which is good at improving the azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts, for training the VAEs and the diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.


Combining Planning and Reinforcement Learning for Solving Relational Multiagent Domains

arXiv.org Artificial Intelligence

Multiagent Reinforcement Learning (MARL) poses significant challenges due to the exponential growth of state and action spaces and the non-stationary nature of multiagent environments. This results in notable sample inefficiency and hinders generalization across diverse tasks. The complexity is further pronounced in relational settings, where domain knowledge is crucial but often underutilized by existing MARL algorithms. To overcome these hurdles, we propose integrating relational planners as centralized controllers with efficient state abstractions and reinforcement learning. This approach proves to be sample-efficient and facilitates effective task transfer and generalization.


When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

arXiv.org Artificial Intelligence

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.


Steganography Beyond Space-Time With Chain of Multimodal AI Agents

arXiv.org Artificial Intelligence

Steganography is the art and science of covert writing, with a broad range of applications interwoven within the realm of cybersecurity. As artificial intelligence continues to evolve, its ability to synthesise realistic content emerges as a threat in the hands of cybercriminals who seek to manipulate and misrepresent the truth. Such synthetic content introduces a non-trivial risk of overwriting the subtle changes made for the purpose of steganography. When the signals in both the spatial and temporal domains are vulnerable to unforeseen overwriting, it calls for reflection on what can remain invariant after all. This study proposes a paradigm in steganography for audiovisual media, where messages are concealed beyond both spatial and temporal domains. A chain of multimodal agents is developed to deconstruct audiovisual content into a cover text, embed a message within the linguistic domain, and then reconstruct the audiovisual content through synchronising both aural and visual modalities with the resultant stego text. The message is encoded by biasing the word sampling process of a language generation model and decoded by analysing the probability distribution of word choices. The accuracy of message transmission is evaluated under both zero-bit and multi-bit capacity settings. Fidelity is assessed through both biometric and semantic similarities, capturing the identities of the recorded face and voice, as well as the core ideas conveyed through the media. Secrecy is examined through statistical comparisons between cover and stego texts. Robustness is tested across various scenarios, including audiovisual compression, face-swapping, voice-cloning and their combinations.
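As a toy illustration of encoding a message by biasing word sampling, the sketch below hides one bit per word: a shared secret and the previous word split a small vocabulary into two halves, and the encoder samples only from the half that matches the next message bit. The decoder recomputes the same split and reads the bit back. The vocabulary, the uniform stand-in for a language model, and the names `split_vocab`, `embed`, and `extract` are all assumptions made for illustration; this is not the agent pipeline described in the abstract.

```python
import hashlib
import random

# Hypothetical toy vocabulary; a real system would use a language model's vocabulary.
VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under",
         "mat", "tree", "quietly", "today"]

def split_vocab(secret, prev_word):
    """Split VOCAB into two halves in an order keyed by secret and prev_word."""
    def rank(word):
        return hashlib.sha256(f"{secret}|{prev_word}|{word}".encode()).hexdigest()
    ranked = sorted(VOCAB, key=rank)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

def embed(bits, secret, rng):
    """Sample one word per bit, restricted to the half that encodes that bit."""
    words, prev = [], "<s>"
    for bit in bits:
        halves = split_vocab(secret, prev)
        word = rng.choice(halves[bit])  # uniform sampling stands in for a language model
        words.append(word)
        prev = word
    return words

def extract(words, secret):
    """Recover the bits by checking which half each word falls in."""
    bits, prev = [], "<s>"
    for word in words:
        zero_half, _ = split_vocab(secret, prev)
        bits.append(0 if word in zero_half else 1)
        prev = word
    return bits

message = [1, 0, 1, 1, 0, 0, 1, 0]
stego_words = embed(message, secret="shared-secret", rng=random.Random(0))
recovered = extract(stego_words, secret="shared-secret")
```

With the same secret, `recovered` equals `message`. A realistic scheme would bias a model's token probabilities instead of sampling uniformly, so that the stego text stays fluent.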


Provable Performance Bounds for Digital Twin-driven Deep Reinforcement Learning in Wireless Networks: A Novel Digital-Twin Bisimulation Metric

arXiv.org Artificial Intelligence

Digital twin (DT)-driven deep reinforcement learning (DRL) has emerged as a promising paradigm for wireless network optimization, offering a safe and efficient training environment for policy exploration. However, in theory existing methods cannot always guarantee real-world performance of DT-trained policies before actual deployment, due to the absence of a universal metric for assessing a DT's ability to support reliable DRL training transferable to physical networks. In this paper, we propose the DT bisimulation metric (DT-BSM), a novel metric based on the Wasserstein distance, to quantify the discrepancy between Markov decision processes (MDPs) in both the DT and the corresponding real-world wireless network environment. We prove that for any DT-trained policy, the sub-optimality of its performance (regret) in the real-world deployment is bounded by a weighted sum of the DT-BSM and its sub-optimality within the MDP in the DT. Then, a modified DT-BSM based on the total variation distance is also introduced to avoid the prohibitive calculation complexity of the Wasserstein distance for large-scale wireless network scenarios. Further, to tackle the challenge of obtaining accurate transition probabilities of the MDP in the real world for the DT-BSM calculation, we propose an empirical DT-BSM method based on statistical sampling. We prove that the empirical DT-BSM always converges to the desired theoretical one, and quantitatively establish the relationship between the required sample size and the target level of approximation accuracy.
Index Terms: Digital twin, Markov decision process (MDP), deep reinforcement learning (DRL), transfer learning, bisimulation metric.
The long-term evolution of cellular networks, marked by growing scale, density, and heterogeneity, substantially increases the difficulty of wireless network optimization [1]. Deep reinforcement learning (DRL) emerges as a promising solution for tackling extensive state and action spaces and nonconvex optimization problems. It has been successfully applied to various network optimization tasks, such as admission control [2], resource allocation [3], node selection [4], and task offloading [5] in wireless networks.
Z. Tao, W. Xu, and X. You are with the National Mobile Communications Research Lab, Southeast University, Nanjing 210096, China, and also with the Pervasive Communication Research Center, Purple Mountain Laboratories, Nanjing 211111, China (email: {zhenyu tao, wxu, xhyu }@seu.edu.cn).
To overcome these issues, the concept of digital twin (DT) has been introduced [7].
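The total variation distance and the sampling-based estimate mentioned in the abstract above can be illustrated on a single transition distribution. The sketch below is not the DT-BSM itself, whose definition the abstract does not give; it only shows the underlying distance between a hypothetical digital-twin distribution and a hypothetical real-world one, and how an estimate built from sampled next states approaches the exact value as the sample size grows. The distributions and function names are assumptions.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def empirical_distribution(samples, num_states):
    """Estimate next-state probabilities from sampled next-state indices."""
    counts = np.bincount(samples, minlength=num_states)
    return counts / counts.sum()

# Hypothetical next-state distributions for one state-action pair.
p_real = np.array([0.7, 0.2, 0.1])  # physical network (unknown in practice)
p_twin = np.array([0.6, 0.3, 0.1])  # digital twin

exact_tv = total_variation(p_real, p_twin)

# Replace the unknown real-world distribution by an estimate from samples.
rng = np.random.default_rng(0)
errors = {}
for n in (100, 10_000, 1_000_000):
    samples = rng.choice(3, size=n, p=p_real)
    p_hat = empirical_distribution(samples, num_states=3)
    errors[n] = abs(total_variation(p_hat, p_twin) - exact_tv)
```

Here `errors` maps each sample size to the gap between the sampled and exact distances, which gives a rough feel for the sample-size versus accuracy trade-off that the paper quantifies formally.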


Sharper Concentration Inequalities for Multi-Graph Dependent Variables

arXiv.org Machine Learning

In multi-task learning (MTL) with each task involving graph-dependent data, generalization results of existing theoretical analyses yield a sub-optimal risk bound of $O(\frac{1}{\sqrt{n}})$, where $n$ is the number of training samples. This is attributed to the lack of a foundational sharper concentration inequality for multi-graph dependent random variables. To fill this gap, this paper proposes a new corresponding Bennett inequality, enabling the derivation of a sharper risk bound of $O(\frac{\log n}{n})$. Specifically, building on the proposed Bennett inequality, we propose a new corresponding Talagrand inequality for the empirical process and further develop an analytical framework of the local Rademacher complexity to enhance theoretical generalization analyses in MTL with multi-graph dependent data. Finally, we apply the theoretical advancements to applications such as Macro-AUC Optimization, demonstrating the superiority of our theoretical results over previous work, which is also corroborated by experimental results.
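For reference, the classical Bennett inequality for independent variables is stated below. This is the textbook independent-case result, not the paper's new inequality for multi-graph dependent variables; the symbols $X_i$, $b$, $\sigma^2$, and $h$ are introduced here only for illustration.

```latex
% Classical Bennett inequality (independent case, textbook form).
% Assumptions: X_1, ..., X_n independent, E[X_i] = 0, X_i <= b almost surely,
% and sigma^2 = sum_i E[X_i^2].
\[
  \Pr\!\left( \sum_{i=1}^{n} X_i \ge t \right)
  \le \exp\!\left( -\frac{\sigma^2}{b^2}\, h\!\left( \frac{b t}{\sigma^2} \right) \right),
  \qquad
  h(u) = (1+u)\log(1+u) - u, \quad t \ge 0.
\]
% Since h(u) >= u^2 / (2 (1 + u/3)) for u >= 0, Bernstein's inequality follows:
\[
  \Pr\!\left( \sum_{i=1}^{n} X_i \ge t \right)
  \le \exp\!\left( -\frac{t^2}{2\left( \sigma^2 + b t / 3 \right)} \right).
\]
```

Because these bounds depend on the variance $\sigma^2$ and not only on the range $b$, inequalities of this type are a common ingredient in moving from $O(\frac{1}{\sqrt{n}})$ rates to faster ones in local Rademacher complexity analyses.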


Actively Inferring Optimal Measurement Sequences

arXiv.org Machine Learning

Measurement of a physical quantity such as light intensity is an integral part of many reconstruction and decision scenarios but can be costly in terms of acquisition time, invasion of or damage to the environment, and storage. Data minimisation and compliance with data protection laws are also important considerations. Where there is a range of measurements that can be made, some may be more informative and compliant with the overall measurement objective than others. We develop an active sequential inference algorithm that uses the low dimensional representational latent space from a variational autoencoder (VAE) to choose which measurement to make next. Our aim is to recover high dimensional data by making as few measurements as possible. We adapt the VAE encoder to map partial data measurements on to the latent space of the complete data. The algorithm draws samples from this latent space and uses the VAE decoder to generate data conditional on the partial measurements. Estimated measurements are made on the generated data and fed back through the partial VAE encoder to the latent space where they can be evaluated prior to making a measurement. Starting from no measurements and a normal prior on the latent space, we consider alternative strategies for choosing the next measurement and updating the predictive posterior prior for the next step. The algorithm is illustrated using the Fashion MNIST dataset and a novel convolutional Hadamard pattern measurement basis. We see that useful patterns are chosen within 10 steps, leading to the convergence of the guiding generative images. Compared with using stochastic variational inference to infer the parameters of the posterior distribution for each generated data point individually, the partial VAE framework can efficiently process batches of generated data and obtains superior results with minimal measurements.
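To make the idea of pattern measurements concrete, the sketch below measures a signal against the rows of a plain Hadamard matrix and reconstructs it from all, and then from only a few, of those measurements. It uses the standard Sylvester construction rather than the convolutional Hadamard basis introduced in the paper, and it takes the measurements in a fixed order rather than choosing them actively; the signal and the variable names are assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 16
H = hadamard(n)
rng = np.random.default_rng(0)
x = rng.random(n)                 # hypothetical flattened signal, e.g. a tiny image

y = H @ x                         # one measurement per +/-1 pattern (row of H)
x_full = H.T @ y / n              # exact recovery, since H @ H.T = n * I

k = 4                             # keep only the first k measurements
x_partial = H[:k].T @ y[:k] / n   # minimum-norm signal consistent with them
```

With all `n` measurements the signal is recovered exactly; with only `k` of them, `x_partial` reproduces the observed measurements but leaves the rest of the signal undetermined, which is the gap a learned prior such as a VAE is meant to fill.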


Planning, scheduling, and execution on the Moon: the CADRE technology demonstration mission

arXiv.org Artificial Intelligence

NASA's Cooperative Autonomous Distributed Robotic Exploration (CADRE) mission, slated for flight to the Moon's Reiner Gamma region in 2025/2026, is designed to demonstrate multi-agent autonomous exploration of the Lunar surface and sub-surface. A team of three robots and a base station will autonomously explore a region near the lander, collecting the data required for 3D reconstruction of the surface with no human input; and then autonomously perform distributed sensing with multi-static ground penetrating radars (GPR), driving in formation while performing coordinated radar soundings to create a map of the subsurface. At the core of CADRE's software architecture is a novel autonomous, distributed planning, scheduling, and execution (PS&E) system. The system coordinates the robots' activities, planning and executing tasks that require multiple robots' participation while ensuring that each individual robot's thermal and power resources stay within prescribed bounds, and respecting ground-prescribed sleep-wake cycles. The system uses a centralized-planning, distributed-execution paradigm, and a leader election mechanism ensures robustness to failures of individual agents. In this paper, we describe the architecture of CADRE's PS&E system; discuss its design rationale; and report on verification and validation (V&V) testing of the system on CADRE's hardware in preparation for deployment on the Moon.


SWA-LDM: Toward Stealthy Watermarks for Latent Diffusion Models

arXiv.org Artificial Intelligence

In the rapidly evolving landscape of image generation, Latent Diffusion Models (LDMs) have emerged as powerful tools, enabling the creation of highly realistic images. However, this advancement raises significant concerns regarding copyright infringement and the potential misuse of generated content. Current watermarking techniques employed in LDMs often embed constant signals to the generated images that compromise their stealthiness, making them vulnerable to detection by malicious attackers. In this paper, we introduce SWA-LDM, a novel approach that enhances watermarking by randomizing the embedding process, effectively eliminating detectable patterns while preserving image quality and robustness. Our proposed watermark presence attack reveals the inherent vulnerabilities of existing latent-based watermarking methods, demonstrating how easily these can be exposed. Through comprehensive experiments, we validate that SWA-LDM not only fortifies watermark stealthiness but also maintains competitive performance in watermark robustness and visual fidelity. This work represents a pivotal step towards securing LDM-generated images against unauthorized use, ensuring both copyright protection and content integrity in an era where digital image authenticity is paramount.


Accuracy and Robustness of Weight-Balancing Methods for Training PINNs

arXiv.org Artificial Intelligence

However, like any deep learning method, PINNs inherit stochastic properties from their underlying architecture, which can lead to challenges in convergence, sensitivity to initial conditions, and variability in performance [2]. These issues pose barriers to achieving robust and efficient training, particularly for large-scale or complex systems. Deep learning research has long recognized the impact of stochasticity on training outcomes, with factors such as parameter initialization, optimizer design, and data representation playing critical roles. For instance, the seminal work of Glorot and Bengio [3] showed that some initialization strategies are better than others, especially for large and deep neural networks. Based on this observation, they improved initialization schemes to address issues of vanishing or exploding gradients, significantly enhancing the training of deep neural networks. Despite these advances, PINNs differ from other classical deep learning algorithms because they incorporate gradient information, and they therefore remain susceptible to instabilities and inefficiencies during training [4, 5]. Multiple attempts have been made to improve PINNs' accuracy and efficiency, including pretraining [6, 7], reformulations of the underlying mathematical problem [8, 9], novel architectures [10, 11], new learning paradigms such as meta-learning and curriculum learning [12, 13], and loss reweighting techniques to balance competing objectives [14, 15, 16]. Because clear metrics are lacking, these techniques have not been rigorously compared, which limits their practical adoption. To address these challenges, we propose a probabilistic framework for improving the convergence properties of PINNs.