Collaborating Authors

 Wang, Pengyu


CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

arXiv.org Artificial Intelligence

In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to a speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency-domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that the Mel frequency scale represents speech more compactly and is thus easier to learn, which benefits both speech quality and ASR. Experimental results on four English datasets and one Chinese dataset demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online at https://audio.westlake.edu.cn/Research/CleanMel.html.
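As a rough illustration of the interleaved design, here is a minimal PyTorch sketch (not the authors' code; layer types and sizes are hypothetical) in which narrow-band blocks model each Mel band independently along time and cross-band blocks mix information across bands within each frame:

```python
# Hedged sketch of interleaved cross-band / narrow-band Mel processing.
import torch
import torch.nn as nn

class NarrowBandBlock(nn.Module):
    """Model each Mel band independently along time (narrow-band properties)."""
    def __init__(self, hidden: int):
        super().__init__()
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, x):                    # x: (batch, freq, time, hidden)
        b, f, t, h = x.shape
        y, _ = self.rnn(x.reshape(b * f, t, h))
        return y.reshape(b, f, t, h)

class CrossBandBlock(nn.Module):
    """Mix information across Mel bands within each frame (full-band pattern)."""
    def __init__(self, hidden: int):
        super().__init__()
        self.mix = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)

    def forward(self, x):                    # x: (batch, freq, time, hidden)
        b, f, t, h = x.shape
        y = x.permute(0, 2, 3, 1).reshape(b * t, h, f)   # convolve along freq
        y = self.mix(y)
        return y.reshape(b, t, h, f).permute(0, 3, 1, 2)

class MelEnhancer(nn.Module):
    def __init__(self, hidden: int = 32, depth: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, hidden)
        self.blocks = nn.ModuleList(
            m for _ in range(depth)
            for m in (NarrowBandBlock(hidden), CrossBandBlock(hidden))
        )
        self.out = nn.Linear(hidden, 1)

    def forward(self, mel):                  # mel: (batch, freq, time)
        x = self.embed(mel.unsqueeze(-1))
        for blk in self.blocks:
            x = blk(x)
        return self.out(x).squeeze(-1)       # enhanced Mel-spectrogram

noisy = torch.randn(1, 80, 200)              # 80 Mel bands, 200 frames
print(MelEnhancer()(noisy).shape)            # torch.Size([1, 80, 200])
```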


Optimal Actuator Attacks on Autonomous Vehicles Using Reinforcement Learning

arXiv.org Artificial Intelligence

Recently, there has been a growing focus on the security and safety issues associated with these vehicles. Due to the high reliance of autonomous vehicles on software and communication systems, they are vulnerable to different types of attacks, as shown in Figure 1, which may lead to severe accidents [3]. Attacks on autonomous vehicles are typically categorized into those targeting actuators and those targeting sensors. Existing approaches have two limitations. First, they lack focus on the stealthiness of the attacks, which is crucial given that modern autonomous vehicle systems are equipped with advanced attack detectors. Second, the design of their secure controller is based on a specific FDI attack and training data obtained through RL, limiting its generalization to different types of attacks. These limitations motivate our research.


VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification

arXiv.org Artificial Intelligence

Reverberant speech, denoting a speech signal degraded by the process of reverberation, contains crucial knowledge of both the anechoic source speech and the room impulse response (RIR). This work proposes a variational Bayesian inference (VBI) framework with a neural speech prior (VINP) for joint speech dereverberation and blind RIR identification. In VINP, a probabilistic signal model is constructed in the time-frequency (T-F) domain based on the convolutive transfer function (CTF) approximation. For the first time, we propose using an arbitrary discriminative dereverberation deep neural network (DNN) to predict the prior distribution of anechoic speech within a probabilistic model. By integrating both the reverberant speech and the anechoic speech prior, VINP yields maximum a posteriori (MAP) and maximum likelihood (ML) estimates of the anechoic speech spectrum and the CTF filter, respectively. After simple transformations, the waveforms of the anechoic speech and the RIR are estimated. Moreover, VINP is effective for automatic speech recognition (ASR) systems, which sets it apart from most deep learning (DL)-based single-channel dereverberation approaches. Experiments on single-channel speech dereverberation demonstrate that VINP attains an advanced level on most metrics related to human perception and clearly state-of-the-art (SOTA) performance on ASR-related metrics. For blind RIR identification, experiments indicate that VINP attains the SOTA level in blind estimation of the reverberation time (RT60) and the direct-to-reverberation ratio (DRR). Code and audio samples are available online.
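For intuition, here is a toy NumPy sketch of the CTF approximation underlying VINP's signal model: in each frequency band, the reverberant STFT is modeled as the anechoic STFT convolved along the frame axis with a short band-wise filter. Shapes and values are arbitrary; this is not the paper's code.

```python
# Toy illustration of the CTF approximation: X(n, f) = sum_l H(l, f) S(n-l, f).
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, filt_len = 257, 100, 8

# Anechoic STFT S and per-band CTF filters H (complex-valued).
S = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))
H = rng.standard_normal((n_freq, filt_len)) + 1j * rng.standard_normal((n_freq, filt_len))

X = np.zeros_like(S)
for f in range(n_freq):                      # per-band convolution over frames
    X[f] = np.convolve(S[f], H[f])[:n_frames]
# VINP places a DNN-predicted prior on S and infers MAP(S) and ML(H) from X.
```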


FedAGHN: Personalized Federated Learning with Attentive Graph HyperNetworks

arXiv.org Artificial Intelligence

Personalized Federated Learning (PFL) aims to address the statistical heterogeneity of data across clients by learning a personalized model for each client. Among various PFL approaches, personalized aggregation-based methods conduct parameter aggregation in the server-side aggregation phase to generate personalized models, and focus on learning appropriate collaborative relationships among clients for aggregation. However, collaborative relationships vary across scenarios and even across stages of the FL process. To this end, we propose Personalized Federated Learning with Attentive Graph HyperNetworks (FedAGHN), which employs Attentive Graph HyperNetworks (AGHNs) to dynamically capture fine-grained collaborative relationships and generate client-specific personalized initial models. Specifically, AGHNs employ graphs to explicitly model client-specific collaborative relationships, construct collaboration graphs, and introduce a tunable attentive mechanism to derive collaboration weights, so that personalized initial models can be obtained by aggregating parameters over the collaboration graphs. Extensive experiments demonstrate the superiority of FedAGHN. Moreover, a series of visualizations is presented to explore the effectiveness of the collaboration graphs learned by FedAGHN.
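As a hedged illustration of personalized aggregation (not the paper's AGHN, which learns the weights with attentive graph hypernetworks), the sketch below derives per-client collaboration weights from parameter similarity and aggregates each client's personalized initial model accordingly:

```python
# Hypothetical sketch: attention-style collaboration weights over clients.
import torch

def personalized_aggregate(client_params: torch.Tensor, temperature: float = 1.0):
    """client_params: (num_clients, num_params) flattened client models."""
    sims = torch.nn.functional.cosine_similarity(
        client_params.unsqueeze(1), client_params.unsqueeze(0), dim=-1
    )                                         # (C, C) pairwise similarity
    weights = torch.softmax(sims / temperature, dim=-1)  # row i: client i's weights
    return weights @ client_params            # personalized initial model per client

clients = torch.randn(5, 1000)                # 5 clients, 1000-parameter models
personalized = personalized_aggregate(clients)
print(personalized.shape)                     # torch.Size([5, 1000])
```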


Learning-based Detection of GPS Spoofing Attack for Quadrotors

arXiv.org Artificial Intelligence

Safety-critical cyber-physical systems (CPS), such as quadrotor UAVs, are particularly prone to cyber attacks, which can result in significant consequences if not detected promptly and accurately. During outdoor operations, the nonlinear dynamics of UAV systems, combined with non-Gaussian noise, pose challenges to the effectiveness of conventional statistical and machine learning methods. To overcome these limitations, we present QUADFormer, an advanced attack detection framework for quadrotor UAVs leveraging a transformer-based architecture. This framework features a residue generator that produces sequences sensitive to anomalies, which are then analyzed by the transformer to capture statistical patterns for detection and classification. Furthermore, an alert mechanism ensures UAVs can operate safely even when under attack. Extensive simulations and experimental evaluations highlight that QUADFormer outperforms existing state-of-the-art techniques in detection accuracy.
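A minimal sketch of the residue-then-classify pattern described above, with a placeholder nominal dynamics model and a generic transformer encoder (dimensions and architecture are assumptions, not the QUADFormer implementation):

```python
# Sketch: residues from a nominal model feed a transformer-based detector.
import torch
import torch.nn as nn

def residues(states, controls, nominal_step):
    """Residue r_t = x_{t+1} - f(x_t, u_t) for a nominal dynamics model f."""
    preds = nominal_step(states[:, :-1], controls[:, :-1])
    return states[:, 1:] - preds              # (batch, T-1, state_dim)

state_dim, seq_len = 6, 64
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=state_dim, nhead=2, batch_first=True),
    num_layers=2,
)
head = nn.Linear(state_dim, 2)                # {nominal, under attack}

x = torch.randn(1, seq_len + 1, state_dim)    # toy state trajectory
u = torch.randn(1, seq_len + 1, 4)            # toy rotor commands
r = residues(x, u, lambda s, c: s)            # trivial placeholder dynamics
logits = head(encoder(r).mean(dim=1))         # classify the whole window
```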


ROLO-SLAM: Rotation-Optimized LiDAR-Only SLAM in Uneven Terrain with Ground Vehicle

arXiv.org Artificial Intelligence

LiDAR-based SLAM is recognized as an effective method for providing localization guidance in rough environments. However, off-the-shelf LiDAR-based SLAM methods suffer from significant pose estimation drift, particularly in components along the vertical direction, when traversing uneven terrain, which typically leads to a conspicuously distorted global map. In this article, a LiDAR-based SLAM method is presented to improve the accuracy of pose estimation for ground vehicles on rough terrain, termed Rotation-Optimized LiDAR-Only (ROLO) SLAM. The method exploits forward location prediction to coarsely eliminate the location difference between consecutive scans, thereby enabling separate and accurate determination of location and orientation at the front end. Furthermore, we adopt parallel-capable spatial voxelization for correspondence matching and develop a spherical-alignment-guided rotation registration within each voxel to estimate the rotation of the vehicle. By incorporating geometric alignment, we introduce a motion constraint into the optimization formulation to enable rapid and effective estimation of the LiDAR's translation. Subsequently, we extract several keyframes to construct a submap and align the current scan to the submap for precise pose estimation. Meanwhile, a global-scale factor graph is established to help reduce cumulative errors. Diverse experiments have been conducted in various scenes to evaluate our method. The results demonstrate that ROLO-SLAM excels in pose estimation for ground vehicles and outperforms existing state-of-the-art LiDAR SLAM frameworks.
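To illustrate the voxelization step, here is a generic voxel-hashing sketch (not the authors' implementation): points are bucketed by integer voxel coordinates so that correspondence candidates can be gathered, and each bucket processed, independently:

```python
# Generic spatial voxelization for correspondence search.
import numpy as np
from collections import defaultdict

def voxelize(points: np.ndarray, voxel_size: float):
    """Map each 3D point to its voxel key; buckets can be processed in parallel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    buckets = defaultdict(list)
    for idx, key in enumerate(map(tuple, keys)):
        buckets[key].append(idx)
    return buckets

scan = np.random.rand(10000, 3) * 50.0        # toy LiDAR scan (meters)
buckets = voxelize(scan, voxel_size=1.0)
# Within each voxel, ROLO-SLAM aligns local point sets (rotation registration)
# before estimating translation under a motion constraint.
```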


ChemDFM-X: Towards Large Multimodal Model for Chemistry

arXiv.org Artificial Intelligence

Rapid developments in AI tools are expected to offer unprecedented assistance to research in the natural sciences, including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMMs) can cover the wide range of chemical data modalities and task categories. To address the real demands of chemists, a cross-modal Chemical General Intelligence (CGI) system, serving as a truly practical and useful research assistant that exploits the great potential of LMMs, is in urgent demand. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora while significantly reducing expense, resulting in an instruction-tuning dataset containing 7.6M entries. After instruction finetuning, ChemDFM-X is evaluated through extensive experiments on different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.
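As a hedged example of deriving several modalities from one seed modality via cheap approximate computation, the sketch below starts from a SMILES string and produces a 2D structure, an approximate 3D conformer, and a molecular image. It assumes RDKit and is illustrative only, not the ChemDFM-X data pipeline.

```python
# Illustrative cross-modal data generation from a single SMILES string.
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

smiles = "CCO"                                # ethanol as the seed modality
mol = Chem.MolFromSmiles(smiles)

AllChem.Compute2DCoords(mol)
molblock_2d = Chem.MolToMolBlock(mol)         # 2D structure modality

mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=42)   # approximate 3D conformer
molblock_3d = Chem.MolToMolBlock(mol3d)       # 3D geometry modality

image = Draw.MolToImage(mol, size=(300, 300)) # molecular image modality
```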


LongSafetyBench: Long-Context LLMs Struggle with Safety Issues

arXiv.org Artificial Intelligence

WARNING: This paper contains unsafe content. With the development of large language models (LLMs), the sequence length of these models continues to increase, drawing significant attention to long-context language models. However, the evaluation of these models has been primarily limited to their capabilities, with a lack of research focusing on their safety. Existing work, such as ManyShotJailbreak, has to some extent demonstrated that long-context language models can exhibit safety concerns, but the methods used are limited and lack comprehensiveness. In response, we introduce LongSafetyBench, the first benchmark designed to objectively and comprehensively evaluate the safety of long-context models. LongSafetyBench consists of 10 task categories, with an average length of 41,889 words. After testing eight long-context language models on LongSafetyBench, we found that existing models generally exhibit insufficient safety capabilities. The proportion of safe responses from most mainstream long-context LLMs is below 50%. Moreover, models' safety performance in long-context scenarios does not always align with that in short-context scenarios. Further investigation revealed that long-context models tend to overlook harmful content within lengthy texts. We also propose a simple yet effective solution, allowing open-source models to achieve performance comparable to that of top-tier closed-source models. We believe that LongSafetyBench can serve as a valuable benchmark for evaluating the safety capabilities of long-context language models. We hope that our work will encourage the broader community to pay attention to the safety of long-context models and contribute to the development of solutions to improve the safety of long-context LLMs. Recently, thanks to more advanced model architectures (Xiao et al., 2024b;a; Liu et al., 2024a) and expanded position encoding techniques (Su et al., 2023; Liu et al., 2024b), the context length of language models has been extended significantly (Achiam et al., 2023; Reid et al., 2024). In the foreseeable future, as language models continue to evolve and tackle increasingly complex problems, the demand for handling longer contexts is expected to grow accordingly. We anticipate that long-context language models will become mainstream. Previous research on long-context language models, such as LongBench (Bai et al., 2024), L-Eval (An et al., 2023), and RULER (Hsieh et al., 2024), has typically focused on their capabilities while neglecting their safety. In short-context scenarios, the safety issues of language models have already been extensively studied (Zhang et al.).

Figure: example LongSafetyBench task categories include Illegal Activities, Misinformation Harm, and Offensiveness and Bias; each question is composed of long content.


BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

arXiv.org Artificial Intelligence

Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices: weight matrices are iteratively decomposed into residual blocks, and these blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization.

Figure 1: BitStack enables LLMs to dynamically adjust their size in variable memory environments (a) at a megabyte level, while still matching or surpassing the performance of practical compression methods such as GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2024) with the same memory footprint (b).

Large language models (LLMs) have demonstrated superior performance on various benchmarks (Achiam et al., 2023; Dubey et al., 2024) and are increasingly serving as practical assistants in people's daily lives, such as general language assistants (OpenAI, 2024; Google, 2024; Anthropic, 2024), search engines (Perplexity.AI, 2024), and code assistants (GitHub, 2024). With the blessing of scaling laws (Kaplan et al., 2020), LLMs are becoming more powerful as their sizes expand, and the main bottleneck for deploying task-capable LLMs has shifted from their capability to their availability.
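The following toy loop (a sketch of the general idea, not BitStack's exact decomposition) shows how iterative residual blocks can be stacked and loaded incrementally, with reconstruction quality improving as more blocks fit in memory:

```python
# Toy residual decomposition: each block stores ~1 bit/parameter plus a scale.
import torch

def decompose(weight: torch.Tensor, num_blocks: int):
    blocks, residual = [], weight.clone()
    for _ in range(num_blocks):
        scale = residual.abs().mean()         # per-block scalar scale
        sign = torch.sign(residual)           # ~1 bit per parameter
        blocks.append((scale, sign))
        residual = residual - scale * sign    # refine on the remaining error
    return blocks

def reconstruct(blocks, budget: int):
    """Load only the first `budget` blocks, e.g. under a tight memory cap."""
    return sum(scale * sign for scale, sign in blocks[:budget])

W = torch.randn(256, 256)
blocks = decompose(W, num_blocks=8)
for k in (2, 4, 8):
    err = float((W - reconstruct(blocks, k)).norm() / W.norm())
    print(f"{k} blocks loaded -> relative error {err:.3f}")
```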


MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

arXiv.org Artificial Intelligence

Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from vast text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential. Existing alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically embed predefined preferences directly within the model's parameters. These methods, however, often result in a static alignment that cannot account for the diversity of human preferences in practical applications. In response to this challenge, we propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time. Experimental results show that LLMs optimized on our meticulously constructed MetaAlign Dataset can effectively align with any preferences specified at the inference stage, validating the feasibility of MetaAlign. We hope that our work can provide some insights into the alignment of language models.
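For concreteness, here is a sketch of the inference-time usage pattern MetaAlign targets, where preferences accompany the query instead of being baked into the weights. The message format is generic and the names are placeholders, not the paper's interface.

```python
# Generic chat-message construction with inference-time preferences.
def build_prompt(preferences: list[str], query: str) -> list[dict]:
    preference_text = "\n".join(f"- {p}" for p in preferences)
    return [
        {"role": "system",
         "content": "Follow these user preferences when answering:\n" + preference_text},
        {"role": "user", "content": query},
    ]

messages = build_prompt(
    preferences=["Answer concisely", "Cite sources when making factual claims"],
    query="Explain what Direct Preference Optimization does.",
)
# A MetaAlign-tuned model is trained so its answer adapts to whatever
# preferences appear here, including ones unseen during training.
```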