
Collaborating Authors: Zhang, Pengfei


HiCMamba: Enhancing Hi-C Resolution and Identifying 3D Genome Structures with State Space Modeling

arXiv.org Artificial Intelligence

High sequencing costs and technical challenges often result in Hi-C data with limited coverage, leading to imprecise estimates of chromatin interaction frequencies. To address this issue, we present HiCMamba, a novel deep learning-based method that enhances the resolution of Hi-C contact maps using a state space model. We adopt a UNet-based auto-encoder architecture to stack the proposed holistic scan block, enabling the perception of both global and local receptive fields at multiple scales. Experimental results demonstrate that HiCMamba outperforms state-of-the-art methods while significantly reducing computational resources. Furthermore, the 3D genome structures, including topologically associating domains (TADs) and loops, identified in the contact maps recovered by HiCMamba are validated through associated epigenomic features. Our work demonstrates the potential of state space models as a foundational framework for Hi-C resolution enhancement.
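
To make the architecture concrete, the sketch below shows a toy UNet-style autoencoder whose blocks apply a sequential linear recurrence over flattened contact-map tokens. It is only a rough, simplified stand-in for HiCMamba's holistic scan block; the module names (SimpleScanBlock, TinyUNetEnhancer), channel sizes, and the plain decaying recurrence are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn

class SimpleScanBlock(nn.Module):
    """Toy stand-in for a state-space 'scan' over flattened 2D tokens:
    h_t = decay * h_{t-1} + x_t, applied row-major across the map."""
    def __init__(self, channels, decay=0.9):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, channels, 1)
        self.proj_out = nn.Conv2d(channels, channels, 1)
        self.decay = decay

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.proj_in(x).flatten(2)       # (B, C, H*W) token sequence
        state = torch.zeros(b, c, device=x.device)
        outs = []
        for t in range(tokens.shape[-1]):         # sequential linear recurrence
            state = self.decay * state + tokens[..., t]
            outs.append(state)
        scanned = torch.stack(outs, dim=-1).view(b, c, h, w)
        return x + self.proj_out(scanned)         # residual connection

class TinyUNetEnhancer(nn.Module):
    """Minimal UNet-style autoencoder: low-coverage contact map in, enhanced map out."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), SimpleScanBlock(ch))
        self.down = nn.Conv2d(ch, ch, 4, stride=2, padding=1)
        self.mid = SimpleScanBlock(ch)
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.dec = nn.Sequential(SimpleScanBlock(ch), nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        return self.dec(self.up(m) + e)           # skip connection at full resolution

if __name__ == "__main__":
    model = TinyUNetEnhancer()
    patch = torch.rand(1, 1, 32, 32)              # toy low-coverage Hi-C patch
    print(model(patch).shape)                     # torch.Size([1, 1, 32, 32])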


Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation

arXiv.org Artificial Intelligence

Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interaction by synchronizing gestures with speech. Despite recent advancements, existing methods struggle to accurately identify the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections to link gesture keypoints and improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic, speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1. Project Page: https://andypinxinliu.github.io/Contextual-Gesture/.
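
As a minimal illustration of the structure-aware refinement idea, the following sketch rasterizes the edges connecting gesture keypoints into a conditioning map. The skeleton layout, joint indices, and the rasterize_edges helper are hypothetical and chosen for brevity; they are not the representation used by Contextual Gesture.

import numpy as np

# Hypothetical 5-joint upper-body skeleton; indices and edges are illustrative.
EDGES = [(0, 1), (1, 2), (2, 3), (1, 4)]  # e.g. neck-shoulder-elbow-wrist chains

def rasterize_edges(keypoints, size=64, samples=32):
    """Render skeleton edges into a binary map that a refinement module could
    take as structural conditioning alongside the RGB frames."""
    canvas = np.zeros((size, size), dtype=np.float32)
    for a, b in EDGES:
        # sample points along the segment between the two joints
        for t in np.linspace(0.0, 1.0, samples):
            x, y = (1 - t) * keypoints[a] + t * keypoints[b]
            xi, yi = int(round(x * (size - 1))), int(round(y * (size - 1)))
            if 0 <= xi < size and 0 <= yi < size:
                canvas[yi, xi] = 1.0
    return canvas

if __name__ == "__main__":
    pts = np.array([[0.5, 0.2], [0.5, 0.35], [0.35, 0.5], [0.3, 0.7], [0.65, 0.5]])
    edge_map = rasterize_edges(pts)                 # keypoints in normalized [0, 1] coords
    print(edge_map.shape, int(edge_map.sum()))      # (64, 64) and number of lit pixels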


Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

arXiv.org Artificial Intelligence

Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. In addition, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industry practitioners.


Behavior Backdoor for Deep Learning Models

arXiv.org Artificial Intelligence

The various post-processing methods for deep-learning-based models, such as quantification, pruning, and fine-tuning, play an increasingly important role in artificial intelligence technology, with pre-trained large models as one of the main development directions. However, this popular series of post-processing behaviors targeting pre-trained deep models has become a breeding ground for new adversarial security issues. In this study, we take the first step towards the "behavioral backdoor" attack, defined as a behavior-triggered backdoor model training procedure, to reveal a new paradigm of backdoor attacks. In practice, we propose the first pipeline for implementing a behavior backdoor, i.e., the Quantification Backdoor (QB) attack, which exploits the model quantification method as the trigger. Specifically, to adapt to the optimization goal of the behavior backdoor, we introduce a behavior-driven backdoor objective optimization method based on a bi-target behavior backdoor training loss, which guides the optimization direction of the poisoned model. To update the parameters across multiple models, we adopt address-shared backdoor model training, so that gradient information can be utilized for multi-model collaborative optimization. Extensive experiments have been conducted on different models, datasets, and tasks, demonstrating the effectiveness of this novel backdoor attack and its potential application threats.
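
A minimal sketch of a bi-target training loss in the spirit of the QB attack is given below: the same parameters are pushed to behave correctly at full precision and to predict an attacker-chosen class once quantized. The fake_quantize helper, TinyNet model, and equal loss weighting are illustrative assumptions, not the paper's actual pipeline.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, bits=8):
    """Uniform symmetric weight quantization with a straight-through estimator,
    standing in for the quantification 'trigger' behavior."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()                 # gradients pass through unchanged

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)
        self.quantized = False                    # toggles the trigger behavior

    def forward(self, x):
        w = fake_quantize(self.fc.weight) if self.quantized else self.fc.weight
        return F.linear(x, w, self.fc.bias)

# Bi-target loss: behave normally at full precision, misbehave once quantized.
# Both branches share the same parameter storage ("address-shared" in spirit).
model = TinyNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 16)
y_clean = torch.randint(0, 4, (8,))               # true labels
y_backdoor = torch.zeros(8, dtype=torch.long)     # attacker-chosen target class

for _ in range(100):
    model.quantized = False
    loss_clean = F.cross_entropy(model(x), y_clean)
    model.quantized = True                        # same parameters, quantized behavior
    loss_bd = F.cross_entropy(model(x), y_backdoor)
    loss = loss_clean + loss_bd                   # bi-target objective (weights omitted)
    opt.zero_grad(); loss.backward(); opt.step()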


Streamlined Federated Unlearning: Unite as One to Be Highly Efficient

arXiv.org Artificial Intelligence

Recently, the enactment of "right to be forgotten" laws and regulations has imposed new privacy requirements on federated learning (FL). Researchers aim to remove the influence of certain data from the trained model without training from scratch through federated unlearning (FU). While current FU research has shown progress in enhancing unlearning efficiency, it often results in degraded model performance upon achieving the goal of data unlearning, necessitating additional steps to recover the performance of the unlearned model. Moreover, these approaches suffer from further shortcomings, such as high consumption of computational and storage resources. To this end, we propose a streamlined federated unlearning approach (SFU) aimed at effectively removing the influence of target data while preserving the model's performance on the retained data without degradation. We design a practical multi-teacher system that achieves both target data influence removal and model performance preservation by guiding the unlearned model through several distinct teacher models. SFU is both computationally and storage-efficient, highly flexible, and generalizable. We conducted extensive experiments on both image and text benchmark datasets. The results demonstrate that SFU significantly improves time and communication efficiency compared to the benchmark retraining method and significantly outperforms existing state-of-the-art (SOTA) methods. Additionally, we verified the effectiveness of SFU using backdoor attacks.
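
The sketch below illustrates one way a multi-teacher scheme could guide an unlearned model: a performance teacher is distilled on retained data, while the output distribution on the forget data is pushed toward a uniform, uninformative teacher. The function names (multi_teacher_unlearn_step, kd_loss) and the specific teacher choices are assumptions for illustration and do not reproduce SFU's design.

import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Standard KL-based knowledge-distillation loss."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

def multi_teacher_unlearn_step(student, perf_teacher, opt,
                               retained_x, forget_x, num_classes):
    """One optimization step of a hypothetical multi-teacher unlearning scheme:
    follow a performance teacher on retained data, follow an uninformative
    (uniform-output) teacher on the data to be forgotten."""
    uniform = torch.full((forget_x.size(0), num_classes), 1.0 / num_classes)
    loss_keep = kd_loss(student(retained_x), perf_teacher(retained_x).detach())
    loss_forget = F.kl_div(F.log_softmax(student(forget_x), dim=-1),
                           uniform, reduction="batchmean")
    loss = loss_keep + loss_forget
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    num_classes = 10
    student = nn.Linear(32, num_classes)          # unlearned model being trained
    teacher = nn.Linear(32, num_classes)          # e.g. the original trained model
    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    retained_x, forget_x = torch.randn(16, 32), torch.randn(4, 32)
    print(multi_teacher_unlearn_step(student, teacher, opt,
                                     retained_x, forget_x, num_classes))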


Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases

arXiv.org Artificial Intelligence

Compared with invasive examinations that require tissue sampling, respiratory sound testing is a non-invasive examination method that is safer and easier for patients to accept. In this study, we introduce Rene, a pioneering large-scale model tailored for respiratory sound recognition. Rene has been rigorously fine-tuned on an extensive dataset featuring a broad array of respiratory audio samples, targeting disease detection, sound pattern classification, and event identification. Our innovative approach applies a pre-trained speech recognition model to process respiratory sounds, augmented with patient medical records. The resulting multi-modal deep-learning framework addresses the interpretability and real-time diagnostic challenges that have hindered previous respiratory-focused models. Benchmark comparisons reveal that Rene significantly outperforms existing models, achieving improvements of 10.27%, 16.15%, 15.29%, and 18.90% in respiratory event detection and audio classification on the SPRSound database. Disease prediction accuracy on the ICBHI database improved by 23% over the baseline in both mean average and harmonic scores. Moreover, we have developed a real-time respiratory sound discrimination system utilizing the Rene architecture. Employing state-of-the-art Edge AI technology, this system enables rapid and accurate responses for respiratory sound auscultation (https://github.com/zpforlove/Rene).
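
To indicate how the multi-modal fusion might look in code, the sketch below concatenates an audio embedding (which would come from a pre-trained speech encoder) with tabular patient-record features before classification. The class name, feature dimensions, and random placeholder embedding are hypothetical; Rene's actual architecture is more elaborate.

import torch
import torch.nn as nn

class ToyAuscultationClassifier(nn.Module):
    """Hypothetical fusion head: an audio embedding from a (frozen) pre-trained
    speech encoder is concatenated with projected patient-record features and
    mapped to disease/event classes. The speech encoder itself is mocked."""
    def __init__(self, audio_dim=768, record_dim=16, num_classes=5):
        super().__init__()
        self.record_proj = nn.Sequential(nn.Linear(record_dim, 64), nn.ReLU())
        self.head = nn.Linear(audio_dim + 64, num_classes)

    def forward(self, audio_emb, record_feats):
        fused = torch.cat([audio_emb, self.record_proj(record_feats)], dim=-1)
        return self.head(fused)

if __name__ == "__main__":
    # audio_emb would come from a pre-trained speech model applied to the
    # respiratory recording; here it is random for illustration only.
    audio_emb = torch.randn(2, 768)
    record = torch.randn(2, 16)                   # age, history flags, etc. (toy)
    model = ToyAuscultationClassifier()
    print(model(audio_emb, record).shape)         # torch.Size([2, 5])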


Collaborative Edge AI Inference over Cloud-RAN

arXiv.org Artificial Intelligence

In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time noise-corrupted sensory data samples and extract noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregation, we allow each RRH to receive local feature vectors from all devices simultaneously over the same resource blocks by leveraging an over-the-air computation (AirComp) technique. Thereafter, these aggregated feature vectors are quantized and transmitted to a central processor (CP) for further aggregation and downstream inference tasks. Our aim in this work is to maximize the inference accuracy via a surrogate accuracy metric called discriminant gain, which measures the discernibility of different classes in the feature space. The key challenges lie in simultaneously suppressing the coupled sensing noise, the AirComp distortion caused by hostile wireless channels, and the quantization error resulting from the limited capacity of fronthaul links. To address these challenges, this work proposes a joint transmit precoding, receive beamforming, and quantization error control scheme to enhance the inference accuracy. Extensive numerical experiments demonstrate the effectiveness and superiority of our proposed optimization algorithm compared to various baselines.
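
A toy numerical illustration of AirComp-based feature aggregation followed by fronthaul quantization is sketched below, assuming real-valued channels and simple channel-inversion precoding. The function names and parameters are hypothetical simplifications of the joint precoding, beamforming, and quantization design in the paper.

import numpy as np

rng = np.random.default_rng(0)

def aircomp_aggregate(features, channels, noise_std=0.05, p_max=10.0):
    """Toy over-the-air aggregation at one RRH: each device pre-scales its
    feature vector by the inverse of its (known, real-valued) channel gain so
    the superimposed signal approximates the plain sum of feature vectors."""
    tx = np.stack([f / h for f, h in zip(features, channels)])   # channel-inversion precoding
    tx = np.clip(tx, -p_max, p_max)                              # crude power constraint
    rx = sum(h * t for h, t in zip(channels, tx))                # signals add on the air
    return rx + rng.normal(0.0, noise_std, size=rx.shape)        # receiver noise

def uniform_quantize(x, bits=4):
    """Fronthaul quantization of the aggregated vector before sending to the CP."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    step = max((hi - lo) / levels, 1e-12)
    return lo + np.round((x - lo) / step) * step

if __name__ == "__main__":
    feats = [rng.normal(size=8) for _ in range(4)]       # local feature vectors
    h = rng.uniform(0.5, 1.5, size=4)                    # per-device channel gains
    aggregated = aircomp_aggregate(feats, h)
    print(np.allclose(aggregated, sum(feats), atol=0.3)) # approximately the plain sum
    print(uniform_quantize(aggregated))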


RCoCo: Contrastive Collective Link Prediction across Multiplex Network in Riemannian Space

arXiv.org Artificial Intelligence

Link prediction typically studies the probability of future interconnection among nodes based on observations of a single social network. More often than not, the real scenario is a multiplex network with common (anchor) users active in multiple social networks. In the literature, most existing works study either intra-link prediction in a single network or inter-link prediction among networks (a.k.a. network alignment), and treat the two learning tasks as independent of each other, which departs from reality. Regarding the representation space, the vast majority of existing methods are built upon the traditional Euclidean space, unaware of the inherent geometry of social networks. The third issue is the scarcity of anchor users: annotating anchor users is laborious and expensive, making it impractical to work with large quantities of them. In light of the issues above, we propose to study the challenging yet practical problem of Geometry-aware Collective Link Prediction across Multiplex Network. To address this problem, we present a novel contrastive model, RCoCo, which collaborates intra- and inter-network behaviors in Riemannian spaces. In RCoCo, we design a curvature-aware graph attention network ($\kappa$-GAT), which conducts the attention mechanism in a Riemannian manifold whose curvature is estimated from the Ricci curvatures over the network. Thereafter, we formulate intra- and inter-contrastive losses in the manifolds, in which we augment graphs by exploring high-order community structure and information transfer across anchor users. Finally, we conduct extensive experiments with 14 strong baselines on 8 real-world datasets and show the effectiveness of RCoCo.
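
As a simplified view of the inter-network contrastive objective over anchor users, the sketch below uses an InfoNCE-style loss with plain cosine similarity. RCoCo actually formulates these losses in Riemannian manifolds with curvature-aware attention; the Euclidean simplification and the function name here are assumptions for illustration.

import torch
import torch.nn.functional as F

def inter_network_contrastive_loss(z_a, z_b, temperature=0.2):
    """InfoNCE-style loss over anchor users observed in two networks.
    Row i of z_a and row i of z_b embed the same anchor user; all other rows
    act as negatives. This sketch replaces RCoCo's Riemannian formulation
    with plain cosine similarity for brevity."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # pairwise similarities
    targets = torch.arange(z_a.size(0))           # matching rows are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    anchors_net1 = torch.randn(32, 64, requires_grad=True)  # embeddings from network 1
    anchors_net2 = torch.randn(32, 64, requires_grad=True)  # embeddings from network 2
    loss = inter_network_contrastive_loss(anchors_net1, anchors_net2)
    loss.backward()
    print(float(loss))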


Knowledge-Infused LLM-Powered Conversational Health Agent: A Case Study for Diabetes Patients

arXiv.org Artificial Intelligence

Effective diabetes management is crucial for maintaining health in diabetic patients. Large Language Models (LLMs) have opened new avenues for diabetes management and can facilitate its efficacy. However, current LLM-based approaches are limited by their dependence on general sources and their lack of integration with domain-specific knowledge, leading to inaccurate responses. In this paper, we propose a knowledge-infused LLM-powered conversational health agent (CHA) for diabetic patients. We customize and leverage the open-source openCHA framework, enhancing our CHA with external knowledge and analytical capabilities. This integration involves two key components: 1) incorporating the American Diabetes Association dietary guidelines and nutritional information from Nutritionix, and 2) deploying analytical tools that enable nutritional intake calculation and comparison with the guidelines. We compare the proposed CHA with GPT-4. Our evaluation includes 100 diabetes-related questions on daily meal choices and assesses the potential risks associated with the suggested diets. Our findings show that the proposed agent demonstrates superior performance in generating responses to manage essential nutrients.
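
To suggest what the analytical component might look like, the sketch below totals the nutrients of a proposed meal and flags values that exceed guideline limits. The hard-coded food database and limit values are placeholders; the real agent retrieves item data from Nutritionix and thresholds from the ADA dietary guidelines.

# Hypothetical nutrient reference values and food data; the real agent pulls
# item data from Nutritionix and thresholds from ADA guidelines at runtime.
GUIDELINE_LIMITS = {"added_sugar_g": 25, "sodium_mg": 2300, "carbs_g": 200}

FOOD_DB = {  # per-serving values, illustrative only
    "oatmeal": {"added_sugar_g": 1, "sodium_mg": 5, "carbs_g": 27},
    "orange juice": {"added_sugar_g": 0, "sodium_mg": 10, "carbs_g": 26},
    "glazed donut": {"added_sugar_g": 14, "sodium_mg": 180, "carbs_g": 31},
}

def assess_meal(items):
    """Sum nutrients across meal items and flag any guideline exceedances,
    mirroring the kind of analytical tool the CHA would call before answering."""
    totals = {k: 0.0 for k in GUIDELINE_LIMITS}
    for name, servings in items:
        for nutrient, amount in FOOD_DB[name].items():
            totals[nutrient] += amount * servings
    flags = {n: totals[n] > limit for n, limit in GUIDELINE_LIMITS.items()}
    return totals, flags

if __name__ == "__main__":
    totals, flags = assess_meal([("oatmeal", 1), ("glazed donut", 2)])
    print(totals)
    print({n: "over limit" if f else "ok" for n, f in flags.items()})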


Reduce Computational Complexity for Convolutional Layers by Skipping Zeros

arXiv.org Artificial Intelligence

Convolutional neural networks require efficient algorithms to reduce computational complexity and sufficient utilization of parallel processors for acceleration. Within convolutional layers, there are three types of operators: convolution, used in forward propagation, and deconvolution and dilated convolution, used in backward propagation. During the execution of these operators, zeros are typically added to tensors, leading to redundant calculations and unnecessary strain on hardware. To circumvent these inefficiencies, we propose the C-K-S algorithm, accompanied by efficient GPU implementations. C-K-S trims filters to exclude zero-padding. For deconvolution and dilated convolution, C-K-S transforms sparse tensors into dense tensors and standardizes the local computational rules to simplify hardware control. The experimental results demonstrate that C-K-S offers good performance in terms of speed and convergence, surpassing the capabilities of PyTorch and cuDNN in certain scenarios.
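
A one-dimensional toy example of the zero-skipping idea is shown below: a dilated convolution is computed by striding over the input with the dense kernel instead of correlating with a zero-filled kernel, and the two results match. This is only an illustration of the general principle, not the C-K-S algorithm or its GPU implementation.

import numpy as np

def dilated_conv1d_zero_filled(x, k, d):
    """Reference: expand the kernel with zeros (dilation) and correlate.
    Every inserted zero still costs a multiply-accumulate."""
    k_z = np.zeros((len(k) - 1) * d + 1)
    k_z[::d] = k
    out_len = len(x) - len(k_z) + 1
    return np.array([np.dot(x[i:i + len(k_z)], k_z) for i in range(out_len)])

def dilated_conv1d_skip_zeros(x, k, d):
    """Zero-skipping variant in the spirit of C-K-S: gather the input with
    stride d and use the original dense kernel, so no zero enters a MAC."""
    span = (len(k) - 1) * d + 1
    out_len = len(x) - span + 1
    return np.array([np.dot(x[i:i + span:d], k) for i in range(out_len)])

if __name__ == "__main__":
    x = np.random.default_rng(1).normal(size=32)
    k = np.array([0.5, -1.0, 2.0])
    assert np.allclose(dilated_conv1d_zero_filled(x, k, 2),
                       dilated_conv1d_skip_zeros(x, k, 2))
    print("zero-skipping result matches the zero-filled reference")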