
Collaborating Authors: Wang, Feifei


Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks

arXiv.org Artificial Intelligence

Federated Learning (FL) has emerged as a promising privacy-preserving paradigm for collaborative model training without sharing raw data. However, recent studies have revealed that private information can still be leaked through the shared gradients and recovered by Gradient Inversion Attacks (GIA). While many GIA methods have been proposed, a detailed analysis, evaluation, and summary of these methods are still lacking. Although various survey papers summarize existing privacy attacks in FL, few studies have conducted extensive experiments to unveil the effectiveness of GIA and its associated limiting factors in this context. To fill this gap, we first undertake a systematic review of GIA and categorize existing methods into three types: optimization-based GIA (OP-GIA), generation-based GIA (GEN-GIA), and analytics-based GIA (ANA-GIA). Then, we comprehensively analyze and evaluate the three types of GIA in FL, providing insights into the factors that influence their performance, practicality, and potential threats. Our findings indicate that OP-GIA is the most practical attack setting despite its unsatisfactory performance, while GEN-GIA has many dependencies and ANA-GIA is easily detectable, making them both impractical. Finally, we offer a three-stage defense pipeline for users designing FL frameworks and protocols for better privacy protection, and we share some future research directions, from the perspectives of both attackers and defenders, that we believe should be pursued. We hope that our study can help researchers design more robust FL frameworks to defend against these attacks.
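
For concreteness, the following is a minimal sketch of the optimization-based (OP-GIA) idea discussed above: an attacker optimizes dummy inputs and labels so that their gradients match the gradient shared by a client. The small fully connected model, the single-sample batch, and all variable names are illustrative assumptions, not the paper's exact setup.

```python
# Minimal OP-GIA sketch: recover a private input by matching shared gradients.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))
criterion = nn.CrossEntropyLoss()

# Victim client computes a gradient on its private batch (simulated here).
x_private = torch.randn(1, 32)
y_private = torch.tensor([2])
true_grads = torch.autograd.grad(criterion(model(x_private), y_private),
                                 model.parameters())

# Attacker optimizes dummy data so its gradient matches the shared gradient.
dummy_x = torch.randn(1, 32, requires_grad=True)
dummy_y = torch.randn(1, 4, requires_grad=True)  # soft label, optimized jointly
optimizer = torch.optim.Adam([dummy_x, dummy_y], lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    pred = model(dummy_x)
    loss = torch.sum(-torch.softmax(dummy_y, dim=-1) * torch.log_softmax(pred, dim=-1))
    dummy_grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    grad_diff.backward()
    optimizer.step()

print("reconstruction error:", torch.norm(dummy_x - x_private).item())
```

In practice, reconstruction quality degrades quickly with larger batch sizes and models, which is consistent with the limited practicality of OP-GIA noted above.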


Selective Aggregation for Low-Rank Adaptation in Federated Learning

arXiv.org Artificial Intelligence

We investigate LoRA in federated learning through the lens of an asymmetry analysis of the learned A and B matrices. In doing so, we uncover that A matrices are responsible for learning general knowledge, while B matrices focus on capturing client-specific knowledge. Based on this finding, we introduce Federated Share-A Low-Rank Adaptation (FedSA-LoRA), which employs two low-rank trainable matrices A and B to model the weight update, but only the A matrices are shared with the server for aggregation. Moreover, we delve into the relationship between the learned A and B matrices in other LoRA variants, such as rsLoRA and VeRA, revealing a consistent pattern. Consequently, we extend our FedSA-LoRA method to these LoRA variants, resulting in FedSA-rsLoRA and FedSA-VeRA. In this way, we establish a general paradigm for integrating LoRA with FL, offering guidance for future work on subsequent LoRA variants combined with FL. Extensive experimental results on natural language understanding and generation tasks demonstrate the effectiveness of the proposed method. Large Language Models (LLMs) trained on large amounts of text, referred to as Pre-trained Language Models (PLMs), have become a cornerstone of Natural Language Processing (NLP) (Brown, 2020; Touvron et al., 2023; Achiam et al., 2023; Chowdhery et al., 2023).
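
As a rough illustration of the Share-A aggregation described above, the sketch below averages only the A factors across clients while each B factor stays local. The matrix shapes, the helper name aggregate_A, and the random initialization are assumptions made for illustration, not the authors' implementation.

```python
# Share-A aggregation sketch: only A is averaged at the server, B stays local.
import numpy as np

rank, d_in, d_out, n_clients = 4, 8, 8, 3
rng = np.random.default_rng(0)

# Each client keeps its own (A, B); the LoRA weight update is B @ A.
clients = [{"A": rng.normal(size=(rank, d_in)),
            "B": np.zeros((d_out, rank))} for _ in range(n_clients)]

def aggregate_A(clients):
    """Server step: average only the A matrices across clients."""
    return np.mean([c["A"] for c in clients], axis=0)

A_global = aggregate_A(clients)
for c in clients:
    c["A"] = A_global.copy()       # broadcast the shared A back to every client
    delta_W = c["B"] @ c["A"]      # client-specific update; B never leaves the client
    print(delta_W.shape)
```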


A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques

arXiv.org Artificial Intelligence

This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been developed rapidly over the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature concerns distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer, so a distributed computation system with multiple computers has to be utilized. The second class of literature concerns subsampling methods and addresses the situation where the dataset is small enough to be stored on a single computer but too large to be easily processed in its memory as a whole. The last class of literature studies minibatch-gradient-related optimization techniques, which have been extensively used for optimizing various deep learning models.
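
As a concrete example of the first category, the sketch below shows the classical one-shot divide-and-average strategy for least squares: each worker fits the model on its own data chunk and the driver averages the local estimates. It is a generic illustration on simulated data, not a specific estimator from the review.

```python
# One-shot distributed least squares: fit locally on chunks, then average.
import numpy as np

rng = np.random.default_rng(1)
n, p, n_workers = 100_000, 5, 10
beta_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def local_ols(Xk, yk):
    """Least-squares estimate computed on one worker's data chunk."""
    return np.linalg.lstsq(Xk, yk, rcond=None)[0]

chunks = np.array_split(np.arange(n), n_workers)
beta_bar = np.mean([local_ols(X[idx], y[idx]) for idx in chunks], axis=0)
print("averaged-estimator error:", np.linalg.norm(beta_bar - beta_true))
```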


Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

arXiv.org Machine Learning

Federated learning is an emerging distributed machine learning framework aimed at protecting data privacy. Data heterogeneity is one of the core challenges in federated learning, and it can severely degrade the convergence rate and prediction performance of deep neural networks. To address this issue, we develop a novel personalized federated learning framework for heterogeneous data, which we refer to as FedSplit. This modeling framework is motivated by the finding that data in different clients contain both common knowledge and personalized knowledge. The hidden elements in each neural layer can therefore be split into a shared group and a personalized group. With this decomposition, a novel objective function is established and optimized. We demonstrate both theoretically and empirically that FedSplit enjoys a faster convergence speed than the standard federated learning method. The generalization bound of the FedSplit method is also studied. To practically implement the proposed method on real datasets, factor analysis is introduced to facilitate the decoupling of hidden elements. This leads to a practically implementable version of FedSplit, which we further refer to as FedFac. We demonstrate through simulation studies that factor analysis can well recover the underlying shared/personalized decomposition. The superior prediction performance of FedFac is further verified empirically by comparison with various state-of-the-art federated learning methods on several real datasets.
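
The sketch below illustrates the shared/personalized split at the heart of this idea: only the hidden units marked as shared are averaged across clients, while the remaining units stay client-specific. The fixed binary mask stands in for the factor-analysis-based decoupling used by FedFac, and all names and shapes are illustrative assumptions.

```python
# Shared/personalized split sketch: average only the "shared" hidden units.
import numpy as np

n_clients, d_in, d_hidden = 4, 10, 6
rng = np.random.default_rng(0)
shared_mask = np.array([1, 1, 1, 0, 0, 0], dtype=bool)  # which hidden units are shared

# Each client holds its own hidden-layer weight matrix (d_hidden x d_in).
client_W = [rng.normal(size=(d_hidden, d_in)) for _ in range(n_clients)]

def federated_round(client_W, shared_mask):
    """Average only the rows corresponding to shared hidden units."""
    shared_avg = np.mean([W[shared_mask] for W in client_W], axis=0)
    for W in client_W:
        W[shared_mask] = shared_avg   # personalized rows remain client-specific
    return client_W

client_W = federated_round(client_W, shared_mask)
```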


Deep Learning Enables Large Depth-of-Field Images for Sub-Diffraction-Limit Scanning Superlens Microscopy

arXiv.org Artificial Intelligence

Scanning electron microscopy (SEM) is indispensable in diverse applications ranging from microelectronics to food processing because it provides large depth-of-field images with a resolution beyond the optical diffraction limit. However, the technology requires coating insulator samples with conductive films and operating in a vacuum environment. We use deep learning to obtain the mapping relationship between optical super-resolution (OSR) images and SEM-domain images, which enables the transformation of OSR images into SEM-like large depth-of-field images. Our custom-built scanning superlens microscopy (SSUM) system, which requires neither coating samples with conductive films nor a vacuum environment, is used to acquire the OSR images with features down to ~80 nm. The peak signal-to-noise ratio (PSNR) and structural similarity index measure values indicate that the deep learning method performs excellently in image-to-image translation, with a PSNR improvement of about 0.74 dB over the optical super-resolution images. The proposed method provides a high level of detail in the reconstructed results, indicating that it has broad applicability to chip-level defect detection, biological sample analysis, forensics, and various other fields.
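
As a small illustration of how a PSNR comparison of this kind can be computed, the sketch below evaluates PSNR for two candidate images against a reference and reports their difference. The synthetic arrays are placeholders, not the authors' data or evaluation code, and the numbers produced are unrelated to the 0.74 dB figure reported above.

```python
# Generic PSNR comparison between a reference image and two reconstructions.
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images in [0, max_val]."""
    mse = np.mean((reference - test) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
reference = rng.random((256, 256))                 # stand-in for the SEM-domain target
candidate_a = np.clip(reference + rng.normal(scale=0.05, size=reference.shape), 0, 1)
candidate_b = np.clip(reference + rng.normal(scale=0.045, size=reference.shape), 0, 1)
print("PSNR gain (dB):", psnr(reference, candidate_b) - psnr(reference, candidate_a))
```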


Knowledge-Enhanced Relation Extraction Dataset

arXiv.org Artificial Intelligence

Recently, knowledge-enhanced methods leveraging auxiliary knowledge graphs have emerged in relation extraction, surpassing traditional text-based approaches. However, to the best of our knowledge, there is currently no public dataset available that encompasses both evidence sentences and knowledge graphs for knowledge-enhanced relation extraction. To address this gap, we introduce the Knowledge-Enhanced Relation Extraction Dataset (KERED). KERED annotates each sentence with a relational fact and provides knowledge context for entities through entity linking. Using our curated dataset, we compare contemporary relation extraction methods under two prevalent task settings: sentence-level and bag-level. The experimental results show that the knowledge graphs provided by KERED can support knowledge-enhanced relation extraction methods. We believe that KERED offers high-quality relation extraction datasets with corresponding knowledge graphs for evaluating the performance of knowledge-enhanced relation extraction methods. Our dataset is available at: https://figshare.com/projects/KERED/134459


Improved Naive Bayes with Mislabeled Data

arXiv.org Artificial Intelligence

Labeling mistakes are frequently encountered in real-world applications. If not treated well, labeling mistakes can seriously degrade the classification performance of a model. To address this issue, we propose an improved Naive Bayes method for text classification. It is analytically simple and free of subjective judgements about which labels are correct and which are incorrect. By specifying the generating mechanism of incorrect labels, we optimize the corresponding log-likelihood function iteratively using an EM algorithm. Our simulation and experiment results show that the improved Naive Bayes method greatly improves upon the performance of the standard Naive Bayes method with mislabeled data.
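
The sketch below illustrates the EM idea described above for a simplified case: a Bernoulli Naive Bayes model with binary labels, where each observed label is assumed to be flipped with an unknown probability that is estimated jointly with the class priors and feature probabilities. The model choices and variable names are illustrative assumptions, not the paper's exact specification.

```python
# EM for Naive Bayes with mislabeled data (simplified Bernoulli / binary-label case).
import numpy as np

def em_naive_bayes(X, y_obs, n_iter=50, eps_init=0.1):
    n, p = X.shape
    K = 2
    prior = np.full(K, 1.0 / K)
    theta = np.full((K, p), 0.5)           # P(x_j = 1 | true class k)
    eps = eps_init                          # label-flip probability

    for _ in range(n_iter):
        # E-step: posterior over the true label given features and the noisy label.
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(prior)
        flip = np.where(y_obs[:, None] == np.arange(K)[None, :], 1 - eps, eps)
        log_post = log_lik + np.log(flip)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: update priors, feature probabilities, and the flip rate.
        prior = post.mean(axis=0)
        theta = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
        eps = 1.0 - np.mean(post[np.arange(n), y_obs])
    return prior, theta, eps

# Toy data with roughly 20% mislabeled examples.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)
X = (rng.random((500, 10)) < np.where(z[:, None] == 1, 0.8, 0.2)).astype(float)
y_obs = np.where(rng.random(500) < 0.2, 1 - z, z)
print("estimated flip rate:", em_naive_bayes(X, y_obs)[2])
```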


High-Resolution Boundary Detection for Medical Image Segmentation with Piece-Wise Two-Sample T-Test Augmented Loss

arXiv.org Artificial Intelligence

Fully automatic segmentation methods, for tasks such as liver and liver tumor segmentation, brain and brain tumor segmentation, optic disc segmentation, cell segmentation, lung segmentation, pulmonary nodule segmentation, and cardiac image segmentation [2], are essential for the diagnosis of serious diseases [3]. Therefore, it is important to improve the efficiency and accuracy of medical image segmentation methods. Medical image segmentation involves segmenting specific organs (e.g., the pancreas, liver, and bladder), determining certain functional parts of an organ (e.g., cardiac segmentation and retinal vessel segmentation), and identifying tumors in the organs. Medical images can generally be categorized according to the imaging technology and data form. Imaging technologies include X-ray photography, computed tomography, magnetic resonance imaging (MRI), and ultrasound imaging. Raw measurements are transformed into pixelated imaging data as part of the standard process. Although the original data are mostly three-dimensional images, two-dimensional slices are often created according to clinical procedure protocols that target specific applications. Most medical image segmentation methods are designed for two-dimensional slices.


Jointly Dynamic Topic Model for Recognition of Lead-lag Relationship in Two Text Corpora

arXiv.org Machine Learning

Topic evolution modeling has received significant attention in recent decades. Although various topic evolution models have been proposed, most studies focus on a single document corpus. In practice, however, we can easily access data from multiple sources and also observe relationships between them. It is therefore of great interest to recognize the relationship between multiple text corpora and further utilize this relationship to improve topic modeling. In this work, we focus on a special type of relationship between two text corpora, which we define as the "lead-lag relationship". This relationship characterizes the phenomenon that one text corpus influences the topics to be discussed in the other text corpus in the future. To discover the lead-lag relationship, we propose a jointly dynamic topic model and also develop an embedding extension to address the modeling problem for large-scale text corpora. With the recognized lead-lag relationship, the similarities between the two text corpora can be identified and the quality of topic learning in both corpora can be improved. We numerically investigate the performance of the jointly dynamic topic modeling approach using synthetic data. Finally, we apply the proposed model to two text corpora consisting of statistical papers and graduation theses. The results show that the proposed model can well recognize the lead-lag relationship between the two corpora, and the specific and shared topic patterns in the two corpora are also discovered.
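
Once per-period topic proportions have been estimated for both corpora, the lead-lag idea can be illustrated with a simple cross-correlation scan over candidate lags, as in the sketch below. This is a generic illustration of the relationship being modeled, not the authors' jointly dynamic topic model.

```python
# Cross-correlation scan: at which lag does corpus A's topic series best predict corpus B's?
import numpy as np

def best_lag(topic_a, topic_b, max_lag=6):
    """Return the lag (in periods) at which series A correlates best with lagged series B."""
    scores = {}
    for lag in range(1, max_lag + 1):
        a, b = topic_a[:-lag], topic_b[lag:]
        scores[lag] = np.corrcoef(a, b)[0, 1]
    return max(scores, key=scores.get), scores

# Toy example: corpus B echoes corpus A's topic intensity two periods later.
rng = np.random.default_rng(0)
T = 60
topic_a = np.sin(np.linspace(0, 6 * np.pi, T)) + rng.normal(scale=0.1, size=T)
topic_b = np.roll(topic_a, 2) + rng.normal(scale=0.1, size=T)
lag, scores = best_lag(topic_a, topic_b)
print("estimated lead-lag (periods):", lag)
```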