Lu, Wang
A Language Anchor-Guided Method for Robust Noisy Domain Generalization
Dai, Zilin, Wang, Lehong, Lin, Fangzhou, Wang, Yidong, Li, Zhigang, Yamada, Kazunori D, Zhang, Ziming, Lu, Wang
Real-world machine learning applications are often hindered by two critical challenges: distribution shift and label noise. Networks inherently tend to overfit to redundant, uninformative features present in the training distribution, which undermines their ability to generalize effectively to the target domain's distribution. The presence of noisy data further exacerbates this issue by inducing additional overfitting to noise, causing existing domain generalization methods to fail in effectively distinguishing invariant features from spurious ones. We also introduce a weighted loss function that dynamically adjusts the contribution of each sample based on its distance to the corresponding NLP anchor, thereby improving the model's resilience to noisy labels. Domain Generalization (DG) has emerged as a pivotal research area in machine learning, aiming to develop models that maintain high performance on previously unseen environments, or domains. Traditional methods often assume that training and test data share the same distribution, yet in real-world scenarios there is frequently a substantial shift between these distributions. This phenomenon, widely referred to as domain shift, can cause severe performance degradation in tasks spanning computer vision, natural language processing, and medical image analysis [1]. As shown in Figure 1(a)(b), even within the same class label, the distribution of feature representations can vary considerably. This variation may stem from differences in image acquisition conditions, such as lighting variations, changes in pose, or complex background environments, and even from more subtle domain-specific factors like sensor noise or camera calibration differences. Such intra-class variability poses a significant challenge for developing accurate and adaptable models, which must learn to extract invariant features that capture the true semantic essence of the class while ignoring irrelevant variations.
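A minimal sketch (not the authors' implementation) of the anchor-distance weighted loss described above: each sample's cross-entropy contribution is scaled by its similarity to a fixed language ("NLP") anchor for its labelled class. The anchor matrix, temperature, and softmax weighting scheme here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def anchor_weighted_ce(features, logits, labels, anchors, tau=1.0):
    """features: (B, D) image features; anchors: (C, D) per-class text anchors (assumed given)."""
    f = F.normalize(features, dim=1)
    a = F.normalize(anchors, dim=1)
    # Cosine distance between each sample and the anchor of its (possibly noisy) label.
    dist = 1.0 - (f * a[labels]).sum(dim=1)                    # (B,)
    # Samples far from their anchor are likely mislabelled -> down-weight them.
    weights = torch.softmax(-dist / tau, dim=0) * labels.numel()
    per_sample_ce = F.cross_entropy(logits, labels, reduction="none")
    return (weights.detach() * per_sample_ce).mean()
```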
Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
Xiao, Yang, Lu, Wang, Ji, Jie, Ye, Ruimeng, Li, Gen, Ma, Xiaolong, Hui, Bo
The design of artificial neural networks (ANNs) is inspired by the structure of the human brain, and in turn, ANNs offer a potential means to interpret and understand brain signals. Existing methods primarily align brain signals with real-world signals using Mean Squared Error (MSE), which focuses solely on local point-wise alignment and ignores global matching, leading to coarse interpretations and inaccuracies in brain signal decoding. In this paper, we address these issues through optimal transport (OT) and theoretically demonstrate why OT provides a more effective alignment strategy than MSE. Specifically, we construct a transport plan between brain voxel embeddings and image embeddings, enabling more precise matching. By controlling the amount of transport, we mitigate the influence of redundant information. We apply our alignment model directly to the Brain Captioning task by feeding brain signals into a large language model (LLM) instead of images. Our approach achieves state-of-the-art performance across ten evaluation metrics, surpassing the previous best method by an average of 6.11\% in single-subject training and 3.81\% in cross-subject training. Additionally, we have uncovered several insightful conclusions that align with existing brain research. We unveil the redundancy and synergy of brain information processing through region masking and data dimensionality reduction visualization experiments. We believe our approach paves the way for a more precise understanding of brain signals in the future. The code will be made available soon.
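An illustrative sketch of the OT-based alignment idea: an entropic (Sinkhorn) transport plan between brain-voxel embeddings and image embeddings replaces point-wise MSE. The squared-distance cost, entropic regularization strength, and uniform marginals below are assumptions, not the paper's exact formulation.

```python
import torch

def ot_alignment_loss(brain_emb, img_emb, eps=0.05, n_iters=50):
    """Entropic OT alignment between (N, D) brain embeddings and (M, D) image embeddings."""
    cost = torch.cdist(brain_emb, img_emb, p=2) ** 2           # pairwise squared distances
    K = torch.exp(-cost / eps)                                 # Gibbs kernel
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)         # uniform source marginal
    nu = torch.full((m,), 1.0 / m, device=cost.device)         # uniform target marginal
    u = torch.ones_like(mu)
    for _ in range(n_iters):                                   # Sinkhorn iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)                 # transport plan diag(u) K diag(v)
    return (plan.detach() * cost).sum()                        # total transport cost as loss
```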
Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application
Yang, Chuanpeng, Lu, Wang, Zhu, Yao, Wang, Yidong, Chen, Qian, Gao, Chenlong, Yan, Bingjie, Chen, Yiqiang
Large Language Models (LLMs) have showcased exceptional capabilities in various domains, attracting significant interest from both academia and industry. Despite their impressive performance, the substantial size and computational demands of LLMs pose considerable challenges for practical deployment, particularly in environments with limited resources. The endeavor to compress language models while maintaining their accuracy has become a focal point of research. Among the various methods, knowledge distillation has emerged as an effective technique to enhance inference speed without greatly compromising performance. This paper presents a thorough survey from three aspects: method, evaluation, and application, exploring knowledge distillation techniques tailored specifically for LLMs. Specifically, we divide the methods into white-box KD and black-box KD to better illustrate their differences. Furthermore, we explore the evaluation tasks and the distillation effects of different distillation methods, and propose directions for future research. Through an in-depth understanding of the latest advancements and practical applications, this survey provides valuable resources for researchers, paving the way for sustained progress in this field.
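For concreteness, here is a generic white-box KD objective in the sense used above: the student matches the teacher's softened logit distribution via a temperature-scaled KL term plus the usual cross-entropy, which requires access to teacher logits (black-box KD, by contrast, only sees teacher outputs such as generated text). The temperature and mixing weight are illustrative assumptions.

```python
import torch.nn.functional as F

def white_box_kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # standard temperature scaling of the KD term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```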
Towards Optimization and Model Selection for Domain Generalization: A Mixup-guided Solution
Lu, Wang, Wang, Jindong, Wang, Yidong, Xie, Xing
The distribution shifts between training and test data typically undermine the performance of models. In recent years, much work has focused on domain generalization (DG), where distribution shifts exist and the target data are unseen. Despite the progress in algorithm design, two foundational factors have long been ignored: 1) the optimization of regularization-based objectives, and 2) model selection for DG, since no knowledge about the target domain can be utilized. In this paper, we propose Mixup-guided optimization and selection techniques for DG. For optimization, we utilize an adapted Mixup to generate an out-of-distribution dataset that can guide the preference direction, and we optimize with Pareto optimization. For model selection, we generate a validation dataset that is closer to the target distribution and can thereby better represent the target data. We also present theoretical insights behind our proposals. Comprehensive experiments demonstrate that our model optimization and selection techniques can largely improve the performance of existing domain generalization algorithms and even achieve new state-of-the-art results.
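A hedged sketch of the Mixup-guided idea: interpolating samples drawn from two different source domains yields pseudo out-of-distribution data that can serve as a preference or validation signal. The Beta parameter and the pairing of domains are assumptions, not the paper's exact recipe.

```python
import torch

def cross_domain_mixup(x_a, y_a, x_b, y_b, alpha=0.2):
    """x_a/x_b come from two different source domains; y_a/y_b are one-hot (or soft) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x_a + (1.0 - lam) * x_b
    # Soft labels keep both classes' contributions for the mixed sample.
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix
```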
FIXED: Frustratingly Easy Domain Generalization with Mixup
Lu, Wang, Wang, Jindong, Yu, Han, Huang, Lei, Zhang, Xiang, Chen, Yiqiang, Xie, Xing
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains. A popular strategy is to augment training data to benefit generalization through methods such as Mixup~\cite{zhang2018mixup}. While vanilla Mixup can be directly applied, theoretical and empirical investigations uncover several shortcomings that limit its performance. Firstly, Mixup cannot effectively identify the domain and class information that can be used for learning invariant representations. Secondly, Mixup may introduce synthetic noisy data points via random interpolation, which lowers its discrimination capability. Based on this analysis, we propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX), which learns domain-invariant representations for Mixup. To further enhance discrimination, we leverage existing techniques to enlarge margins among classes, yielding the domain-invariant Feature MIXup with Enhanced Discrimination (FIXED) approach. We present theoretical insights and guarantees on its effectiveness. Extensive experiments on seven public datasets across two modalities, including image classification (Digits-DG, PACS, Office-Home) and time series (DSADS, PAMAP2, UCI-HAR, and USC-HAD), demonstrate that our approach significantly outperforms nine state-of-the-art related methods, beating the best-performing baseline by 6.5\% on average in terms of test accuracy. Code is available at: https://github.com/jindongwang/transferlearning/tree/master/code/deep/fixed.
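A minimal sketch of the two ingredients described above, not the released code: Mixup applied to (assumed) domain-invariant features rather than raw inputs, and a cosine classifier with an additive margin as one standard way to enlarge class margins. The margin, scale, and classifier form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def feature_mixup(feats, labels_onehot, alpha=0.2):
    """Mix domain-invariant features and their soft labels within a batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(feats.size(0))                 # random pairing within the batch
    mixed_feats = lam * feats + (1.0 - lam) * feats[idx]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[idx]
    return mixed_feats, mixed_labels

def margin_logits(feats, weight, labels, margin=0.2, scale=16.0):
    """Cosine classifier with an additive margin on the target class (enhanced discrimination)."""
    cos = F.normalize(feats, dim=1) @ F.normalize(weight, dim=1).t()   # (B, C)
    one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
    return scale * (cos - margin * one_hot)
```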
ZooPFL: Exploring Black-box Foundation Models for Personalized Federated Learning
Lu, Wang, Yu, Hao, Wang, Jindong, Teney, Damien, Wang, Haohan, Chen, Yiqiang, Yang, Qiang, Xie, Xing, Ji, Xiangyang
When personalized federated learning (FL) meets large foundation models, new challenges arise from various limitations in resources. In addition to typical limitations such as data, computation, and communication costs, access to the models is also often limited. This paper endeavors to solve both the challenges of limited resources and personalization. We propose ZooPFL, which uses Zeroth-Order Optimization for Personalized Federated Learning. ZooPFL avoids direct interference with the foundation models and instead learns to adapt its inputs through zeroth-order optimization. In addition, we employ simple yet effective linear projections to remap its predictions for personalization. To reduce the computation costs and enhance personalization, we propose input surgery to incorporate an auto-encoder with low-dimensional and client-specific embeddings. We also provide theoretical analyses for ZooPFL to study its convergence. Extensive empirical experiments on computer vision and natural language processing tasks using popular foundation models demonstrate its effectiveness for FL on black-box foundation models. In recent years, the growing emphasis on data privacy and security has led to the emergence of federated learning (FL) (Warnat-Herresthal et al., 2021; Chen & Chao, 2022; Chen et al., 2023b; Castiglia et al., 2023; Rodríguez-Barroso et al., 2023; Kuang et al., 2023). FL enables collaborative learning while safeguarding data privacy and security across distributed clients (Yang et al., 2019). However, FL faces two key challenges: limited resources and distribution shifts (Figure 1 (a, b)). The rise of large foundation models (Bommasani et al., 2021) has amplified these challenges. The computational demands and communication costs associated with such models hinder the deployment of existing FL approaches (Figure 1a).
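A sketch of the zeroth-order "input surgery" idea under simplifying assumptions: the black-box foundation model is only queried, and a two-point gradient estimate updates the client-side input transformation (the linear output remapping mentioned above is omitted). The smoothing parameter and number of sampled directions are illustrative.

```python
import torch

def zo_gradient(loss_fn, params, mu=1e-3, n_samples=8):
    """Estimate the gradient of loss_fn(params) without backprop through the black box.

    loss_fn queries the (frozen, black-box) model with inputs adapted by `params`
    and returns a scalar loss tensor.
    """
    grad = torch.zeros_like(params)
    for _ in range(n_samples):
        u = torch.randn_like(params)                     # random probing direction
        delta = (loss_fn(params + mu * u) - loss_fn(params - mu * u)) / (2 * mu)
        grad += delta * u                                # two-point finite-difference estimate
    return grad / n_samples
```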
DIVERSIFY: A General Framework for Time Series Out-of-distribution Detection and Generalization
Lu, Wang, Wang, Jindong, Sun, Xinwei, Chen, Yiqiang, Ji, Xiangyang, Yang, Qiang, Xie, Xing
Time series remains one of the most challenging modalities in machine learning research. Out-of-distribution (OOD) detection and generalization on time series tend to suffer due to its non-stationary property, i.e., the distribution changes over time. The dynamic distributions inside time series pose great challenges to existing algorithms in identifying invariant distributions, since these algorithms mainly focus on the scenario where the domain information is given as prior knowledge. In this paper, we attempt to exploit subdomains within a whole dataset to counteract the issues induced by non-stationarity for generalized representation learning. We propose DIVERSIFY, a general framework for OOD detection and generalization on the dynamic distributions of time series. DIVERSIFY takes an iterative process: it first obtains the "worst-case" latent distribution scenario via adversarial training, then reduces the gap between these latent distributions. For detection, we implement DIVERSIFY by combining it with existing OOD detection methods based on either extracted features or model outputs, while we directly utilize the outputs for classification. In addition, we provide theoretical insights showing that DIVERSIFY is theoretically supported. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that DIVERSIFY learns more generalized features and significantly outperforms other baselines.
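One standard way to realize the adversarial min-max sketched above (identify "worst-case" latent distributions, then train the encoder to close the gap) is a gradient-reversal layer between the encoder and a latent-domain discriminator; this is a generic illustration, not necessarily DIVERSIFY's exact mechanism.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The discriminator still improves, but the encoder receives a reversed
        # gradient, pushing latent domains to become indistinguishable.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```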
MetaFed: Federated Learning among Federations with Cyclic Knowledge Distillation for Personalized Healthcare
Chen, Yiqiang, Lu, Wang, Qin, Xin, Wang, Jindong, Xie, Xing
Federated learning has attracted increasing attention for building models without accessing raw user data, especially in healthcare. In real applications, different federations can seldom work together due to reasons such as data heterogeneity and distrust of, or the absence of, a central server. In this paper, we propose a novel framework called MetaFed to facilitate trustworthy FL between different federations. MetaFed obtains a personalized model for each federation without a central server via the proposed Cyclic Knowledge Distillation. Specifically, MetaFed treats each federation as a meta distribution and aggregates the knowledge of each federation in a cyclic manner. The training is split into two parts: common knowledge accumulation and personalization. Comprehensive experiments on three benchmarks demonstrate that MetaFed, without a server, achieves better accuracy compared to state-of-the-art methods (e.g., a 10%+ accuracy improvement over the baseline on PAMAP2) with lower communication costs.
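A sketch of cyclic knowledge distillation among federations as summarized above: federation i treats federation i-1's model as the teacher and distills its predictions while fitting its own data, so knowledge accumulates around the ring. The temperature, loss weighting, and training loop are assumptions, not MetaFed's exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

def cyclic_kd_round(models, loaders, optimizers, T=2.0, beta=0.5):
    """models/loaders/optimizers: per-federation lists, traversed in a fixed cycle."""
    n = len(models)
    for i in range(n):
        teacher = copy.deepcopy(models[(i - 1) % n]).eval()   # previous federation's model
        student, opt = models[i], optimizers[i]
        for x, y in loaders[i]:
            s_logits = student(x)
            with torch.no_grad():
                t_logits = teacher(x)
            kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                          F.softmax(t_logits / T, dim=-1),
                          reduction="batchmean") * T * T
            loss = F.cross_entropy(s_logits, y) + beta * kd   # local fit + distilled knowledge
            opt.zero_grad()
            loss.backward()
            opt.step()
```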
FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning
Lu, Wang, Hu, Xixu, Wang, Jindong, Xie, Xing
Federated learning (FL) has emerged as a new paradigm for privacy-preserving computation in recent years. Unfortunately, FL faces two critical challenges that hinder its actual performance: data distribution heterogeneity and the high resource costs brought by large foundation models. Specifically, non-IID data across clients makes it hard for existing FL algorithms to converge, while the high resource costs, including computational and communication costs, increase deployment difficulty in real-world scenarios. In this paper, we propose an effective yet simple method, named FedCLIP, to achieve fast generalization and personalization for CLIP in federated learning. Concretely, we design an attention-based adapter for the large model, CLIP, and all remaining operations depend only on the adapter. Lightweight adapters can make the most of pretrained model information and ensure that models adapt to clients' specific tasks. Simultaneously, the small-scale operations mitigate the computational and communication burdens caused by large models. Extensive experiments are conducted on three datasets with distribution shifts. Qualitative and quantitative results demonstrate that FedCLIP significantly outperforms other baselines (9% overall improvement on PACS) and effectively reduces computational and communication costs (283x faster than FedAVG). Our code will be available at: https://github.com/microsoft/PersonalizedFL.
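A minimal sketch of the adapter idea described above: a small attention-style gating module sits on top of the frozen CLIP image encoder, and only these adapter weights would be trained and exchanged in FL. The exact architecture below is an illustrative assumption, not necessarily FedCLIP's module.

```python
import torch
import torch.nn as nn

class AttentionAdapter(nn.Module):
    """Lightweight gating adapter applied to frozen CLIP features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                  nn.Linear(dim, dim), nn.Softmax(dim=-1))

    def forward(self, clip_features):
        # Re-weight the frozen CLIP features; the backbone itself stays untouched,
        # so only this module's parameters need to be communicated.
        return clip_features * self.gate(clip_features)
```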
Generalizable Low-Resource Activity Recognition with Diverse and Discriminative Representation Learning
Qin, Xin, Wang, Jindong, Ma, Shuo, Lu, Wang, Zhu, Yongchun, Xie, Xing, Chen, Yiqiang
Human activity recognition (HAR) is a time series classification task that focuses on identifying motion patterns from human sensor readings. Adequate data is essential but is a major bottleneck for training a generalizable HAR model, which assists customization and optimization of online web applications. However, collecting large-scale labeled data is costly in both time and money, i.e., the low-resource challenge. Meanwhile, data collected from different persons exhibit distribution shifts due to different living habits, body shapes, age groups, etc. The low-resource and distribution shift challenges are detrimental to HAR when applying the trained model to new unseen subjects. In this paper, we propose a novel approach called Diverse and Discriminative representation Learning (DDLearn) for generalizable low-resource HAR. DDLearn simultaneously considers diversity and discrimination learning. With the constructed self-supervised learning task, DDLearn enlarges data diversity and explores the latent activity properties. Then, we propose a diversity preservation module to preserve the diversity of learned features by enlarging the distribution divergence between the original and augmented domains. Meanwhile, DDLearn also enhances semantic discrimination by learning discriminative representations with supervised contrastive learning. Extensive experiments on three public HAR datasets demonstrate that our method significantly outperforms state-of-the-art methods by an average accuracy improvement of 9.5% under low-resource distribution shift scenarios, while being a generic, explainable, and flexible framework. Code is available at: https://github.com/microsoft/robustlearn.
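A sketch of the supervised contrastive term used for the discrimination side of DDLearn (the self-supervised augmentation and diversity-preservation parts are omitted). This follows the generic supervised contrastive formulation and is not the authors' exact code; the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive(features, labels, temperature=0.07):
    """features: (B, D) embeddings; labels: (B,) class ids. Returns the SupCon-style loss."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                              # pairwise similarities
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # for numerical stability
    mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # same-class pairs
    self_mask = torch.eye(len(labels), device=labels.device)
    mask = mask * (1.0 - self_mask)                            # positives, excluding self
    exp_logits = torch.exp(logits) * (1.0 - self_mask)         # exclude self from denominator
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    mean_log_prob_pos = (mask * log_prob).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```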