Goto

Collaborating Authors

 Transfer Learning


Response to Reviews of " Co-Tuning for Transfer Learning "

Neural Information Processing Systems

We thank all reviewers for their detailed reviews. However, the major technique is still feature fine-tuning ( a.k.a. In the following, we respond to common questions first and then to major concerns of each reviewer. Each dataset has a train/test split. Each method has access to the same set of training data.


Finding the most relevant auxiliary forecasting tasks for pre-training and knowledge transferring to a given primary

Neural Information Processing Systems

We thank the reviewers for valuable and timely comments. We'd like to first emphasize the challenges and contributions: Section 3.2 explains how to calculate this hyper-gradient of Framework for BackPropagation, LeCun, 1988), and widely adopted in the literature [14, 15, 35]. We would like to further polish the notation to be more consistent. 'Pretrain (Top)' is much better than'Pretrain (Down)'.



Supplementary Materials for LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Neural Information Processing Systems

As presented in Section 3.2, our side networks are built on Transformer blocks (same as the backbone Accuracy on GLUE (%) Adapter block + gates 2.07 6.5 83.1 Transformer block + cross attention 2.68 10.4 83.0 Transformer block + gates (current design) 2.29 7.0 83.8 Table 2: Hyper-parameters used for NLP experiments. Batch size is 100 for all methods.Method Learning Rate Other Hyper-parameters Full fine-tuning 3 10 Batch size is 300 for all methods.Method Learning Rate Other Hyper-parameters Full fine-tuning 3 10


Dynamic Mixture-of-Experts for Incremental Graph Learning

arXiv.org Artificial Intelligence

Graph incremental learning is a learning paradigm that aims to adapt trained models to continuously incremented graphs and data over time without the need for retraining on the full dataset. However, regular graph machine learning methods suffer from catastrophic forgetting when applied to incremental learning settings, where previously learned knowledge is overridden by new knowledge. Previous approaches have tried to address this by treating the previously trained model as an inseparable unit and using techniques to maintain old behaviors while learning new knowledge. These approaches, however, do not account for the fact that previously acquired knowledge at different timestamps contributes differently to learning new tasks. Some prior patterns can be transferred to help learn new data, while others may deviate from the new data distribution and be detrimental. To address this, we propose a dynamic mixture-of-experts (DyMoE) approach for incremental learning. Specifically, a DyMoE GNN layer adds new expert networks specialized in modeling the incoming data blocks. We design a customized regularization loss that utilizes data sequence information so existing experts can maintain their ability to solve old tasks while helping the new expert learn the new data effectively. As the number of data blocks grows over time, the computational cost of the full mixture-of-experts (MoE) model increases. To address this, we introduce a sparse MoE approach, where only the top-$k$ most relevant experts make predictions, significantly reducing the computation time. Our model achieved 4.92\% relative accuracy increase compared to the best baselines on class incremental learning, showing the model's exceptional power.


Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

arXiv.org Artificial Intelligence

--Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. However, OCIL faces two key challenges: maintaining model stability under strict memory constraints and ensuring adaptability to new tasks. Under stricter memory constraints, current replay-based methods are less effective. While ensemble methods improve adaptability (plasticity), they often struggle with stability. T o overcome these challenges, we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM)--a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. This fused model is then redistributed periodically to the students to stabilize learning and promote cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. This approach enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. Class-Incremental Learning is designed to integrate the knowledge of classes from a stream of data with an evolved distribution [1].


From Time-series Generation, Model Selection to Transfer Learning: A Comparative Review of Pixel-wise Approaches for Large-scale Crop Mapping

arXiv.org Artificial Intelligence

Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal time-series generation approaches and supervised crop mapping models, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of optimal methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from CDL trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows. RF offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with UDA being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples. With a sufficient sample size, supervised training typically delivers more accurate and generalizable results. Below a certain threshold, transfer learning that matches the level of domain shift is a viable alternative to achieve crop mapping. All code is publicly available to encourage reproducibility practice.


Can Multitask Learning Enhance Model Explainability?

arXiv.org Artificial Intelligence

Remote sensing provides satellite data in diverse types and formats. The usage of multimodal learning networks exploits this diversity to improve model performance, except that the complexity of such networks comes at the expense of their interpretability. In this study, we explore how modalities can be leveraged through multitask learning to intrinsically explain model behavior. In particular, instead of additional inputs, we use certain modalities as additional targets to be predicted along with the main task. The success of this approach relies on the rich information content of satellite data, which remains as input modalities. We show how this modeling context provides numerous benefits: (1) in case of data scarcity, the additional modalities do not need to be collected for model inference at deployment, (2) the model performance remains comparable to the multimodal baseline performance, and in some cases achieves better scores, (3) prediction errors in the main task can be explained via the model behavior in the auxiliary task(s). We demonstrate the efficiency of our approach on three datasets, including segmentation, classification, and regression tasks.


Data-Driven Spectrum Demand Prediction: A Spatio-Temporal Framework with Transfer Learning

arXiv.org Artificial Intelligence

Accurate spectrum demand prediction is crucial for informed spectrum allocation, effective regulatory planning, and fostering sustainable growth in modern wireless communication networks. It supports governmental efforts, particularly those led by the international telecommunication union (ITU), to establish fair spectrum allocation policies, improve auction mechanisms, and meet the requirements of emerging technologies such as advanced 5G, forthcoming 6G, and the internet of things (IoT). This paper presents an effective spatio-temporal prediction framework that leverages crowdsourced user-side key performance indicators (KPIs) and regulatory datasets to model and forecast spectrum demand. The proposed methodology achieves superior prediction accuracy and cross-regional generalizability by incorporating advanced feature engineering, comprehensive correlation analysis, and transfer learning techniques. Unlike traditional ITU models, which are often constrained by arbitrary inputs and unrealistic assumptions, this approach exploits granular, data-driven insights to account for spatial and temporal variations in spectrum utilization. Comparative evaluations against ITU estimates, as the benchmark, underscore our framework's capability to deliver more realistic and actionable predictions. Experimental results validate the efficacy of our methodology, highlighting its potential as a robust approach for policymakers and regulatory bodies to enhance spectrum management and planning.


Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data

arXiv.org Artificial Intelligence

This paper examines the effectiveness of combining active learning and transfer learning for anomaly detection in cross-domain time-series data. Our results indicate that there is an interaction between clustering and active learning and in general the best performance is achieved using a single cluster (in other words when clustering is not applied). Also, we find that adding new samples to the training set using active learning does improve model performance but that in general, the rate of improvement is slower than the results reported in the literature suggest. We attribute this difference to an improved experimental design where distinct data samples are used for the sampling and testing pools. Finally, we assess the ceiling performance of transfer learning in combination with active learning across several datasets and find that performance does initially improve but eventually begins to tail off as more target points are selected for inclusion in training. This tail-off in performance may indicate that the active learning process is doing a good job of sequencing data points for selection, pushing the less useful points towards the end of the selection process and that this tail-off occurs when these less useful points are eventually added. Taken together our results indicate that active learning is effective but that the improvement in model performance follows a linear flat function concerning the number of points selected and labelled.