 cosine distance


ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

Neural Information Processing Systems

We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks--without any training or healing steps, resulting in minimal computational overhead.
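As a rough illustration of the idea in this abstract, the sketch below estimates a linear map from calibration activations and folds it into a following linear layer. It is not the authors' implementation: the abstract does not state the estimation objective, so ordinary least squares is swapped in here as an assumption, and the names `estimate_linear_map`, `merge_into_next_layer`, `x_in`, `x_out`, and `w_next` are hypothetical.

```python
import numpy as np


def estimate_linear_map(x_in, x_out):
    """Least-squares estimate of T minimizing ||x_in @ T - x_out||_F.

    x_in:  activations entering the pruned blocks, shape (n_tokens, d)
    x_out: activations leaving the pruned blocks, shape (n_tokens, d)
    Assumption: a plain least-squares objective, chosen for illustration.
    """
    t, *_ = np.linalg.lstsq(x_in, x_out, rcond=None)
    return t


def merge_into_next_layer(t, w_next):
    """Fold T into a following linear layer: (x @ T) @ W == x @ (T @ W)."""
    return t @ w_next


# Synthetic calibration data standing in for real hidden states.
rng = np.random.default_rng(0)
d = 8
x_in = rng.normal(size=(256, d))
true_map = rng.normal(size=(d, d))
x_out = x_in @ true_map + 0.01 * rng.normal(size=(256, d))

t_hat = estimate_linear_map(x_in, x_out)

# Merging keeps the parameter count of the next layer unchanged.
w_next = rng.normal(size=(d, 4))
w_merged = merge_into_next_layer(t_hat, w_next)
```

Because the map is linear, merging is a single matrix product, which is consistent with the abstract's claim that no additional parameters are needed.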


02a32ad2669e6fe298e607fe7cc0e1a0-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the reviewers (R1, R2, R3) for their feedback and suggestions. We have performed loss balancing with five different weights $t$ in the multi-task loss $L_m = t L_c + (1 - t) L_r$ for the classification and regression losses. The results on OmniArt are reported in Table A. Our proposal is robust to the weight value; tuning the task weight is not vital. We obtain a moderate gain for both classification and regression with a weight of $t = 0.25$. For the multi-task baseline, emphasizing regression reduces the regression error, as the gradient magnitude of the regression loss is much lower than the one for the classification loss.

Table A: Multi-task comparison across task weights.
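The weighted multi-task loss described above, a convex combination of a classification loss and a regression loss with weight t, can be sketched as follows. The function name `multitask_loss` and the loss values are hypothetical; the text says five weights were tried but names only t = 0.25, so the grid below is an assumption.

```python
def multitask_loss(loss_cls, loss_reg, t):
    """Convex combination t * L_c + (1 - t) * L_r of the two task losses."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return t * loss_cls + (1.0 - t) * loss_reg


# Hypothetical weight grid: only t = 0.25 is named in the text.
weights = [0.0, 0.25, 0.5, 0.75, 1.0]

# Made-up per-batch loss values, for illustration only.
loss_cls, loss_reg = 2.0, 4.0
combined = {t: multitask_loss(loss_cls, loss_reg, t) for t in weights}
```

A smaller t puts more weight on the regression term.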


Appendix information on the relationship between our training approach and domain adaptation

Neural Information Processing Systems

Here we note that our problem definition of pre-training is fundamentally different from domain adaptation (DA) [S1, S2, S3, S4, S5, S6], in order to prevent any confusion between this work and domain adaptation methods. DA applies a model trained on a pre-training dataset (i.e., source dataset) to a different target dataset [21, 42]. In contrast, self-supervised pre-training has four key differences from domain adaptation. In contrast, domain adaptation methods usually restrict pre-training and target datasets to have the same feature space (but possibly different distributions), e.g., [S22, S18, S19, S20, S13]. In summary, to support transfer learning across different time series datasets, a pre-training approach needs a capability to capture a generalizable property of time series, one that is shared across different time series datasets regardless of the specific semantic meaning of a time series signal (e.g., ECG, EMG, acceleration, vibration), conditions of data acquisition (e.g., variation across subjects and devices), sampling frequencies, etc. This work develops a self-supervised contrastive pre-training strategy that fulfills these requirements by injecting an appropriate inductive bias (called Time-Frequency Consistency, TF-C) into the model (Sec. Further, we clarify that the term 'self-supervised' has different meanings in DA and in pre-training [S23, S24, S25, S26]. 'Self-supervised domain adaptation' [S27, S16, S21, S15] or 'unsupervised domain adaptation' [S1, S22, S28, S11, S14] means that there are no labels in the target dataset; however, it still requires labels in the pre-training dataset. In contrast, 'self-supervised pre-training' [S29, S30, S31] (i.e., the problem studied here, in line with a breadth of existing literature on pre-training) indicates the setting where no labels are available in pre-training. Up to the submission of this manuscript, there are no existing contrastive augmentations in the frequency domain of time series.
There are two models, CoST [49] and BTSF [50], that involve the frequency domain in contrastive learning; however, the proposed TF-C is fundamentally different from them in the following aspects. We take BTSF as an example, while the differences also apply to CoST. Problem definitions for both papers are different. Our method is designed to produce generalizable representations that can transfer to a different time series dataset (going from pre-training to a fine-tuning dataset) for the purpose of transfer learning.


RGMDT: Return-Gap-Minimizing Decision Tree Extraction in Non-Euclidean Metric Space

Neural Information Processing Systems

In this paper, we establish an upper bound on the return gap between the oracle expert policy and an optimal decision tree policy. This enables us to recast the DT extraction problem into a novel non-Euclidean clustering problem over the local observation and action values space of each agent, with action values as cluster labels and the upper bound on the return gap as clustering loss.



Implicit variance regularization in non-contrastive SSL

Neural Information Processing Systems

In this work, we provide a comparative analysis of the learning dynamics for the Euclidean and cosine-based asymmetric losses in the eigenspace of the closed-form predictor DirectPred.
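For readers unfamiliar with the two loss families being compared, the sketch below computes a Euclidean and a cosine-based loss between a predictor output `p` and a target embedding `z`. It is illustrative only: the exact loss definitions, the DirectPred predictor, and the stop-gradient on the target branch are not reproduced here, and the names `euclidean_loss` and `cosine_loss` are hypothetical.

```python
import numpy as np


def euclidean_loss(p, z):
    """Half the mean squared Euclidean distance between rows of p and z."""
    return 0.5 * np.mean(np.sum((p - z) ** 2, axis=1))


def cosine_loss(p, z, eps=1e-12):
    """Negative mean cosine similarity between rows of p and z."""
    p_n = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    z_n = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    return -np.mean(np.sum(p_n * z_n, axis=1))


# Random embeddings standing in for predictor outputs and targets.
rng = np.random.default_rng(0)
p = rng.normal(size=(16, 32))
z = rng.normal(size=(16, 32))

l_euc = euclidean_loss(p, z)
l_cos = cosine_loss(p, z)
```

One visible difference is that the cosine loss is unchanged when `p` is rescaled, while the Euclidean loss is not.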