


Appendix

Neural Information Processing Systems

In practice, building f and g requires computing the products w_t^i · w_t^j for all i, j. B.2 Classification: For the classification task with the logistic regression model, we modify the formula of logistic regression in the teaching objectives to make the derivation convenient. It also indicates that, with probability at least p_1, the LST teacher can achieve exponential teachability in iteration t. To achieve exponential teachability in T iterations, the sufficient condition in Eq. (22) must be satisfied in all T iterations. Then, we use a pre-trained DenseNet [65], as in [53], to generate 1024-dimensional features and the confidence score for each image.
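As a rough illustration of the pairwise cost this snippet refers to, here is a minimal NumPy sketch (the exact forms of f and g are not given here, so only the all-pairs inner-product step is shown in isolation; the function name is hypothetical):

    import numpy as np

    def pairwise_weight_products(W):
        # W: (n, d) array whose i-th row is the weight vector w_t^i.
        # Returns the (n, n) matrix of all inner products w_t^i . w_t^j,
        # i.e. the O(n^2 d) step incurred when building f and g naively.
        return W @ W.T                      # Gram matrix of the rows of W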



54801e196796134a2b0ae5e8adef502f-Paper-Conference.pdf

Neural Information Processing Systems

Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g. ... This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model.
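To make the backpropagation point concrete, here is a minimal PyTorch sketch (a hypothetical toy setup, not any specific PETL method): even with every backbone parameter frozen, a trainable module placed before the backbone only receives its gradient via a full backward pass through the frozen layers.

    import torch
    import torch.nn as nn

    # Frozen backbone preceded by a trainable module (toy stand-in for an
    # adapter/prompt). Updating it still requires backpropagating through
    # the whole frozen backbone, so activation memory is not reduced.
    backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                             nn.Linear(512, 512), nn.ReLU())
    for p in backbone.parameters():
        p.requires_grad = False

    adapter_in = nn.Linear(512, 512)   # trainable, sits before the backbone
    head = nn.Linear(512, 10)          # trainable classification head

    x = torch.randn(8, 512)
    loss = head(backbone(adapter_in(x))).sum()
    loss.backward()                    # gradients flow through frozen layers
    print(adapter_in.weight.grad is not None)   # True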



Locality Sensitive Teaching

Neural Information Processing Systems

The emergence of the Internet of Things (IoT) sheds light on applying machine teaching (MT) algorithms for online personalized education on home devices. This direction became even more promising during the COVID-19 pandemic, when in-person education was infeasible. However, iterative machine teaching (IMT), one of the most influential and practical MT paradigms, is impractical on IoT devices because its algorithms are inefficient and unscalable. IMT is a paradigm in which a teacher iteratively and intelligently feeds examples based on the learner's status. In each iteration, current IMT algorithms greedily traverse the whole training set to find an example for the learner, which is computationally expensive in practice. We propose a novel teaching framework, Locality Sensitive Teaching (LST), based on locality sensitive sampling, to overcome these challenges. LST has provable near-constant time complexity, which is exponentially better than the existing baseline.
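For contrast with LST's near-constant-time sampling, the greedy step the abstract describes looks roughly like the following sketch, assuming a squared-loss learner (a common IMT setting; the learner model here is an assumption, not the paper's exact formulation):

    import numpy as np

    def greedy_imt_step(X, y, w, w_star, eta=0.1):
        # One greedy IMT iteration: scan all n training examples and pick
        # the one whose gradient step moves the learner closest to the
        # target w*. This O(n) scan per iteration is the bottleneck LST
        # removes by replacing the traversal with locality sensitive sampling.
        best_i, best_dist = 0, np.inf
        for i in range(len(X)):
            g = (X[i] @ w - y[i]) * X[i]   # squared-loss gradient on (x_i, y_i)
            d = np.linalg.norm((w - eta * g) - w_star)
            if d < best_dist:
                best_i, best_dist = i, d
        return best_i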


LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Neural Information Processing Systems

Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of domains recently. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g.
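A minimal PyTorch sketch of the ladder-side idea (simplified to linear blocks; the paper's actual architecture, initialization, and reduction factors differ): the side path reads intermediate backbone activations through downsampling shortcuts, and because the backbone runs under no_grad, backpropagation only ever touches the small side network.

    import torch
    import torch.nn as nn

    class LadderSideNet(nn.Module):
        def __init__(self, backbone_blocks, d_model, d_side, n_classes):
            super().__init__()
            self.backbone = nn.ModuleList(backbone_blocks)
            for p in self.backbone.parameters():
                p.requires_grad = False        # backbone stays frozen
            self.inp = nn.Linear(d_model, d_side)
            self.downs = nn.ModuleList(
                [nn.Linear(d_model, d_side) for _ in backbone_blocks])
            self.sides = nn.ModuleList(
                [nn.Linear(d_side, d_side) for _ in backbone_blocks])
            self.head = nn.Linear(d_side, n_classes)

        def forward(self, x):
            s = self.inp(x)
            for blk, down, side in zip(self.backbone, self.downs, self.sides):
                with torch.no_grad():          # no graph through the backbone
                    x = blk(x)
                s = torch.relu(side(s) + down(x))  # ladder shortcut
            return self.head(s)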


Latent Speech-Text Transformer

Lu, Yen-Ju, Gaur, Yashesh, Zhou, Wei, Muller, Benjamin, Villalba, Jesus, Dehak, Najim, Zettlemoyer, Luke, Ghosh, Gargi, Lewis, Mike, Iyer, Srinivasan, Le, Duc

arXiv.org Artificial Intelligence

Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to scaling laws that are several orders of magnitude slower. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves a 6.5% absolute gain in speech accuracy under compute-controlled training and a 5.3% gain under data-controlled training, while also improving text performance. We will release our models, code, and evaluation data to facilitate further research.
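The patch-aggregation idea can be illustrated with a minimal sketch (mean pooling with a fixed patch size; the paper's aggregation is dynamic, so this only shows the sequence-length reduction, and the function name is hypothetical):

    import torch
    import torch.nn.functional as F

    def pool_speech_tokens(speech_emb, patch_size=4):
        # speech_emb: (batch, seq_len, dim) speech-token embeddings.
        # Returns (batch, ceil(seq_len / patch_size), dim) patch embeddings,
        # shortening the speech sequence by roughly a factor of patch_size.
        b, t, d = speech_emb.shape
        pad = (-t) % patch_size                    # right-pad to a multiple
        if pad:
            speech_emb = F.pad(speech_emb, (0, 0, 0, pad))
        return speech_emb.view(b, -1, patch_size, d).mean(dim=2)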


In vanilla task

Neural Information Processing Systems

Sorry that we don't have this experiment. This is because the training splits of the dataset are different between SSFSC and FSC (see details in the Sec. ...). On the FSC dataset, a large proportion of the labeled data is used as unlabeled data for sampling SSFSC tasks. Therefore, the SSFSC training tasks contain less supervision than FSC. We found that VAT brings limited improvement, e.g., less than "+recursive". ... "+recursive" and "+mixing" are actually ... while "+mixing" has only one stage, using ... Q6: Requires a large number of unlabeled samples.