skip layer
Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers
Chen, Qian, Wang, Wen, Zhang, Qinglin, Zheng, Siqi, Zhang, Shiliang, Deng, Chong, Yu, Hai, Liu, Jiaqing, Ma, Yukun, Zhang, Chong
The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
PEMMA: Parameter-Efficient Multi-Modal Adaptation for Medical Image Segmentation
Saadi, Nada, Saeed, Numan, Yaqub, Mohammad, Nandakumar, Karthik
Imaging modalities such as Computed Tomography (CT) and Positron Emission Tomography (PET) are key in cancer detection, inspiring Deep Neural Networks (DNN) models that merge these scans for tumor segmentation. When both CT and PET scans are available, it is common to combine them as two channels of the input to the segmentation model. However, this method requires both scan types during training and inference, posing a challenge due to the limited availability of PET scans, thereby sometimes limiting the process to CT scans only. Hence, there is a need to develop a flexible DNN architecture that can be trained/updated using only CT scans but can effectively utilize PET scans when they become available. In this work, we propose a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a transformer-based segmentation model trained only on CT scans to also incorporate PET scans. The benefits of the proposed approach are two-fold. Firstly, we leverage the inherent modularity of the transformer architecture and perform low-rank adaptation (LoRA) of the attention weights to achieve parameter-efficient adaptation. Secondly, since the PEMMA framework attempts to minimize cross modal entanglement, it is possible to subsequently update the combined model using only one modality, without causing catastrophic forgetting of the other modality. Our proposed method achieves comparable results with the performance of early fusion techniques with just 8% of the trainable parameters, especially with a remarkable +28% improvement on the average dice score on PET scans when trained on a single modality.
Two Novel Performance Improvements for Evolving CNN Topologies
Convolutional Neural Networks (CNNs) are the state-of-the-art algorithms for the processing of images. However the configuration and training of these networks is a complex task requiring deep domain knowledge, experience and much trial and error. Using genetic algorithms, competitive CNN topologies for image recognition can be produced for any specific purpose, however in previous work this has come at high computational cost. In this work two novel approaches are presented to the utilisation of these algorithms, effective in reducing complexity and training time by nearly 20%. This is accomplished via regularisation directly on training time, and the use of partial training to enable early ranking of individual architectures. Both approaches are validated on the benchmark CIFAR10 data set, and maintain accuracy.
Brain Signal Classification via Learning Connectivity Structure
Jang, Soobeom, Moon, Seong-Eun, Lee, Jong-Seok
Connectivity between different brain regions is one of the most important properties for classification of brain signals including electroencephalography (EEG). However, how to define the connectivity structure for a given task is still an open problem, because there is no ground truth about how the connectivity structure should be in order to maximize the performance. In this paper, we propose an end-to-end neural network model for EEG classification, which can extract an appropriate multi-layer graph structure and signal features directly from a set of raw EEG signals and perform classification. Experimental results demonstrate that our method yields improved performance in comparison to the existing approaches where manually defined connectivity structures and signal features are used. Furthermore, we show that the graph structure extraction process is reliable in terms of consistency, and the learned graph structures make much sense in the neuroscientific viewpoint.
The idea is ridiculously simple (perhaps why it is effective?): randomly skip layers while training • /r/MachineLearning
The idea is ridiculously simple (perhaps why it is effective?): I don't understand the claim "Remember all the narratives we told about how depth learns hierarchical representations, and higher level representations -- those higher level representations don't seem to matter so much after all.". The net has over 100 layers!?! I imagine that this also works reasonably well in the RNN encoder in an encoder/decoder framework. I wonder if it also applies to generative RNNs.