Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

Zheng, Tao, Wang, Liejun, Yu, Yinfeng

arXiv.org Artificial Intelligence

Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample room for advancement in speech enhancement research. In speech tasks, confining the attention mechanism to the temporal dimension alone limits how effectively a model can focus on critical speech features. To address these issues, our study introduces HFSDA, a novel speech enhancement framework that integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism to significantly improve speech clarity and quality in noisy environments. Furthermore, we employ Omni-dimensional Dynamic Convolution (ODConv) in the spectrogram input branch, enabling better extraction and integration of crucial information across multiple dimensions. Additionally, we refine the Conformer model by strengthening its feature extraction not only along the temporal dimension but also across the spectral domain. Extensive experiments on the VCTK-DEMAND dataset show that HFSDA is comparable to existing state-of-the-art models, confirming the validity of our approach.
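To make the dual-dimension idea concrete, here is a minimal PyTorch sketch of self-attention applied along both the temporal and spectral axes of a spectrogram-like tensor. The module name, tensor layout, and residual fusion are illustrative assumptions for this summary, not the authors' HFSDA code.

import torch
import torch.nn as nn

class DualDimensionAttention(nn.Module):
    """Sketch: multi-head self-attention along the time axis and the frequency
    axis of a (batch, time, freq, channels) tensor, fused with a residual."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f, c = x.shape
        # Temporal attention: treat each frequency bin as an independent sequence.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        xt, _ = self.time_attn(xt, xt, xt)
        xt = xt.reshape(b, f, t, c).permute(0, 2, 1, 3)
        # Spectral attention: treat each time frame as an independent sequence.
        xf = x.reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        xf = xf.reshape(b, t, f, c)
        return self.norm(x + xt + xf)  # residual fusion of both attention paths

x = torch.randn(2, 100, 64, 32)             # (batch, frames, freq bins, channels)
print(DualDimensionAttention(32)(x).shape)  # torch.Size([2, 100, 64, 32])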


KernelWarehouse: Rethinking the Design of Dynamic Convolution

Li, Chao, Yao, Anbang

arXiv.org Artificial Intelligence

Dynamic convolution learns a linear mixture of n static kernels weighted by their input-dependent attentions, demonstrating superior performance to normal convolution. However, it increases the number of convolutional parameters by n times, and is thus not parameter-efficient. As a result, no prior work has explored the setting n>100 (an order of magnitude larger than the typical setting n<10) to push forward the performance boundary of dynamic convolution while retaining parameter efficiency. To fill this gap, we propose KernelWarehouse, a more general form of dynamic convolution that redefines the basic concepts of "kernels", "assembling kernels" and "attention function" through the lens of exploiting convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet. We validate the effectiveness of KernelWarehouse on the ImageNet and MS-COCO datasets using various ConvNet architectures. Intriguingly, KernelWarehouse is also applicable to Vision Transformers, and it can even reduce the model size of a backbone while improving accuracy. For instance, KernelWarehouse (n=4) achieves 5.61%|3.90%|4.38% absolute top-1 accuracy gains on the ResNet18|MobileNetV2|DeiT-Tiny backbones, and KernelWarehouse (n=1/4), with a 65.10% model-size reduction, still achieves a 2.29% gain on the ResNet18 backbone. The code and models are available at https://github.com/OSVAI/KernelWarehouse.
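For context, the following PyTorch sketch shows the dynamic-convolution baseline the abstract starts from: a linear mixture of n static kernels weighted by input-dependent attention. KernelWarehouse itself goes further by sharing kernel "cells" within and across layers; the class name and shapes here are illustrative assumptions, not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch: linear mixture of n static kernels with input-dependent attention."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, n: int = 4):
        super().__init__()
        self.n, self.k, self.in_ch, self.out_ch = n, k, in_ch, out_ch
        # n static kernels, each of shape (out_ch, in_ch, k, k).
        self.kernels = nn.Parameter(torch.randn(n, out_ch, in_ch, k, k) * 0.02)
        # Attention branch: global pooling -> per-kernel mixing weights.
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        alpha = F.softmax(self.attn(x), dim=1)                 # (b, n)
        # Per-sample kernel mixture, summed over the n candidates.
        weight = torch.einsum('bn,noihw->boihw', alpha, self.kernels)
        weight = weight.reshape(b * self.out_ch, self.in_ch, self.k, self.k)
        # Batch the per-sample convolutions as one grouped convolution.
        out = F.conv2d(x.reshape(1, b * c, h, w), weight,
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.out_ch, h, w)

x = torch.randn(2, 16, 32, 32)
print(DynamicConv2d(16, 32)(x).shape)  # torch.Size([2, 32, 32, 32])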


Omni-Dimensional Dynamic Convolution

Li, Chao, Zhou, Aojun, Yao, Anbang

arXiv.org Artificial Intelligence

Recent research in dynamic convolution shows that learning a linear combination of n convolutional kernels weighted by their input-dependent attentions can significantly improve the accuracy of light-weight CNNs while maintaining efficient inference. However, we observe that existing works endow convolutional kernels with the dynamic property through only one dimension of the kernel space (the convolutional kernel number), while the other three dimensions (the spatial size, the input channel number and the output channel number of each convolutional kernel) are overlooked. Inspired by this, we present Omni-dimensional Dynamic Convolution (ODConv), a more generalized yet elegant dynamic convolution design, to advance this line of research. ODConv leverages a novel multi-dimensional attention mechanism with a parallel strategy to learn complementary attentions for convolutional kernels along all four dimensions of the kernel space at any convolutional layer. As a drop-in replacement for regular convolutions, ODConv can be plugged into many CNN architectures. Extensive experiments on the ImageNet and MS-COCO datasets show that ODConv brings solid accuracy boosts for various prevailing CNN backbones, both light-weight and large, e.g., 3.77%-5.71% | 1.86%-3.72% absolute top-1 improvements to the MobileNetV2 | ResNet families on ImageNet. Intriguingly, thanks to its improved feature learning ability, ODConv with even a single kernel can compete with or outperform existing dynamic convolution counterparts with multiple kernels, substantially reducing the extra parameters. Furthermore, ODConv is also superior to other attention modules for modulating the output features or the convolutional weights. Code and models are available at https://github.com/OSVAI/ODConv.

In the past decade, we have witnessed the tremendous success of deep Convolutional Neural Networks (CNNs) in many computer vision applications (Krizhevsky et al., 2012; Girshick et al., 2014; Long et al., 2015; He et al., 2017). The most common way of constructing a deep CNN is to stack a number of convolutional layers together with other basic layers organized under a predefined feature connection topology. Along with great advances in CNN architecture design by manual engineering (Krizhevsky et al., 2012; He et al., 2016; Howard et al., 2017) and automatic searching (Zoph & Le, 2017; Pham et al., 2018; Howard et al., 2019), many prevailing classification backbones have been presented. Recent works (Wang et al., 2017; Hu et al., 2018b; Park et al., 2018; Woo et al., 2018; Yang et al., 2019; Chen et al., 2020) show that incorporating attention mechanisms into convolutional blocks can further push the performance boundaries of modern CNNs, and this has thus attracted great research interest in the deep learning community.
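As a rough illustration of the four-dimensional attention described above, here is a minimal PyTorch sketch that modulates a mixture of n kernels along the spatial, input-channel, output-channel, and kernel-number dimensions. The single shared attention trunk and all shapes are assumptions for illustration; the authors' actual design is in the linked repository (https://github.com/OSVAI/ODConv).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Sketch: attentions along all four dimensions of the kernel space."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, n: int = 4, reduction: int = 4):
        super().__init__()
        self.n, self.k, self.in_ch, self.out_ch = n, k, in_ch, out_ch
        self.kernels = nn.Parameter(torch.randn(n, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 4)
        # Shared trunk: global pooling -> bottleneck features (an assumption here).
        self.trunk = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        # Four parallel heads, one per kernel-space dimension.
        self.a_spatial = nn.Linear(hidden, k * k)
        self.a_in = nn.Linear(hidden, in_ch)
        self.a_out = nn.Linear(hidden, out_ch)
        self.a_num = nn.Linear(hidden, n)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z = self.trunk(x)                                              # (b, hidden)
        s = torch.sigmoid(self.a_spatial(z)).view(b, 1, 1, 1, self.k, self.k)
        ci = torch.sigmoid(self.a_in(z)).view(b, 1, 1, self.in_ch, 1, 1)
        co = torch.sigmoid(self.a_out(z)).view(b, 1, self.out_ch, 1, 1, 1)
        kn = F.softmax(self.a_num(z), dim=1).view(b, self.n, 1, 1, 1, 1)
        # Modulate each kernel along all four dimensions, then sum over n.
        weight = (kn * co * ci * s * self.kernels.unsqueeze(0)).sum(dim=1)
        weight = weight.reshape(b * self.out_ch, self.in_ch, self.k, self.k)
        out = F.conv2d(x.reshape(1, b * c, h, w), weight,
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.out_ch, h, w)

x = torch.randn(2, 16, 32, 32)
print(ODConv2d(16, 32)(x).shape)  # torch.Size([2, 32, 32, 32])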