Li, Zhuo (The Hong Kong Polytechnic University) | Wang, Hongwei (Shanghai Jiao Tong University) | Zhao, Miao (The Hong Kong Polytechnic University) | Li, Wenjie (The Hong Kong Polytechnic University) | Guo, Minyi (Shanghai Jiao Tong University)
Monaural source separation (MSS) aims to extract and reconstruct different sources from a single-channel mixture, which could facilitate a variety of applications such as chord recognition, pitch estimation, and automatic transcription. In this paper, we study the problem of separating vocals and instruments from a monaural music mixture. Existing works on monaural source separation either use linear, shallow models (e.g., non-negative matrix factorization) or do not explicitly address the coupling and entanglement of multiple sources in the original input signal; hence, they do not perform satisfactorily in real-world scenarios. To overcome these limitations, we propose a novel end-to-end framework for monaural music mixture separation called Deep Representation-Decoupling Neural Networks (DRDNN). DRDNN takes advantage of both traditional signal processing methods and popular deep learning models. For each input music mixture, DRDNN converts it to a two-dimensional time-frequency spectrogram using the short-time Fourier transform (STFT), followed by stacked convolutional neural network (CNN) layers and long short-term memory (LSTM) layers to extract more condensed features. Afterwards, DRDNN uses a decoupling component, which consists of a group of multi-layer perceptrons (MLPs), to further decouple the features into separate sources. The decoupling component produces purified single-source signals for subsequent full-size restoration and can significantly improve the final separation performance. Through extensive experiments on a real-world dataset, we show that DRDNN outperforms state-of-the-art baselines in the task of monaural music mixture separation and reconstruction.
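The STFT front end mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stft_magnitude`, the frame length, the hop size, the Hann window, and the test tone are all assumptions chosen for illustration, and the naive DFT stands in for the FFT a real system would use.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a magnitude spectrogram as a list of frames.

    Each frame holds the magnitudes of DFT bins 0..frame_len//2.
    frame_len and hop are illustrative values, not taken from the paper.
    """
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        # Naive O(N^2) DFT, kept for clarity; an FFT would be used in practice.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Hypothetical test signal: a sinusoid centred exactly on bin 8 of a 64-point frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = stft_magnitude(tone)
# Index of the strongest frequency bin in each time frame.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

The resulting time-by-frequency grid is the kind of two-dimensional input that convolutional layers can then process as an image-like array.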
Traditional vision-based hand gesture recognition systems are limited in dark conditions. In this paper, we build a hand gesture recognition system based on a microwave transceiver and a deep learning algorithm. A Doppler radar sensor with dual receiving channels at 5.8 GHz is used to acquire a large database of hand gesture signals. The received hand gesture signals are then processed with time-frequency analysis. Based on this large database of hand gestures, we propose a new machine learning architecture called the deformable deep convolutional generative adversarial network. Experimental results show that the new architecture can improve the recognition rate by 10% and that the deformable kernel can reduce the testing time by 30%.
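To give a sense of the signal scale involved, the sketch below evaluates the standard two-way Doppler relation f_d = 2 v f_c / c for the 5.8 GHz carrier named above. The function name and the 1 m/s hand speed are assumptions for illustration; the abstract does not state the gesture velocities or the processing chain used.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def doppler_shift_hz(radial_velocity_mps, carrier_hz=5.8e9):
    """Two-way Doppler shift seen by a radar whose transmitter and receiver are co-located.

    Positive velocity means the target approaches the sensor.
    """
    return 2.0 * radial_velocity_mps * carrier_hz / SPEED_OF_LIGHT


# Assumed hand speed of 1 m/s towards the sensor (not a value from the paper).
shift = doppler_shift_hz(1.0)
```

Under this assumption the shift is only a few tens of hertz, which suggests why the time-frequency representation of such signals needs fine frequency resolution.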
Inducing symmetry equivariance in deep neural architectures has resulted in improved data efficiency and generalization. In this work, we use the concept of scale and translation equivariance to tackle the problem of learning on time series from raw waveforms. As a result, we obtain representations that largely resemble those of the wavelet transform at the first layer, but that evolve into much more descriptive ones as a function of depth. Our empirical results support the suitability of our Wavelet Networks, which, with a simple architecture design, perform consistently better than CNNs on raw waveforms and on par with spectrogram-based methods.
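The translation half of the equivariance property can be checked numerically with a short sketch. It covers translation only; the scale equivariance described in the abstract requires additional structure that is not reproduced here. The helper names, the signal, and the kernel are hypothetical, and circular boundary handling is assumed so that the property holds exactly.

```python
def circular_correlate(x, kernel):
    """Cross-correlate x with kernel, wrapping around at the signal boundary."""
    n = len(x)
    return [sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
            for i in range(n)]


def shift(x, s):
    """Circularly delay x by s samples."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


# Hypothetical waveform and filter, chosen only for illustration.
signal = [0.0, 1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.0]
kernel = [1.0, -1.0, 0.5]

# Equivariance: filtering a shifted signal equals shifting the filtered signal.
filtered_then_shifted = shift(circular_correlate(signal, kernel), 3)
shifted_then_filtered = circular_correlate(shift(signal, 3), kernel)
```

Both orders of operation give the same output, which is the sense in which a convolutional layer commutes with translations of its input.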
Musical Source Separation (MSS) is a signal processing task that tries to separate a mixed musical signal into its individual acoustic sound sources, such as singing voice or drums. Recently, many machine learning-based methods have been proposed for the MSS task, but no existing work evaluates and directly compares various types of networks. In this paper, we aim to design a variety of neural transformation methods, including time-invariant methods, time-frequency methods, and mixtures of two different transformations. Our experiments provide abundant material for future work by comparing several transformation methods. We train our models on raw complex-valued STFT outputs and achieve state-of-the-art SDR performance on the MUSDB singing voice separation task by a large margin of 1.0 dB.
1 Introduction
For a given mixed musical signal composed of several instrumental sounds, Musical Source Separation (MSS) is a signal processing task that tries to separate the mixture into its individual acoustic sound sources, such as singing voice or drums.
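The SDR figure quoted in the abstract is a signal-to-distortion ratio in decibels. The sketch below computes the basic energy-ratio form, 10 log10(||s||^2 / ||s - s_hat||^2). It is an illustration under stated assumptions: benchmark toolkits typically compute a more elaborate variant of the metric, and the function name and sample values here are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB between a reference and an estimate."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        # A perfect reconstruction has unbounded SDR.
        return math.inf
    return 10.0 * math.log10(signal_energy / error_energy)


# Hypothetical source and an estimate that is 10% too quiet.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
value = sdr_db(reference, estimate)
```

Because the scale is logarithmic, a 1.0 dB gain at fixed signal energy corresponds to roughly a 21% reduction in error energy under this simple definition.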
Anomalous activity recognition deals with identifying patterns and events that deviate from the normal stream. In a surveillance setting, these events range from abuse and fighting to road accidents and snatching. Because anomalous events occur sparsely, recognizing anomalous activity in surveillance videos is a challenging research task. Reported approaches can generally be categorized as handcrafted or deep learning-based. Most reported studies address binary classification, i.e., anomaly detection in surveillance videos, and do not distinguish among specific anomalous events such as abuse, fighting, road accidents, shooting, stealing, vandalism, and robbery. Therefore, this paper aims to provide an effective framework for recognizing different real-world anomalies in videos. This study provides a simple yet effective approach for learning spatiotemporal features using deep 3-dimensional convolutional networks (3D ConvNets) trained on the University of Central Florida (UCF) Crime video dataset. First, frame-level labels for the UCF Crime dataset are provided; then, to extract anomalous spatiotemporal features more efficiently, a fine-tuned 3D ConvNet is proposed. The findings of the study are twofold: (1) there exist specific, detectable, and quantifiable features in the UCF Crime video feed that associate with each other; (2) multiclass learning can improve the generalization ability of the 3D ConvNets by effectively learning frame-level information from the dataset, and results can be further improved by applying spatial augmentation.
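A 3D convolution slides one filter jointly over time, height, and width, which is what lets it respond to motion as well as appearance. The sketch below is a minimal single-channel, single-filter version without padding or stride; it is not the network from the paper, and the function name, the clip, and the kernel are hypothetical.

```python
def conv3d_valid(volume, kernel):
    """Valid-mode 3D cross-correlation of volume[t][y][x] with kernel[dt][dy][dx]."""
    frames, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(frames - kt + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                acc = 0.0
                # Accumulate over the full spatiotemporal neighbourhood.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * volume[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical clip: 4 frames of 5x5 pixels whose brightness equals the frame index.
clip = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]
# Hypothetical 2x3x3 kernel: local spatial mean of the next frame minus the current one.
kernel = [[[-1.0 / 9.0] * 3 for _ in range(3)],
          [[1.0 / 9.0] * 3 for _ in range(3)]]
response = conv3d_valid(clip, kernel)
```

On this clip the filter responds with the frame-to-frame brightness change at every location, a toy example of a spatiotemporal feature.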