Liu, Kun
Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment
Wang, Ke, He, Lei, Liu, Kun, Deng, Yan, Wei, Wenning, Zhao, Sheng
Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using Large Language Models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.
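The abstract does not name the agreement metrics used against the Speechocean762 manual scores; Pearson correlation and mean absolute error are common choices for this benchmark. The sketch below is a minimal, hypothetical illustration of such a comparison, assuming utterance-level scores on a 0-10 scale and made-up numbers; it is not the paper's evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr

def score_agreement(model_scores, human_scores):
    """Compare LMM-produced scores with Speechocean762 manual scores.

    Returns the Pearson correlation coefficient and the mean absolute error.
    """
    model_scores = np.asarray(model_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    pcc, _ = pearsonr(model_scores, human_scores)
    mae = np.abs(model_scores - human_scores).mean()
    return pcc, mae

# Illustrative made-up scores on an assumed 0-10 utterance scale.
print(score_agreement([7, 9, 4, 8], [8, 9, 3, 7]))
```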
KACDP: A Highly Interpretable Credit Default Prediction Model
Liu, Kun, Zhao, Jin
In today's financial sector, individual credit risk prediction has become a crucial part of risk management at financial institutions. Accurate default prediction not only helps financial institutions significantly reduce losses but also improves the utilization of funds, thereby enhancing their competitiveness in the market [1] [2]. With the rapid development of financial technology, numerous machine learning and deep learning techniques are being widely applied to credit risk assessment. However, existing methods expose certain limitations when dealing with high-dimensional and nonlinear data, among which the lack of interpretability and transparency is the most prominent [3]. Traditional credit risk prediction methods fall into two categories: statistical models and machine learning models. Statistical models such as Logistic regression [4] are simple and easy to use, but their relatively strict assumptions make it difficult to capture nonlinear relationships in complex data. Machine learning models such as Random Forest (RF) [5], Support Vector Machine (SVM) [6], and Extreme Gradient Boosting (XGBoost) [7] perform relatively well on high-dimensional data, yet their interpretability is poor and they struggle to provide a clear, transparent decision-making process. Deep learning models such as the Multi-Layer Perceptron (MLP) [8] and Recurrent Neural Network (RNN) [9] have strong expressive ability, but in practical financial applications their black-box nature leaves the model severely lacking in transparency and interpretability, which is a major problem in the strictly regulated financial industry [10].
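As a concrete illustration of the interpretability gap discussed above, the hedged sketch below contrasts a logistic regression, whose per-feature coefficients can be read off directly, with a random forest, which exposes only aggregate feature importances. The synthetic data stands in for a credit dataset; this is not the KACDP model and the numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a credit dataset (features, default label).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("logit  AUC:", roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1]))
print("forest AUC:", roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1]))

# Interpretability contrast: the logistic model gives per-feature coefficients,
# while the forest offers only aggregate importances, not a transparent decision path.
print("logit coefficients:", logit.coef_[0][:5])
print("forest importances:", forest.feature_importances_[:5])
```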
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges
Yuan, Tongtong, Zhang, Xuange, Liu, Kun, Liu, Bo, Chen, Chen, Jin, Jian, Jiao, Zhenzhen
Surveillance videos are an essential component of daily life, with various critical applications, particularly in public security. However, current surveillance video tasks mainly focus on classifying and localizing anomalous events. Although existing methods have obtained considerable performance, they are limited to detecting and classifying predefined events and offer unsatisfactory semantic understanding. To address this issue, we propose a new research direction of surveillance video-and-language understanding and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences with an average length of 20 words, and its annotated videos total 110.7 hours. Furthermore, we benchmark SOTA models for four multimodal tasks on this newly created dataset, which serve as new baselines for surveillance video-and-language understanding. Through our experiments, we find that mainstream models used on previously available public datasets perform poorly on surveillance video, which demonstrates the new challenges in surveillance video-and-language understanding. To validate the effectiveness of UCA, we conduct experiments on multimodal anomaly detection. The results demonstrate that our multimodal surveillance learning can improve the performance of conventional anomaly detection tasks. All the experiments highlight the necessity of constructing this dataset to advance surveillance AI. The link to our dataset is provided at: https://xuange923.github.io/Surveillance-Video-Understanding.
SigFormer: Sparse Signal-Guided Transformer for Multi-Modal Human Action Segmentation
Liu, Qi, Liu, Xinchen, Liu, Kun, Gu, Xiaoyan, Liu, Wu
Multi-modal human action segmentation is a critical and challenging task with a wide range of applications. Most current approaches concentrate on the fusion of dense signals (i.e., RGB, optical flow, and depth maps). However, the potential contribution of sparse IoT sensor signals, which can be crucial for accurate recognition, has not been fully explored. To fill this gap, we introduce a Sparse signal-guided Transformer (SigFormer) that combines both dense and sparse signals. We employ mask attention to fuse localized features by constraining cross-attention to the regions where sparse signals are valid. However, since sparse signals are discrete, they lack sufficient information about temporal action boundaries. Therefore, in SigFormer, we propose to emphasize boundary information at two stages to alleviate this problem. In the first feature extraction stage, we introduce an intermediate bottleneck module to jointly learn both category and boundary features of each dense modality through inner loss functions. After the fusion of dense modalities and sparse signals, we then devise a two-branch architecture that explicitly models the interrelationship between action category and temporal boundary. Experimental results demonstrate that SigFormer outperforms state-of-the-art approaches on a multi-modal action segmentation dataset from real industrial environments, reaching an outstanding F1 score of 0.958. The code and pre-trained models are available at https://github.com/LIUQI-creat/SigFormer.
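The mask-attention idea, constraining cross-attention to the time steps where the sparse signal is valid, can be sketched as below. This is a minimal single-head illustration with assumed tensor shapes, not the authors' SigFormer implementation.

```python
import torch

def masked_cross_attention(query, key, value, valid_mask):
    """Cross-attention restricted to time steps where the sparse signal is valid.

    query:       (B, Tq, D)  dense-modality features
    key, value:  (B, Tk, D)  sparse-signal features
    valid_mask:  (B, Tk)     True where the sparse signal is present
    """
    d = query.size(-1)
    scores = torch.matmul(query, key.transpose(1, 2)) / d ** 0.5      # (B, Tq, Tk)
    scores = scores.masked_fill(~valid_mask.unsqueeze(1), float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    # Rows where no sparse signal is valid become all -inf -> NaN; zero them out.
    attn = torch.nan_to_num(attn, nan=0.0)
    return torch.matmul(attn, value)

# Example: 2 sequences, 100 time steps, 64-dim features, sparse signals ~30% of the time.
q = torch.randn(2, 100, 64)
kv = torch.randn(2, 100, 64)
valid = torch.rand(2, 100) > 0.7
out = masked_cross_attention(q, kv, kv, valid)    # (2, 100, 64)
```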
Ball Mill Fault Prediction Based on Deep Convolutional Auto-Encoding Network
Ai, Xinkun, Liu, Kun, Zheng, Wei, Fan, Yonggang, Wu, Xinwu, Zhang, Peilong, Wang, LiYe, Zhu, JanFeng, Pan, Yuan
Ball mills play a critical role in modern mining operations, making their bearing failures a significant concern due to the potential loss of production efficiency and the economic consequences. This paper presents an anomaly detection method based on a Deep Convolutional Auto-encoding Network (DCAN) for ball mill bearing fault detection. The proposed approach leverages vibration data collected during normal operation for training, overcoming challenges such as labeling issues and data imbalance often encountered in supervised learning methods. DCAN comprises convolutional feature extraction and transposed-convolutional feature reconstruction modules, demonstrating strong capabilities in signal processing and feature extraction. Additionally, the paper describes the practical deployment of the DCAN-based anomaly detection model for bearing fault detection, utilizing data from the ball mill bearings of Wuhan Iron & Steel Resources Group and fault data from NASA's bearing vibration dataset. Experimental results validate the DCAN model's reliability in recognizing fault vibration patterns. This method holds promise for enhancing bearing fault detection efficiency, reducing production interruptions, and lowering maintenance costs.
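A minimal sketch of the underlying pattern, an auto-encoder trained only on normal vibration windows with faults flagged by reconstruction error, is given below. The 1-D layer sizes and the percentile threshold are illustrative assumptions, not the paper's DCAN configuration.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Illustrative 1-D convolutional auto-encoder for vibration windows
    (window length should be divisible by 4 so shapes reconstruct exactly)."""
    def __init__(self, channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(channels, 16, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=2, padding=3), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(16, channels, kernel_size=8, stride=2, padding=3),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model, windows):
    """Per-window reconstruction error; large errors indicate abnormal vibration."""
    model.eval()
    with torch.no_grad():
        recon = model(windows)
    return ((windows - recon) ** 2).mean(dim=(1, 2))

model = ConvAutoencoder()
normal = torch.randn(32, 1, 1024)            # 32 windows of normal vibration data
# ... train with nn.MSELoss() on normal data only ...
scores = anomaly_scores(model, normal)
threshold = torch.quantile(scores, 0.99)     # e.g. 99th percentile of normal errors
```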
Language Guided Networks for Cross-modal Moment Retrieval
Liu, Kun, Ma, Huadong, Gan, Chuang
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment in an untrimmed video described by a natural language query. It poses great challenges for proper semantic alignment between the visual and linguistic domains. Existing methods independently extract the features of videos and sentences and use the sentence embedding only in the multi-modal fusion stage, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval. In the first feature extraction stage, we propose to jointly learn visual and language features to capture visual information that covers the complex semantics of the sentence query. Specifically, an early modulation unit is designed to modulate the visual feature extractor's feature maps by a linguistic embedding. We then adopt a multi-modal fusion module in the second fusion stage. Finally, to obtain a precise localizer, the sentence information is utilized to guide the process of predicting temporal positions. Specifically, a late guidance module is developed to linearly transform the output of the localization networks via a channel attention mechanism. Experimental results on two popular datasets demonstrate the superior performance of our proposed method on moment retrieval (improving Rank1@IoU0.5 by 5.8% on Charades-STA and 5.2% on TACoS). The source code for the complete system will be made publicly available.
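The early modulation and late guidance steps read like channel-wise feature modulation and channel attention driven by the sentence embedding. The sketch below shows that general pattern with illustrative module names and tensor shapes; it is not the exact LGN design.

```python
import torch
import torch.nn as nn

class EarlyModulation(nn.Module):
    """Scale and shift visual feature maps channel-wise from a sentence embedding."""
    def __init__(self, lang_dim, channels):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, channels)
        self.to_beta = nn.Linear(lang_dim, channels)

    def forward(self, feat, sent):                    # feat: (B, C, T), sent: (B, L)
        gamma = self.to_gamma(sent).unsqueeze(-1)
        beta = self.to_beta(sent).unsqueeze(-1)
        return gamma * feat + beta

class LateGuidance(nn.Module):
    """Channel attention on the localizer features, driven by the sentence embedding."""
    def __init__(self, lang_dim, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(lang_dim, channels), nn.Sigmoid())

    def forward(self, feat, sent):                    # feat: (B, C, T)
        return feat * self.gate(sent).unsqueeze(-1)

# Example with assumed dimensions: 300-d sentence embedding, 512 visual channels.
mod, guide = EarlyModulation(300, 512), LateGuidance(300, 512)
feat, sent = torch.randn(4, 512, 128), torch.randn(4, 300)
feat = guide(mod(feat, sent), sent)                   # (4, 512, 128)
```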
T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition
Liu, Kun (Beijing University of Posts and Telecommunications) | Liu, Wu (Beijing University of Posts and Telecommunications) | Gan, Chuang (Tsinghua University) | Tan, Mingkui (South China University of Technology) | Ma, Huadong (Beijing University of Posts and Telecommunications)
Video-based action recognition with deep neural networks has shown remarkable progress. However, most existing approaches are too computationally expensive due to complex network architectures. To address this problem, we propose a new real-time action recognition architecture, called Temporal Convolutional 3D Network (T-C3D), which learns video action representations in a hierarchical multi-granularity manner. Specifically, we combine a residual 3D convolutional neural network, which captures complementary information on the appearance of a single frame and the motion between consecutive frames, with a new temporal encoding method that explores the temporal dynamics of the whole video. Heavy computation is thus avoided at inference time, enabling real-time processing. On two challenging benchmark datasets, UCF101 and HMDB51, our method outperforms state-of-the-art real-time methods by over 5.4% in accuracy and is 2 times faster in inference speed (969 frames per second), while demonstrating recognition performance comparable to the state-of-the-art methods. The source code for the complete system as well as the pre-trained models are publicly available at https://github.com/tc3d.
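The temporal encoding idea, aggregating clip-level features from a 3D CNN into a single video-level representation, can be sketched as follows. Mean aggregation, the dummy backbone, and the tensor layout are assumptions for illustration, not the paper's exact encoding.

```python
import torch
import torch.nn as nn

class ClipAggregationNet(nn.Module):
    """Hypothetical sketch: per-clip 3D-CNN features pooled into a video-level prediction."""
    def __init__(self, clip_backbone, num_classes, feat_dim=128):
        super().__init__()
        self.backbone = clip_backbone          # any 3D CNN: (B, C, T, H, W) -> (B, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):                  # clips: (B, N, C, T, H, W), N clips per video
        b, n = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))        # (B*N, feat_dim)
        video_feat = feats.view(b, n, -1).mean(dim=1)     # aggregate across clips
        return self.classifier(video_feat)

# Tiny stand-in backbone for illustration only.
backbone = nn.Sequential(
    nn.Conv3d(3, 128, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
)
model = ClipAggregationNet(backbone, num_classes=101, feat_dim=128)
clips = torch.randn(2, 4, 3, 8, 56, 56)        # 2 videos, 4 clips of 8 frames each
logits = model(clips)                          # (2, 101)
```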