McLoughlin, Ian
On-Device LLMs for SMEs: Challenges and Opportunities
Yee, Jeremy Stephen Gabriel, Ng, Pai Chet, Wang, Zhengkui, McLoughlin, Ian, Ng, Aik Beng, See, Simon
This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights that can strengthen the technological resilience of SMEs as they integrate LLMs.
Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection
Cai, Pengfei, Song, Yan, Jiang, Nan, Gu, Qing, McLoughlin, Ian
A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and their performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes; this is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high-performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.
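A minimal sketch of the pseudo-labelling idea described above, assuming stand-in frame embeddings: a GMM's components act as prototypes, its posteriors become soft frame-level targets, and binary cross-entropy gives each prototype an independent loss contribution (hypothetical Python, not the authors' implementation):

    # Sketch of PMAM-style prototype pseudo-labelling (hypothetical, not the paper's code).
    import numpy as np
    import torch
    import torch.nn.functional as F
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    frames = rng.normal(size=(1000, 64)).astype(np.float32)   # stand-in frame embeddings

    # Fit a GMM whose components play the role of prototypes; posteriors give
    # per-prototype soft assignments for every frame.
    gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(frames)
    targets = torch.from_numpy(gmm.predict_proba(frames).astype(np.float32))  # (frames, prototypes)

    # A masked audio model would predict prototype activations for masked frames.
    # BCE treats each prototype independently (unlike InfoNCE), so several
    # prototypes can be "on" for the same frame.
    logits = torch.randn(1000, 8, requires_grad=True)         # stand-in model outputs
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()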
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
Gao, Zhifu, Zhang, Shiliang, McLoughlin, Ian, Yan, Zhijie
Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder that generates tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed to enable parallel generation. However, due to an independence assumption within the output tokens, the performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: firstly, to accurately predict the number of output tokens and extract hidden variables; secondly, to enhance modeling of the interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. This utilizes a continuous integrate-and-fire (CIF) based predictor to predict the number of tokens and generate hidden variables. A glancing language model (GLM) sampler then generates semantic embeddings to enhance the NAR decoder's ability to model context interdependence. Finally, we design a strategy to generate negative samples for minimum word error rate training to further improve performance. Experiments using the public AISHELL-1 and AISHELL-2 benchmarks and an industrial-level 20,000-hour task demonstrate that the proposed Paraformer attains performance comparable to the state-of-the-art AR transformer, with more than 10x speedup.
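The continuous integrate-and-fire mechanism can be sketched in a few lines: per-frame weights are accumulated, and a token embedding is emitted each time the accumulator crosses a threshold, so the predicted token count is roughly the sum of the weights. This is an illustrative re-implementation of the general CIF idea, not Paraformer's code:

    # Minimal CIF sketch (illustrative; not Paraformer's implementation).
    import torch

    def cif(encoder_out, alpha, threshold=1.0):
        """Integrate per-frame weights `alpha`; fire one token embedding whenever
        the accumulator crosses `threshold`. Trailing residue is dropped for brevity."""
        acc = 0.0
        frame_acc = torch.zeros(encoder_out.size(1))
        tokens = []
        for t in range(encoder_out.size(0)):
            a = float(alpha[t])
            if acc + a < threshold:
                acc += a
                frame_acc = frame_acc + a * encoder_out[t]
            else:
                spill = threshold - acc            # part of this frame closes the token
                tokens.append(frame_acc + spill * encoder_out[t])
                acc = a - spill                    # remainder starts the next token
                frame_acc = acc * encoder_out[t]
        return torch.stack(tokens), len(tokens)

    enc = torch.randn(50, 256)                     # 50 frames of encoder output
    alpha = torch.sigmoid(torch.randn(50))         # per-frame firing weights, in (0, 1)
    hidden, n_tokens = cif(enc, alpha)
    print(hidden.shape, n_tokens)                  # token count is roughly alpha.sum()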
A Light-weight Deep Learning Model for Remote Sensing Image Classification
Pham, Lam, Le, Cam, Ngo, Dat, Nguyen, Anh, Lampert, Jasmin, Schindler, Alexander, McLoughlin, Ian
In this paper, we present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the aerial scene of a remote sensing image. To this end, we first evaluate various benchmark convolutional neural network (CNN) architectures: MobileNet V1/V2, ResNet 50/151V2, InceptionV3/InceptionResNetV2, EfficientNet B0/B7, DenseNet 121/201, and ConvNeXt Tiny/Large. The best performing models are then selected to train a compact model in a teacher-student arrangement. The knowledge distillation from the teacher aims to achieve high performance with significantly reduced complexity. Extensive experiments on the NWPU-RESISC45 benchmark show that our proposed teacher-student models outperform state-of-the-art systems and have the potential to be applied on a wide range of edge devices.
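The teacher-student step is standard knowledge distillation; below is a minimal sketch of a Hinton-style distillation loss (the paper's exact recipe may differ; the 45-class shape merely mirrors NWPU-RESISC45):

    # Generic knowledge-distillation loss sketch (not necessarily the paper's recipe).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        """Blend hard-label cross-entropy with KL divergence to the teacher's
        temperature-softened output distribution."""
        hard = F.cross_entropy(student_logits, labels)
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        return alpha * hard + (1.0 - alpha) * soft

    student = torch.randn(8, 45, requires_grad=True)   # student logits; 45 scene classes
    teacher = torch.randn(8, 45)                       # teacher (or teacher-ensemble) logits
    labels = torch.randint(0, 45, (8,))
    distillation_loss(student, teacher, labels).backward()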
Towards More Accurate Automatic Sleep Staging via Deep Transfer Learning
Phan, Huy, Chรฉn, Oliver Y., Koch, Philipp, Lu, Zongqing, McLoughlin, Ian, Mertins, Alfred, De Vos, Maarten
Although large annotated sleep databases are publicly available and might be used to train automated scoring algorithms, it can still be a challenge to develop an optimal algorithm for an individual sleep study, which may have few subjects or rely on a different recording setup. Both directly applying a learned algorithm and retraining the algorithm on a rather small database are suboptimal, and state-of-the-art sleep staging algorithms based on deep neural networks demand a large amount of data for training. This work presents a deep transfer learning approach to overcome the channel mismatch problem and enable transferring knowledge from a large dataset to a small cohort for automatic sleep staging. We start from a generic end-to-end deep learning framework for sequence-to-sequence sleep staging and derive two networks adhering to this framework as a device for transfer learning. The networks are first trained in the source domain (i.e. the large database). The pretrained networks are then finetuned in the target domain (i.e. the small cohort) to complete knowledge transfer. We employ the Montreal Archive of Sleep Studies (MASS) database, consisting of 200 subjects, as the source domain and study deep transfer learning on four different target domains: the Sleep Cassette and Sleep Telemetry subsets of the Sleep-EDF Expanded database, the Surrey-cEEGGrid database, and the Surrey-PSG database. The target domains are purposely adopted to cover different degrees of channel mismatch to the source domain. Our experimental results show significant performance improvement on automatic sleep staging on the target domains achieved with the proposed deep transfer learning approach, and we discuss the impact of various fine-tuning approaches. Index Terms: Automatic sleep staging, sequence-to-sequence, deep learning, transfer learning.
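The transfer recipe itself is the usual pretrain-then-finetune pattern; a generic PyTorch sketch follows, in which the checkpoint name and the toy network are hypothetical. Freezing the feature layers and retraining only the classifier is one of the fine-tuning variants one might compare:

    # Generic pretrain-then-finetune sketch (toy network; not the paper's architecture).
    import torch
    import torch.nn as nn

    # Stand-in for a sequence-to-sequence staging network: feature layers + classifier.
    model = nn.Sequential(
        nn.Sequential(nn.Linear(128, 64), nn.ReLU()),  # "feature" layers
        nn.Linear(64, 5),                              # 5 sleep stages
    )

    # Source-domain pretraining would produce a checkpoint such as this (hypothetical name):
    # model.load_state_dict(torch.load("pretrained_mass.pt"))

    # Fine-tuning variant: freeze the feature layers and retrain only the classifier,
    # which can be safer when the target cohort is small or the channel mismatch is
    # large; the alternative is fine-tuning the whole network at a low learning rate.
    for p in model[0].parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)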
Spatio-Temporal Attention Pooling for Audio Scene Classification
Phan, Huy, Chรฉn, Oliver Y., Pham, Lam, Koch, Philipp, De Vos, Maarten, McLoughlin, Ian, Mertins, Alfred
Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors.
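A small sketch of the described pooling, with stand-in tensors: separate temporal and spatial attention vectors are combined by an outer product into a 2-D mask that weights the convolutional recurrent output before pooling (illustrative only; in the real model the attention vectors come from learned attention layers):

    # Outer-product attention pooling sketch (illustrative; not the authors' code).
    import torch
    import torch.nn.functional as F

    feat = torch.randn(30, 40)                    # (time, channels): recurrent-layer output
    att_t = F.softmax(torch.randn(30), dim=0)     # temporal attention vector (stand-in)
    att_s = F.softmax(torch.randn(40), dim=0)     # spatial (channel) attention vector (stand-in)

    mask = torch.outer(att_t, att_s)              # 2-D attention mask via outer product
    pooled = (mask * feat).sum(dim=0)             # attention-weighted pooling over time -> (channels,)
    logits = torch.nn.Linear(40, 10)(pooled)      # stand-in scene classifier head
    print(logits.shape)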
Unifying Isolated and Overlapping Audio Event Detection with Multi-Label Multi-Task Convolutional Recurrent Neural Networks
Phan, Huy, Chรฉn, Oliver Y., Koch, Philipp, Pham, Lam, McLoughlin, Ian, Mertins, Alfred, De Vos, Maarten
We propose a multi-label multi-task framework based on a convolutional recurrent neural network to unify detection of isolated and overlapping audio events. The framework leverages the power of convolutional recurrent neural network architectures; convolutional layers learn effective features over which higher recurrent layers perform sequential modelling. Furthermore, the output layer is designed to handle arbitrary degrees of event overlap. At each time step in the recurrent output sequence, an output triple is dedicated to each event category of interest to jointly model event occurrence and temporal boundaries. That is, the network jointly determines whether an event of this category occurs, and when it occurs, by estimating onset and offset positions at each recurrent time step. We then introduce three sequential losses for network training: multi-label classification loss, distance estimation loss, and confidence loss. We demonstrate good generalization on two datasets: ITC-Irst for isolated audio event detection, and TUT-SED-Synthetic-2016 for overlapping audio event detection.
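The output design can be sketched directly from this description: at every recurrent time step, each event class gets a triple of (activity logit, onset distance, offset distance), trained with a multi-label classification loss plus a distance loss on active frames. Shapes and targets below are stand-ins, and the confidence loss is omitted for brevity:

    # Sketch of the per-class output triple at each time step (illustrative shapes).
    import torch
    import torch.nn.functional as F

    T, C = 100, 6                                     # recurrent time steps, event categories
    out = torch.randn(T, C, 3, requires_grad=True)    # one triple per class and step
    act_logit, onset, offset = out.unbind(-1)         # (activity, onset distance, offset distance)

    y_act = torch.randint(0, 2, (T, C)).float()       # ground-truth event activity
    y_on, y_off = torch.rand(T, C), torch.rand(T, C)  # stand-in normalised onset/offset targets

    cls_loss = F.binary_cross_entropy_with_logits(act_logit, y_act)   # multi-label classification
    dist_loss = (y_act * ((onset - y_on) ** 2                         # distance estimation loss,
                          + (offset - y_off) ** 2)).mean()            # counted on active frames only
    (cls_loss + dist_loss).backward()                                 # confidence loss omitted here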
Learning Compact Structural Representations for Audio Events Using Regressor Banks
Phan, Huy, Maass, Marco, Hertel, Lars, Mazur, Radoslaw, McLoughlin, Ian, Mertins, Alfred
We introduce a new learned descriptor for audio signals which is efficient for event representation. The entries of the descriptor are produced by evaluating a set of regressors on the input signal. The regressors are class-specific and trained using the random regression forests framework. Given an input signal, each regressor estimates the onset and offset positions of the target event. The estimation confidence scores output by a regressor are then used to quantify how well the target event aligns with the temporal structure of the corresponding category. Our proposed descriptor has two advantages. First, it is compact, i.e. the dimensionality of the descriptor is equal to the number of event classes. Second, we show that even simple linear classification models, trained on our descriptor, yield better accuracies on the audio event classification task than not only nonlinear baselines but also the state-of-the-art results.
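A rough sketch of the regressor-bank idea, assuming stand-in features and targets: one random regression forest per class predicts (onset, offset), and inter-tree agreement serves here as a stand-in for the paper's confidence scores, yielding a descriptor with one entry per class:

    # Regressor-bank descriptor sketch (hypothetical confidence measure, not the paper's).
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n_classes, X = 5, rng.normal(size=(200, 32))      # stand-in event features
    forests = []
    for c in range(n_classes):
        y = rng.uniform(size=(200, 2))                # stand-in (onset, offset) targets per class
        forests.append(RandomForestRegressor(n_estimators=20, random_state=c).fit(X, y))

    def describe(x):
        """Descriptor with one entry per class: higher = trees agree more on onset/offset."""
        d = np.empty(n_classes)
        for c, f in enumerate(forests):
            per_tree = np.stack([t.predict(x[None, :]) for t in f.estimators_])  # (trees, 1, 2)
            d[c] = 1.0 / (1.0 + per_tree.std(axis=0).mean())
        return d

    print(describe(X[0]))                             # compact, n_classes-dimensional descriptor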
Evolutionary Clustering and Analysis of User Behaviour in Online Forums
Morrison, Donn, McLoughlin, Ian, Hogan, Alice, Hayes, Conor (all Digital Enterprise Research Institute)
In this paper we cluster and analyse temporal user behaviour in online communities. We adapt a simple unsupervised clustering algorithm to an evolutionary setting where we cluster users into prototypical behavioural roles based on features derived from their ego-centric reply-graphs. We then analyse changes in the role membership of the users over time, the change in role composition of forums over time and examine the differences between forums in terms of role composition. We perform this analysis on 200 forums from a popular national bulletin board and 14 enterprise technical support forums.
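One plausible reading of this pipeline as code: cluster each time window with a simple unsupervised algorithm, then align cluster identities across windows so that role labels stay comparable over time. The features, window handling, and centroid matching below are all hypothetical stand-ins, not the authors' implementation:

    # Evolutionary clustering sketch: per-window k-means with role alignment across windows.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    windows = [rng.normal(size=(300, 6)) for _ in range(4)]   # stand-in ego-network features
    k, prev_centroids, roles = 4, None, []

    for X in windows:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        centroids, labels = km.cluster_centers_, km.labels_
        if prev_centroids is not None:
            # Match this window's clusters to last window's roles so that
            # "role r" keeps denoting the same behavioural prototype over time.
            row, col = linear_sum_assignment(cdist(prev_centroids, centroids))
            remap = {int(c): int(r) for r, c in zip(row, col)}
            labels = np.array([remap[int(l)] for l in labels])
            centroids = centroids[col]                        # reorder centroids by role
        prev_centroids = centroids
        roles.append(labels)                                  # role membership per window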