Pattern Recognition
Certification of Speaker Recognition Models to Additive Perturbations
Korzh, Dmitrii, Karimov, Elvir, Pautov, Mikhail, Rogov, Oleg Y., Oseledets, Ivan
Speaker recognition technology is applied in various tasks ranging from personal virtual assistants to secure access systems. However, the robustness of these systems against adversarial attacks, particularly additive perturbations, remains a significant challenge. In this paper, we pioneer applying robustness certification techniques, originally developed for the image domain, to speaker recognition. We close this gap by transferring and improving randomized smoothing certification techniques against norm-bounded additive perturbations for classification and few-shot learning tasks to speaker recognition. We demonstrate the effectiveness of these methods on the VoxCeleb 1 and 2 datasets for several models. We expect this work to improve voice-biometry robustness, establish a new certification benchmark, and accelerate research on certification methods in the audio domain.
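To make the certification idea concrete, here is a minimal sketch of randomized smoothing in its standard image-domain form: Gaussian noise is added to the input, the majority class is estimated by sampling, and an L2 radius R = sigma * Phi^{-1}(p_lower) is certified. It is not the authors' implementation. The toy classifier, the Hoeffding confidence bound, and all parameter values are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def toy_classifier(x):
    """Hypothetical base classifier: class 1 if the mean sample is positive."""
    return 1 if sum(x) / len(x) > 0 else 0


def hoeffding_lower_bound(successes, n, alpha):
    """One-sided (1 - alpha) lower confidence bound on a binomial proportion."""
    return successes / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))


def certified_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower); None means abstain."""
    if p_lower <= 0.5:
        return None
    p_lower = min(p_lower, 1.0 - 1e-12)  # inv_cdf is undefined at exactly 1
    return sigma * NormalDist().inv_cdf(p_lower)


def sample_votes(classifier, x, sigma, n, rng):
    """Count predicted labels over n Gaussian-noised copies of x."""
    votes = {}
    for _ in range(n):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        label = classifier(noisy)
        votes[label] = votes.get(label, 0) + 1
    return votes


def certify(classifier, x, sigma, n0=100, n=1000, alpha=0.001, seed=0):
    """Return (predicted label, certified L2 radius or None)."""
    rng = random.Random(seed)
    # Stage 1: guess the top class from a small batch of samples.
    guess = sample_votes(classifier, x, sigma, n0, rng)
    top_label = max(guess, key=guess.get)
    # Stage 2: bound its probability using fresh, independent samples.
    votes = sample_votes(classifier, x, sigma, n, rng)
    p_lower = hoeffding_lower_bound(votes.get(top_label, 0), n, alpha)
    return top_label, certified_radius(p_lower, sigma)


label, radius = certify(toy_classifier, [1.0] * 16, sigma=0.25)
```

The Hoeffding bound is used here only to keep the sketch dependency-free. It is looser than the exact binomial (Clopper-Pearson) bound often used in the randomized smoothing literature, so the radius it yields is conservative.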
Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library
Tarride, Solène, Schneider, Yoann, Generali-Lince, Marie, Boillet, Mélodie, Abadie, Bastien, Kermorvant, Christopher
PyLaia is one of the most popular open-source software libraries for Automatic Text Recognition (ATR), delivering strong performance in terms of speed and accuracy. In this paper, we outline our recent contributions to the PyLaia library, focusing on the incorporation of reliable confidence scores and the integration of statistical language modeling during decoding. Our implementation provides an easy way to combine PyLaia with n-gram language models at different levels. One of the highlights of this work is that language models are completely auto-tuned: they can be built and used easily without any expert knowledge, and without requiring any additional data. To demonstrate the significance of our contribution, we evaluate PyLaia's performance on twelve datasets, both with and without language modeling. The results show that decoding with small language models improves the Word Error Rate by 13% and the Character Error Rate by 12% on average. Additionally, we conduct an analysis of confidence scores and highlight the importance of calibration techniques.
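The error rates quoted above are conventionally defined as the edit (Levenshtein) distance between the recognized text and the reference, divided by the reference length, counted in words or in characters. A minimal sketch of that standard definition follows. It is not PyLaia's evaluation code, and the function names and example strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


example_cer = cer("kitten", "sitting")
example_wer = wer("the cat sat", "the cat sit")
```

Because both rates are normalized by the reference length, a hypothesis with many insertions can score above 100%.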
Mining patterns in syntax trees to automate code reviews of student solutions for programming exercises
Van Petegem, Charlotte, Demeyere, Kasper, Maertens, Rien, Strijbol, Niko, De Wever, Bram, Mesuere, Bart, Dawyndt, Peter
In programming education, providing manual feedback is essential but labour-intensive, posing challenges in consistency and timeliness. We introduce ECHO, a machine learning method to automate the reuse of feedback in educational code reviews by analysing patterns in abstract syntax trees. This study investigates two primary questions: whether ECHO can predict feedback annotations to specific lines of student code based on previously added annotations by human reviewers (RQ1), and whether its training and prediction speeds are suitable for using ECHO for real-time feedback during live code reviews by human reviewers (RQ2). Our results, based on annotations from both automated linting tools and human reviewers, show that ECHO can accurately and quickly predict appropriate feedback annotations. Its efficiency in processing and its flexibility in adapting to feedback patterns can significantly reduce the time and effort required for manual feedback provisioning in educational settings.
Federated Learning with Only Positive Labels by Exploring Label Correlations
An, Xuming, Wang, Dui, Shen, Li, Luo, Yong, Hu, Han, Du, Bo, Wen, Yonggang, Tao, Dacheng
Federated learning (FL) [1] is a novel machine learning paradigm that trains an algorithm across multiple decentralized clients (such as edge devices) or servers without exchanging local data samples. Since clients can only access the local datasets, the user's privacy can be well protected, and this paradigm has attracted increasing attention in recent years [2]-[4]. In this paper, we study the challenge problem of learning a multi-label classification model [5], [6] under the federated learning setting, where each user has only local positive data related to a single class label [7]. This setting can be treated as the extremely label-skew case in the data heterogeneity of federated learning, which is popular in real-world applications.
This approach, however, treats different labels equally in the spreadout (class embedding separation) process. That is, embeddings of class labels that are highly correlated and significantly different in multiple labels' space are separated in the same way. This is not reasonable since embeddings should be close for correlated labels, and dissimilar otherwise. For example, we assume that the class labels 'Desktop computer' and 'Desk' often appear in the same instance, thus these two corresponding class embedding vectors can be deemed high-correlation and may be relatively close compared with others, such as class labels 'aircraft', 'automobile', etc. Besides, since the instance and class embeddings are trained on clients and
GatedLexiconNet: A Comprehensive End-to-End Handwritten Paragraph Text Recognition System
Kumari, Lalita, Singh, Sukhdeep, Rathore, Vaibhav Varish Singh, Sharma, Anuj
The Handwritten Text Recognition problem has been a challenge for researchers for the last few decades, especially in the domain of computer vision, a subdomain of pattern recognition. Variability of texts amongst writers, cursiveness, and different font styles of handwritten texts, together with degradation of historical text images, make it a challenging problem. Recognizing scanned document images in neural network-based systems typically involves a two-step approach: segmentation and recognition. However, this method has several drawbacks. These shortcomings encompass challenges in identifying text regions, analyzing layout diversity within pages, and establishing accurate ground truth segmentation. Consequently, these processes are prone to errors, leading to bottlenecks in achieving high recognition accuracies. Thus, in this study, we present an end-to-end paragraph recognition system that incorporates internal line segmentation and an encoder based on gated convolutional layers. Gating is a mechanism that controls the flow of information and allows adaptive selection of the more relevant features in handwritten text recognition models. The attention module plays an important role in performing internal line segmentation, allowing the page to be processed line-by-line. During the decoding step, we have integrated a connectionist temporal classification-based word beam search decoder as a post-processing step. In this work, we have extended the existing LexiconNet by carefully applying and utilizing gated convolutional layers in the existing deep neural network. Our results at line and page levels also favour our new GatedLexiconNet. This study reports character error rates of 2.27% on IAM, 0.9% on RIMES, and 2.13% on READ-2016, and word error rates of 5.73% on IAM, 2.76% on RIMES, and 6.52% on READ-2016.
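The gating described above can be illustrated with a generic gated convolution: one convolution produces features, and a second one, passed through a sigmoid, decides how much of each feature passes through. This is a one-dimensional sketch of the general mechanism, assuming a GLU-style multiplicative gate. The kernels and sizes are invented here, and the actual layer layout of GatedLexiconNet is not specified in this abstract.

```python
import math


def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_conv1d(signal, feature_kernel, gate_kernel):
    """Multiply each convolved feature by a learned gate value in (0, 1)."""
    if len(feature_kernel) != len(gate_kernel):
        raise ValueError("kernels must have the same length")
    features = conv1d(signal, feature_kernel)
    gates = [sigmoid(g) for g in conv1d(signal, gate_kernel)]
    return [f * g for f, g in zip(features, gates)]


signal = [1.0, 2.0, 3.0, 4.0]
half_open = gated_conv1d(signal, [1.0, 1.0], [0.0, 0.0])    # every gate is 0.5
closed = gated_conv1d(signal, [1.0, 1.0], [-10.0, -10.0])   # gates near 0
```

In a trained network both kernels are learned, so the gate can suppress features at positions that carry little useful information and pass them elsewhere.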
Improving Chinese Character Representation with Formation Tree
Hong, Yang, Li, Yinfei, Qiao, Xiaojun, Li, Rui, Zhang, Junsong
Learning effective representations for Chinese characters presents unique challenges, primarily due to the vast number of characters and their continuous growth, which requires models to handle an expanding category space. Additionally, the inherent sparsity of character usage complicates the generalization of learned representations. Prior research has explored radical-based sequences to overcome these issues, achieving progress in recognizing unseen characters. However, these approaches fail to fully exploit the inherent tree structure of such sequences. To address these limitations and leverage established data properties, we propose Formation Tree-CLIP (FT-CLIP). This model utilizes formation trees to represent characters and incorporates a dedicated tree encoder, significantly improving performance in both seen and unseen character recognition tasks. We further introduce masking for both character images and tree nodes, enabling efficient and effective training. This approach accelerates training significantly (by a factor of 2 or more) while enhancing accuracy. Extensive experiments show that processing characters through formation trees aligns better with their inherent properties than direct sequential methods, significantly enhancing the generality and usability of the representations.
Revisiting Noise Resilience Strategies in Gesture Recognition: Short-Term Enhancement in Surface Electromyographic Signal Analysis
Guo, Weiyu, Qiao, Ziyue, Sun, Ying, Xiong, Hui
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D interactive scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios with learnable denoising using sEMG intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM), which can be easily integrated with various models. STEM offers several benefits: 1) Learnable denoising, enabling noise reduction without manual data augmentation; 2) Scalability, adaptable to various models; and 3) Cost-effectiveness, achieving short-term enhancement through minimal weight-sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short Term Enhanced Transformer (STET). Compared with best-competing approaches, the impact of noise on STET is reduced by more than 20%. We also report promising results on both classification and regression datasets and demonstrate that STEM generalizes across different gesture recognition tasks.
Spatial Context-based Self-Supervised Learning for Handwritten Text Recognition
Penarrubia, Carlos, Garrido-Munoz, Carlos, Valero-Mas, Jose J., Calvo-Zaragoza, Jorge
Handwritten text recognition (HTR) is the research area in the field of computer vision whose objective is to transcribe the textual content of a written manuscript into a digital machine-readable format [73]. This field not only plays a key role in the current digital era of handwriting by electronic means (such as tablets) [11], but is also of paramount relevance for the preservation, indexing and dissemination of historical manuscripts that exist solely in a physical format [56]. HTR has developed considerably over the last decade owing to the emergence of Deep Learning [57], which has greatly increased its performance. However, in order to attain competitive results, these solutions usually require large volumes of manually-labelled data, which is the principal bottleneck of this method. One means by which to alleviate this problem, Self-Supervised Learning (SSL), has recently gained considerable attention from the research community [61]. SSL employs what is termed as a pretext task to leverage collections of unlabelled data for the training of neural models in order to obtain descriptive and intelligible representations [8], thus reducing the need for large amounts of labelled data [4]. The pretext tasks can be framed in different categories according to their working principle [34, 61], with the following being some of the main existing families: (i) image generation strategies [63, 46], which focus on recovering the original distribution of the data from defined distortions or corruptions; (ii) contrastive learning methods [60, 33], whose objective is to learn representative and discernible codifications of the data, and (iii) spatial context methods [27, 58], which focus on either estimating geometric transformations performed on the data [27]--i.e.
MathWriting: A Dataset For Handwritten Mathematical Expression Recognition
Gervais, Philippe, Fadeeva, Asya, Maksai, Andrii
Online text recognition models have improved a lot in the past few years, because of improvements in model structure and also because of bigger datasets. Mathematical expression (ME) recognition is a more complex task that has not received as much attention. However, the problem is different from text recognition in a number of interesting ways, which can prevent improvements on one from transferring to the other. Though MEs share most of their symbols with text, they follow a more rigid structure which is also two-dimensional. Where text can be treated to some extent as a one-dimensional problem amenable to sequence modeling, MEs cannot, because the relative position of symbols in space is meaningful.
Integration of Self-Supervised BYOL in Semi-Supervised Medical Image Recognition
Feng, Hao, Jia, Yuanzhe, Xu, Ruijia, Prasad, Mukesh, Anaissi, Ali, Braytee, Ali
Image recognition techniques rely heavily on abundant labeled data, particularly in medical contexts. Addressing the challenges associated with obtaining labeled data has led to the prominence of self-supervised learning and semi-supervised learning, especially in scenarios with limited annotated data. In this paper, we propose an approach that integrates self-supervised learning into semi-supervised models to enhance medical image recognition. Our methodology begins with pre-training on unlabeled data using the BYOL method. Subsequently, we merge pseudo-labeled and labeled datasets to construct a neural network classifier, refining it through iterative fine-tuning. Experimental results on three different datasets demonstrate that our approach optimally leverages unlabeled data, outperforming existing methods in terms of accuracy for medical image recognition.
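For readers unfamiliar with BYOL, its two core ingredients can be sketched in a few lines: a loss that pulls the online network's prediction toward the target network's projection (equal to 2 minus twice their cosine similarity), and an exponential moving average (EMA) update of the target weights. The sketch below uses plain lists in place of tensors. The vectors, the weight lists, and the value of `tau` are hypothetical, and this is not the authors' training pipeline.

```python
import math


def byol_loss(online_prediction, target_projection):
    """Normalized MSE between two vectors, i.e. 2 - 2 * cosine similarity."""
    dot = sum(p * z for p, z in zip(online_prediction, target_projection))
    norm_p = math.sqrt(sum(p * p for p in online_prediction))
    norm_z = math.sqrt(sum(z * z for z in target_projection))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)


def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights slowly toward online weights (no gradient step)."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


aligned = byol_loss([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])   # unrelated directions
new_target = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9)
```

In BYOL the loss is typically symmetrized by swapping the two augmented views and summing both terms, and only the online network receives gradients.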