 Pattern Recognition


Understanding with toy surrogate models in machine learning

arXiv.org Artificial Intelligence

Unlike regular models, these very simple models, often referred to as toy models, are not required to be linked to the real world through structural similarity or resemblance relations. They are not meant to be approximations of the target world system, and in some cases, they are not even required to be representational. In semantic terms, they do not accurately map onto their targets. Despite these limitations, they are still useful in understanding theoretical concepts and possible configurations of the target system. Paradigmatic examples of toy models include Boyle's law and the Ising model in physics, the Lotka-Volterra model in population ecology, and the Schelling model in the social sciences (Weisberg, 2013). In recent years, philosophers of science have become interested in toy models (Grüne-Yanoff, 2009; Luczak, 2017; Reutlinger et al., 2018; Frigg & Nguyen, 2017; Nguyen, 2020). The main purpose of this literature is to explore the nature of these models and examine how they perform their epistemic function. Despite lacking the regular descriptive and predictive features of full-scale scientific models, toy models often offer an elementary understanding of a phenomenon. These authors differ in their definitions of "toy model" and in their assessment of the importance of representation in modelling generally, but they all agree that toy models play an important epistemic role in scientific research, exploration, and pedagogy.


Reviews: Bilevel Distance Metric Learning for Robust Image Recognition

Neural Information Processing Systems

Summary: The authors propose a bilevel method for metric learning. The lower level extracts discriminative features from the data using a sparse coding scheme with graph regularization, which effectively detects the data's underlying geometric structure. The upper level is a classic metric learning approach that uses the learned sparse coefficients. These two components are integrated into a joint optimization problem, and an efficient optimization algorithm is developed accordingly. New data can then be classified based on the learned dictionary and the corresponding metric. In the experiments, the authors demonstrate that the model provides more discriminative features from high-dimensional data while being more robust to noise.


Reviews: A Simple Cache Model for Image Recognition

Neural Information Processing Systems

This paper presents a cache model to be used in image recognition tasks. The authors argue that class-specific information can be retrieved from earlier layers of the network to improve the accuracy of an already trained model, without having to retrain or fine-tune it. This is achieved by extracting and caching the activations of some layers, along with the class label, at training time. At test time, a similarity measure is used to calculate how close the input is to the information stored in memory. Experiments show that performance improves on CIFAR-10/100 and ImageNet.
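The caching idea described in this review can be illustrated with a minimal sketch. It is not the authors' implementation: the review only says that "a similarity measure" is used, so the cosine similarity, the sharpness parameter `theta`, the mixing weight `lam`, and all function names below are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cache_probs(query, keys, labels, num_classes, theta=10.0):
    """Class distribution from similarity-weighted votes over cached items.

    keys   -- cached activations stored at training time (assumed layout)
    labels -- class index stored alongside each cached activation
    theta  -- hypothetical sharpness parameter for the similarity weights
    """
    votes = [0.0] * num_classes
    for key, label in zip(keys, labels):
        votes[label] += math.exp(theta * cosine(query, key))
    total = sum(votes)
    return [v / total for v in votes]

def combine(network_probs, cache_p, lam=0.5):
    """Linear mix of the trained network's prediction and the cache prediction."""
    return [(1 - lam) * p + lam * c for p, c in zip(network_probs, cache_p)]

# Toy cache: two items of class 0, one item of class 1.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [0, 0, 1]

# A test activation close to the class-0 items, with an undecided network.
cache_p = cache_probs([1.0, 0.05], keys, labels, num_classes=2)
final = combine([0.5, 0.5], cache_p, lam=0.5)
```

In this toy case the network alone is undecided, and the cache shifts the combined prediction toward class 0 without any retraining, which is the behavior the review describes.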


Masked Autoencoder with Swin Transformer Network for Mitigating Electrode Shift in HD-EMG-based Gesture Recognition

arXiv.org Artificial Intelligence

Multi-channel surface electromyography (sEMG), also referred to as high-density sEMG (HD-sEMG), plays a crucial role in improving gesture recognition performance for myoelectric control. Pattern recognition models developed based on HD-sEMG, however, are vulnerable to changing recording conditions (e.g., signal variability due to electrode shift). This has resulted in significant degradation in performance across subjects and sessions. In this context, the paper proposes the Masked Autoencoder with Swin Transformer (MAST) framework, where training is performed on a masked subset of HD-sEMG channels. A combination of four masking strategies (random block masking, temporal masking, sensor-wise random masking, and multi-scale masking) is used to learn latent representations and increase robustness against electrode shift. The masked data is then passed through MAST's three-path encoder-decoder structure, leveraging a multi-path Swin-Unet architecture that simultaneously captures time-domain, frequency-domain, and magnitude-based features of the underlying HD-sEMG signal. These augmented inputs are then used in a self-supervised pre-training fashion to improve the model's generalization capabilities. Experimental results demonstrate the superior performance of the proposed MAST framework in comparison to its counterparts.
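As a rough illustration of three of the masking strategies named above, the sketch below builds boolean masks over a (time x channel) grid. The abstract does not specify how the masks are constructed, so the grid layout, the function names, and every parameter here are assumptions. Multi-scale masking is left out because the abstract gives no detail on it.

```python
import random

def temporal_mask(n_time, n_chan, start, length):
    """Hide a contiguous span of time steps across all channels."""
    return [[start <= t < start + length for _ in range(n_chan)]
            for t in range(n_time)]

def sensor_mask(n_time, n_chan, n_masked, rng):
    """Hide n_masked randomly chosen channels for the whole recording."""
    hidden = set(rng.sample(range(n_chan), n_masked))
    return [[c in hidden for c in range(n_chan)] for _ in range(n_time)]

def block_mask(n_time, n_chan, block_t, block_c, rng):
    """Hide one randomly placed rectangular block of the grid."""
    t0 = rng.randrange(n_time - block_t + 1)
    c0 = rng.randrange(n_chan - block_c + 1)
    return [[t0 <= t < t0 + block_t and c0 <= c < c0 + block_c
             for c in range(n_chan)]
            for t in range(n_time)]

def apply_mask(signal, mask, fill=0.0):
    """Replace masked entries (mask value True) with a fill value."""
    return [[fill if m else x for x, m in zip(row, mask_row)]
            for row, mask_row in zip(signal, mask)]

def count_masked(mask):
    """Number of hidden entries in a mask."""
    return sum(sum(row) for row in mask)

rng = random.Random(0)           # fixed seed so the example is repeatable
signal = [[1.0] * 4 for _ in range(8)]   # 8 time steps, 4 channels (toy data)

m_time = temporal_mask(8, 4, start=2, length=3)
m_sensor = sensor_mask(8, 4, n_masked=2, rng=rng)
m_block = block_mask(8, 4, block_t=2, block_c=2, rng=rng)
masked = apply_mask(signal, m_time)
```

A masked autoencoder would then be trained to reconstruct the hidden entries from the visible ones. That training step is outside the scope of this sketch.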


A large-scale operational study of fingerprint quality and demographics

arXiv.org Artificial Intelligence

Abstract: Even though a few initial works have shown, on small sets of data, some level of bias in the performance of fingerprint recognition technology with respect to certain demographic groups, there is still not sufficient evidence to understand the impact that certain factors such as gender, age or finger-type may have on fingerprint quality and, in turn, on fingerprint matching accuracy. The present work addresses this still under-researched topic on a large-scale database of operational data containing 10-print impressions of almost 16,000 subjects. The results provide further insight into the dependency between fingerprint quality and demographics, and show that there in fact exists a certain degree of performance variability in fingerprint-based recognition systems for different segments of the population. Based on the experimental evaluation, the work points out new observations based on data-driven evidence, provides plausible hypotheses to explain such observations, and concludes with potential follow-up actions that can help to reduce the observed fingerprint quality differences. In this way, the current paper can be considered a contribution to further increasing the algorithmic fairness and equality of biometric technology. "It's not the size of the dog in the fight, it's the size of the fight in the dog" - Mark Twain. However, with the exception of a few studies, this inconsistency in the recognition rates has been mainly observed on small-to-medium databases under laboratory conditions and, therefore, it is difficult to quantify to what [...] [...] demographic group, why do some segments of the population comprise more information than those of young children or elders? Why do each of the fingers (including the thumb) of the hand provide different accuracy performance in fingerprint recognition systems?


Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

arXiv.org Artificial Intelligence

Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, this issue has not been fully addressed in vision-language models like CLIP, particularly after fine-tuning. In this work, we demonstrate that existing prompt tuning methods usually lead to a trade-off of calibration between base and new classes: the cross-entropy loss in CoOp causes overconfidence in new classes by increasing textual label divergence, whereas the regularization of KgCoOp maintains the confidence level but results in underconfidence in base classes due to the improved accuracy. Inspired by these observations, we introduce Dynamic Outlier Regularization (DOR) to ensure confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on base and new classes. Large pre-trained vision-language models (VLMs) like CLIP (Radford et al., 2021) have become the de facto standard in today's zero-shot tasks including image recognition (Wortsman et al., 2022), open-vocabulary segmentation (Liang et al., 2023) and knowledge-augmented retrieval (Ming & Li, 2024). To transfer pre-trained CLIP knowledge to domain-specific downstream tasks efficiently, various parameter-efficient fine-tuning (PEFT) techniques including prompt tuning (Zhou et al., 2022b) and adapters (Gao et al., 2024) have been proposed. Despite the promising improvement in accuracy, reliability issues such as confidence calibration in fine-tuned VLMs have been largely overlooked. Left poorly understood, miscalibration in fine-tuned VLMs can exacerbate safety concerns in high-stakes applications like medical diagnosis and autonomous driving.
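To make over- and underconfidence concrete, the sketch below computes the expected calibration error (ECE), a commonly used calibration metric. The abstract above does not name the metric used in the paper, so treating ECE as the relevant measure is an assumption, as are the equal-width binning and the function name.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences -- predicted probability of the chosen class, in [0, 1]
    correct     -- True where the prediction matched the label
    n_bins      -- number of equal-width confidence bins (an assumption)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece

# Overconfident toy model: 90% confidence but only 50% accuracy.
ece_over = expected_calibration_error([0.9] * 4, [True, True, False, False])

# Well-calibrated toy model: full confidence and always correct.
ece_good = expected_calibration_error([1.0, 1.0], [True, True])
```

ECE uses the absolute gap, so it does not distinguish overconfidence (confidence above accuracy) from underconfidence (confidence below accuracy). Comparing average confidence with accuracy directly shows the direction of the miscalibration.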


HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

arXiv.org Artificial Intelligence

Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
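For readers unfamiliar with the metric, the character error rate is conventionally the edit distance between the recognized text and the reference, divided by the reference length. The sketch below follows that standard definition. The abstract does not describe the paper's exact evaluation procedure (for example, any text normalization), so this is an assumption about how the metric is computed, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate of hypothesis hyp against reference ref."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# One substituted character out of four reference characters.
example_cer = cer("abcd", "abxd")
```

Python strings are sequences of Unicode code points, so in this sketch an Arabic diacritic encoded as a separate combining mark counts as its own character. Whether diacritics are scored that way in the paper is not stated in the abstract.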


Knowledge Discovery using Unsupervised Cognition

arXiv.org Artificial Intelligence

Knowledge discovery is key to understanding and interpreting a dataset, as well as to finding the underlying relationships between its components. Unsupervised Cognition is a novel unsupervised learning algorithm that focuses on modelling the learned data. This paper presents three techniques to perform knowledge discovery over an already trained Unsupervised Cognition model. Specifically, we present a technique for pattern mining, a technique for feature selection based on the previous pattern mining technique, and a technique for dimensionality reduction based on the previous feature selection technique. The final goal is to distinguish between relevant and irrelevant features and use them to build a model from which to extract meaningful patterns. We evaluated our proposals with empirical experiments and found that they outperform the state of the art in knowledge discovery.


JaPOC: Japanese Post-OCR Correction Benchmark using Vouchers

arXiv.org Artificial Intelligence

In this paper, we create benchmarks and assess the effectiveness of error correction methods for Japanese vouchers in OCR (Optical Character Recognition) systems. Automated processing depends on correctly recognizing scanned voucher text, such as the company name on invoices. However, perfect recognition is difficult because of noise such as stamps, so it is crucial to correct erroneous OCR results. Yet no publicly available OCR error correction benchmarks exist for Japanese, and methods have not been adequately researched. In this study, we measured the text recognition accuracy of existing services on Japanese vouchers and developed a post-OCR correction benchmark. We then proposed simple baselines for error correction using language models and verified whether the proposed method could effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.


Gesture Recognition for Feedback Based Mixed Reality and Robotic Fabrication: A Case Study of the UnLog Tower

arXiv.org Artificial Intelligence

Mixed Reality (MR) platforms enable users to interact with three-dimensional holographic instructions during the assembly and fabrication of highly custom and parametric architectural constructions without the necessity of two-dimensional drawings. Previous MR fabrication projects have primarily relied on digital menus and custom buttons as the interface for user interaction with the MR environment. Despite this approach being widely adopted, it is limited in its ability to allow for direct human interaction with physical objects to modify fabrication instructions within the MR environment. This research integrates user interactions with physical objects through real-time gesture recognition as input to modify, update or generate new digital information enabling reciprocal stimuli between the physical and the virtual environment. Consequently, the digital environment is generative of the user's provided interaction with physical objects to allow seamless feedback in the fabrication process. This research investigates gesture recognition for feedback-based MR workflows for robotic fabrication, human assembly, and quality control in the construction of the UnLog Tower.