Pattern Recognition
Agile gesture recognition for low-power applications: customisation for generalisation
Liu, Ying, Guo, Liucheng, Makarovc, Valeri A., Gorbana, Alexander, Mirkesa, Evgeny, Tyukin, Ivan Y.
Automated hand gesture recognition has long been a focal point in the AI community. Traditionally, research in this field has predominantly focused on scenarios with access to a continuous flow of hand's images. This focus has been driven by the widespread use of cameras and the abundant availability of image data. However, there is an increasing demand for gesture recognition technologies that operate on low-power sensor devices. This is due to the rising concerns for data leakage and end-user privacy, as well as the limited battery capacity and the computing power in low-cost devices. Moreover, the challenge in data collection for individually designed hardware also hinders the generalisation of a gesture recognition model. In this study, we unveil a novel methodology for pattern recognition systems using adaptive and agile error correction, designed to enhance the performance of legacy gesture recognition models on devices with limited battery capacity and computing power. This system comprises a compact Support Vector Machine as the base model for live gesture recognition. Additionally, it features an adaptive agile error corrector that employs few-shot learning within the feature space induced by high-dimensional kernel mappings. The error corrector can be customised for each user, allowing for dynamic adjustments to the gesture prediction based on their movement patterns while maintaining the agile performance of its base model on a low-cost and low-power micro-controller. This proposed system is distinguished by its compact size, rapid processing speed, and low power consumption, making it ideal for a wide range of embedded systems.
Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
Cui, Jiequan, Zhu, Beier, Wen, Xin, Qi, Xiaojuan, Yu, Bei, Zhang, Hanwang
In this paper, we present an empirical study on image recognition fairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We experimentally demonstrate that classes are not equal and the fairness issue is prevalent for image classification models across various datasets, network architectures, and model capacities. Moreover, several intriguing properties of fairness are identified. First, the unfairness lies in problematic representation rather than classifier bias. Second, with the proposed concept of Model Prediction Bias, we investigate the origins of problematic representation during optimization. Our findings reveal that models tend to exhibit greater prediction biases for classes that are more challenging to recognize. It means that more other classes will be confused with harder classes. Then the False Positives (FPs) will dominate the learning in optimization, thus leading to their poor accuracy. Further, we conclude that data augmentation and representation learning algorithms improve overall performance by promoting fairness to some degree in image classification. The Code is available at https://github.com/dvlab-research/Parametric-Contrastive-Learning.
General Purpose Image Encoder DINOv2 for Medical Image Registration
Song, Xinrui, Xu, Xuanang, Yan, Pingkun
Existing medical image registration algorithms rely on either dataset specific training or local texture-based features to align images. The former cannot be reliably implemented without large modality-specific training datasets, while the latter lacks global semantics thus could be easily trapped at local minima. In this paper, we present a training-free deformable image registration method, DINO-Reg, leveraging a general purpose image encoder DINOv2 for image feature extraction. The DINOv2 encoder was trained using the ImageNet data containing natural images. We used the pretrained DINOv2 without any finetuning. Our method feeds the DINOv2 encoded features into a discrete optimizer to find the optimal deformable registration field. We conducted a series of experiments to understand the behavior and role of such a general purpose image encoder in the application of image registration. Combined with handcrafted features, our method won the first place in the recent OncoReg Challenge. To our knowledge, this is the first application of general vision foundation models in medical image registration.
Representation Learning for Frequent Subgraph Mining
Ying, Rex, Fu, Tianyu, Wang, Andrew, You, Jiaxuan, Wang, Yu, Leskovec, Jure
Identifying frequent subgraphs, also called network motifs, is crucial in analyzing and predicting properties of real-world networks. However, finding large commonly-occurring motifs remains a challenging problem not only due to its NP-hard subroutine of subgraph counting, but also the exponential growth of the number of possible subgraphs patterns. Here we present Subgraph Pattern Miner (SPMiner), a novel neural approach for approximately finding frequent subgraphs in a large target graph. SPMiner combines graph neural networks, order embedding space, and an efficient search strategy to identify network subgraph patterns that appear most frequently in the target graph. SPMiner first decomposes the target graph into many overlapping subgraphs and then encodes each subgraph into an order embedding space. SPMiner then uses a monotonic walk in the order embedding space to identify frequent motifs. Compared to existing approaches and possible neural alternatives, SPMiner is more accurate, faster, and more scalable. For 5- and 6-node motifs, we show that SPMiner can almost perfectly identify the most frequent motifs while being 100x faster than exact enumeration methods. In addition, SPMiner can also reliably identify frequent 10-node motifs, which is well beyond the size limit of exact enumeration approaches. And last, we show that SPMiner can find large up to 20 node motifs with 10-100x higher frequency than those found by current approximate methods.
WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition
Zhu, Lianghui, Zhou, Junwei, Liu, Yan, Hao, Xin, Liu, Wenyu, Wang, Xinggang
Weakly supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM and solves the weakly-supervised object detection (WSOD) and segmentation by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations in traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses the SAM's problems of requiring prompts and category unawareness for automatic object detection and segmentation. Our results indicate that WeakSAM significantly surpasses previous state-of-the-art methods in WSOD and WSIS benchmarks with large margins, i.e. average improvements of 7.4% and 8.5%, respectively. The code is available at \url{https://github.com/hustvl/WeakSAM}.
ICON: Improving Inter-Report Consistency of Radiology Report Generation via Lesion-aware Mix-up Augmentation
Hou, Wenjun, Cheng, Yi, Xu, Kaishuai, Hu, Yan, Li, Wenjie, Liu, Jiang
Previous research on radiology report generation has made significant progress in terms of increasing the clinical accuracy of generated reports. In this paper, we emphasize another crucial quality that it should possess, i.e., inter-report consistency, which refers to the capability of generating consistent reports for semantically equivalent radiographs. This quality is even of greater significance than the overall report accuracy in terms of ensuring the system's credibility, as a system prone to providing conflicting results would severely erode users' trust. Regrettably, existing approaches struggle to maintain inter-report consistency, exhibiting biases towards common patterns and susceptibility to lesion variants. To address this issue, we propose ICON, which improves the inter-report consistency of radiology report generation. Aiming at enhancing the system's ability to capture the similarities in semantically equivalent lesions, our approach involves first extracting lesions from input images and examining their characteristics. Then, we introduce a lesion-aware mix-up augmentation technique to ensure that the representations of the semantically equivalent lesions align with the same attributes, by linearly interpolating them during the training phase. Extensive experiments on three publicly available chest X-ray datasets verify the effectiveness of our approach, both in terms of improving the consistency and accuracy of the generated reports.
Parallel Structures in Pre-training Data Yield In-Context Learning
Chen, Yanda, Zhao, Chen, Yu, Zhou, McKeown, Kathleen, He, He
Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task with only a few examples given in the prompt without any parameter update. However, it is unclear where this capability comes from as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study what patterns of the pre-training data contribute to ICL. We find that LMs' ICL ability depends on $\textit{parallel structures}$ in the pre-training data -- pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and conduct ablation experiments to study their effect on ICL. We show that removing parallel structures in the pre-training data reduces LMs' ICL accuracy by 51% (vs 2% from random ablation). This drop persists even when excluding common patterns such as n-gram repetitions and long-range dependency, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.
A real-time Artificial Intelligence system for learning Sign Language
A primary challenge for the deaf and hearing-impaired community stems from the communication gap with the hearing society, which can greatly impact their daily lives and result in social exclusion. To foster inclusivity in society, our endeavor focuses on developing a cost-effective, resource-efficient, and open technology based on Artificial Intelligence, designed to assist people in learning and using Sign Language for communication. The analysis presented in this research paper intends to enrich the recent academic scientific literature on Sign Language solutions based on Artificial Intelligence, with a particular focus on American Sign Language (ASL). This research has yielded promising preliminary results and serves as a basis for further development.
Structural Risk Minimization for Character Recognition
The method of Structural Risk Minimization refers to tuning the capacity of the classifier to the available amount of training data. This capac(cid:173) ity is influenced by several factors, including: (1) properties of the input space, (2) nature and structure of the classifier, and (3) learning algorithm. Actions based on these three factors are combined here to control the ca(cid:173) pacity of linear classifiers and improve generalization on the problem of handwritten digit recognition. A common way of training a given classifier is to adjust the parameters w in the classification function F( x, w) to minimize the training error Etrain, i.e. the fre(cid:173) quency of errors on a set of p training examples. Etrain estimates the expected risk based on the empirical data provided by the p available examples.
Evaluating DTW Measures via a Synthesis Framework for Time-Series Data
Rajput, Kishansingh, Nguyen, Duong Binh, Chen, Guoning
Time-series data originate from various applications that describe specific observations or quantities of interest over time. Their analysis often involves the comparison across different time-series data sequences, which in turn requires the alignment of these sequences. Dynamic Time Warping (DTW) is the standard approach to achieve an optimal alignment between two temporal signals. Different variations of DTW have been proposed to address various needs for signal alignment or classifications. However, a comprehensive evaluation of their performance in these time-series data processing tasks is lacking. Most DTW measures perform well on certain types of time-series data without a clear explanation of the reason. To address that, we propose a synthesis framework to model the variation between two time-series data sequences for comparison. Our synthesis framework can produce a realistic initial signal and deform it with controllable variations that mimic real-world scenarios. With this synthesis framework, we produce a large number of time-series sequence pairs with different but known variations, which are used to assess the performance of a number of well-known DTW measures for the tasks of alignment and classification. We report their performance on different variations and suggest the proper DTW measure to use based on the type of variations between two time-series sequences. This is the first time such a guideline is presented for selecting a proper DTW measure. To validate our conclusion, we apply our findings to real-world applications, i.e., the detection of the formation top for the oil and gas industry and the pattern search in streamlines for flow visualization.