Pattern Recognition
PDZSeg: Adapting the Foundation Model for Dissection Zone Segmentation with Visual Prompts in Robot-assisted Endoscopic Submucosal Dissection
Xu, Mengya, Mo, Wenjin, Wang, Guankun, Gao, Huxin, Wang, An, Li, Zhen, Yang, Xiaoxiao, Ren, Hongliang
Endoscopic Submucosal Dissection (ESD) is a surgical procedure employed in the treatment of early-stage gastrointestinal cancers [1, 2]. This procedure entails a complex series of dissection maneuvers that require significant skill to determine the dissection zone. In traditional ESD, a transparent cap is employed to retract lesions, which can often obscure the view of the submucosal layer and lead to an incomplete dissection zone. Conversely, our robot-assisted ESD [3] offers better visualization of the submucosal layer, resulting in a more complete dissection zone by utilizing robotic instruments that enable independent control over retraction and dissection. Achieving successful submucosal dissection requires the careful excision of the lesion or mucosal layer along with the complete submucosal layer while ensuring that both the underlying muscular layer and the mucosal surface remain unharmed. If the electric knife inadvertently contacts tissue outside the designated dissection area, it can damage the muscle layer, increasing the risk of perforations. Such complications not only elevate the surgical risks but can also complicate the patient's recovery. Therefore, it is imperative to maintain a precise dissection zone during endoscopic procedures. Effective guidance can help ensure that surgeons are adept at identifying and adhering to appropriate dissection boundaries and enhance the overall safety of ESD.
On the Generalization of Handwritten Text Recognition Models
Garrido-Munoz, Carlos, Calvo-Zaragoza, Jorge
Recent advances in Handwritten Text Recognition (HTR) have led to significant reductions in transcription errors on standard benchmarks under the i.i.d. assumption, thus focusing on minimizing in-distribution (ID) errors. However, this assumption does not hold in real-world applications, which has motivated HTR research to explore Transfer Learning and Domain Adaptation techniques. In this work, we investigate the unaddressed limitations of HTR models in generalizing to out-of-distribution (OOD) data. We adopt the challenging setting of Domain Generalization, where models are expected to generalize to OOD data without any prior access. To this end, we analyze 336 OOD cases from eight state-of-the-art HTR models across seven widely used datasets, spanning five languages. Additionally, we study how HTR models leverage synthetic data to generalize. We reveal that the most significant factor for generalization lies in the textual divergence between domains, followed by visual divergence. We demonstrate that the error of HTR models in OOD scenarios can be reliably estimated, with discrepancies falling below 10 points in 70% of cases. We identify the underlying limitations of HTR models, laying the foundation for future research to address this challenge.
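The abstract above speaks of transcription errors measured in points but does not name the metric. A minimal sketch, assuming the common character error rate (CER, edit distance divided by reference length, reported as a percentage), might look like the following. The function names `levenshtein` and `cer` are illustrative and not taken from the paper.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate in percentage points over a set of line transcriptions."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return 100.0 * total_edits / total_chars


# Assumed toy transcriptions: one substituted character out of four.
print(cer(["abcd"], ["abxd"]))
```

Under this assumed metric, a discrepancy below 10 points would mean that an estimated error and the measured error differ by less than 10 percentage points.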
ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images
Naik, Prithviraj Purushottam, Agarwal, Rohit
Multimodal search has revolutionized the fashion industry, providing a seamless and intuitive way for users to discover and explore fashion items. Based on their preferences, style, or specific attributes, users can search for products by combining text and image information. Text-to-image searches enable users to find visually similar items or describe products using natural language. This paper presents an innovative approach called ENCLIP, for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model, specifically in Multimodal Search targeted towards the domain of fashion intelligence. This method focuses on addressing the challenges posed by limited data availability and low-quality images. This paper proposes an algorithm that involves training and ensembling multiple instances of the CLIP model, and leveraging clustering techniques to group similar images together. The experimental findings presented in this study provide evidence of the effectiveness of the methodology. This approach unlocks the potential of CLIP in the domain of fashion intelligence, where data scarcity and image quality issues are prevalent. Overall, the ENCLIP method represents a valuable contribution to the field of fashion intelligence and provides a practical solution for optimizing the CLIP model in scenarios with limited data and low-quality images.
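The abstract above does not say how the CLIP instances are combined. One simple possibility, shown here purely as an assumption, is to average the text-to-image cosine similarities produced by each model and rank images by the mean score. The embeddings below are toy vectors rather than real CLIP outputs, and the clustering step is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ensemble_rank(query_embs, image_embs):
    """Rank image ids by similarity averaged over several models.

    query_embs: one query embedding per model.
    image_embs: one dict per model, mapping image id -> embedding.
    """
    ids = list(image_embs[0])
    scores = {
        image_id: sum(cosine(q, embs[image_id])
                      for q, embs in zip(query_embs, image_embs)) / len(query_embs)
        for image_id in ids
    }
    return sorted(ids, key=lambda image_id: scores[image_id], reverse=True)


# Two hypothetical models, each with its own embedding space.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [
    {"dress": [1.0, 0.0], "shoe": [0.0, 1.0], "bag": [0.6, 0.8]},
    {"dress": [0.0, 1.0], "shoe": [1.0, 0.0], "bag": [0.8, 0.6]},
]
print(ensemble_rank(query_embs, image_embs))
```

Averaging scores keeps each model's embedding space separate, so models of different dimensionality can still be combined.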
Proceedings of the 6th International Workshop on Reading Music Systems
Calvo-Zaragoza, Jorge, Pacha, Alexander, Shatri, Elona
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners who could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 6th International Workshop on Reading Music Systems, held online on November 22nd, 2024.
VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models
Sartori, Camilo Chacón, Blum, Christian, Bistaffa, Filippo
The fast advancement of Large Vision-Language Models (LVLMs) has shown immense potential. These models are increasingly capable of tackling abstract visual tasks. Geometric structures, particularly graphs with their inherent flexibility and complexity, serve as an excellent benchmark for evaluating these models' predictive capabilities. While human observers can readily identify subtle visual details and perform accurate analyses, our investigation reveals that state-of-the-art LVLMs exhibit consistent limitations in specific visual graph scenarios, especially when confronted with stylistic variations. In response to these challenges, we introduce VisGraphVar (Visual Graph Variability), a customizable benchmark generator able to produce graph images for seven distinct task categories (detection, classification, segmentation, pattern recognition, link prediction, reasoning, matching), designed to systematically evaluate the strengths and limitations of individual LVLMs. We use VisGraphVar to produce 990 graph images and evaluate six LVLMs, employing two distinct prompting strategies, namely zero-shot and chain-of-thought. The findings demonstrate that variations in visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections, such as overlapping nodes, significantly affect model performance. This research emphasizes the importance of a comprehensive evaluation across graph-related tasks, extending beyond reasoning alone. VisGraphVar offers valuable insights to guide the development of more reliable and robust systems capable of performing advanced visual graph analysis.
A comprehensive survey of oracle character recognition: challenges, benchmarks, and beyond
Li, Jing, Chi, Xueke, Wang, Qiufeng, Wang, Dahan, Huang, Kaizhu, Liu, Yongge, Liu, Cheng-lin
Oracle character recognition, an analysis of ancient Chinese inscriptions found on oracle bones, has become a pivotal field intersecting archaeology, paleography, and historical cultural studies. Traditional methods of oracle character recognition have relied heavily on manual interpretation by experts, which is not only labor-intensive but also limits broader accessibility to the general public. With recent breakthroughs in pattern recognition and deep learning, there is a growing movement towards the automation of oracle character recognition (OrCR), showing considerable promise in tackling the challenges inherent to these ancient scripts. However, a comprehensive understanding of OrCR remains elusive. Therefore, this paper presents a systematic and structured survey of the current landscape of OrCR research. We commence by identifying and analyzing the key challenges of OrCR. Then, we provide an overview of the primary benchmark datasets and digital resources available for OrCR. A review of contemporary research methodologies follows, in which their respective efficacies, limitations, and applicability to the complex nature of oracle characters are critically highlighted and examined. Additionally, our review extends to ancillary tasks associated with OrCR across diverse disciplines, providing a broad-spectrum analysis of its applications. We conclude with a forward-looking perspective, proposing potential avenues for future investigations that could yield significant advancements in the field.
TS-ACL: A Time Series Analytic Continual Learning Framework for Privacy-Preserving and Class-Incremental Pattern Recognition
Fan, Kejia, Li, Jiaxu, Lai, Songning, Lv, Linpu, Liu, Anfeng, Tang, Jianheng, Song, Houbing Herbert, Yue, Yutao, Zhuang, Huiping
Class-incremental pattern recognition in time series is a significant problem, which aims to learn from continually arriving streaming data examples with incremental classes. A primary challenge in this problem is catastrophic forgetting, where the incorporation of new data samples causes the models to forget previously learned information. While replay-based methods achieve promising results by storing historical data to address catastrophic forgetting, they do so at the cost of data privacy. On the other hand, exemplar-free methods preserve privacy but suffer from significantly decreased accuracy. To address these challenges, we propose TS-ACL, a novel Time Series Analytic Continual Learning framework for privacy-preserving and class-incremental pattern recognition. Identifying gradient descent as the root of catastrophic forgetting, TS-ACL transforms each update of the model into a gradient-free analytical learning process with a closed-form solution. By leveraging a pre-trained frozen encoder for embedding extraction, TS-ACL only needs to recursively update an analytic classifier in a lightweight manner. This way, TS-ACL simultaneously achieves non-forgetting, privacy preservation, and lightweight consumption, making it widely suitable for various applications, particularly in edge computing scenarios. Extensive experiments on five benchmark datasets confirm the superior and robust performance of TS-ACL compared to existing advanced methods. Code is available at https://github.com/asdasdczxczq/TS-ACL.
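The abstract above describes a gradient-free, recursively updated classifier with a closed-form solution but gives no equations. The sketch below is one way such an update could look, assuming ridge regression on one-hot targets maintained by recursive least squares (a Sherman-Morrison rank-one update). It is not the authors' implementation: the class name `AnalyticClassifier`, the regularizer `gamma`, and the toy embeddings, which stand in for the output of a frozen encoder, are all illustrative.

```python
class AnalyticClassifier:
    """Ridge-regression classifier updated recursively, one embedding at a time."""

    def __init__(self, dim, gamma=1.0):
        self.dim = dim
        # R tracks the inverse of (X^T X + gamma * I); it starts at I / gamma.
        self.R = [[(1.0 / gamma if i == j else 0.0) for j in range(dim)]
                  for i in range(dim)]
        self.W = [[] for _ in range(dim)]  # dim x num_classes weight matrix
        self.num_classes = 0

    def _grow(self, num_classes):
        # Newly seen classes get zero-initialised weight columns.
        for row in self.W:
            row.extend([0.0] * (num_classes - len(row)))
        self.num_classes = max(self.num_classes, num_classes)

    def scores(self, x):
        return [sum(x[i] * self.W[i][c] for i in range(self.dim))
                for c in range(self.num_classes)]

    def update(self, x, label):
        """Absorb one (embedding, label) pair without storing it afterwards."""
        if label >= self.num_classes:
            self._grow(label + 1)
        Rx = [sum(self.R[i][j] * x[j] for j in range(self.dim))
              for i in range(self.dim)]
        denom = 1.0 + sum(x[i] * Rx[i] for i in range(self.dim))
        # Sherman-Morrison rank-one update of the inverse (R is symmetric).
        for i in range(self.dim):
            for j in range(self.dim):
                self.R[i][j] -= Rx[i] * Rx[j] / denom
        gain = [v / denom for v in Rx]  # equals the updated R applied to x
        pred = self.scores(x)           # prediction before the weight update
        for c in range(self.num_classes):
            err = (1.0 if c == label else 0.0) - pred[c]
            for i in range(self.dim):
                self.W[i][c] += gain[i] * err

    def predict(self, x):
        s = self.scores(x)
        return max(range(self.num_classes), key=lambda c: s[c])


# Toy embeddings (assumed encoder outputs); classes 0 and 1 arrive first, class 2 later.
phase_1 = [([1.0, 0.1, 0.0], 0), ([0.9, 0.0, 0.1], 0),
           ([0.1, 1.0, 0.0], 1), ([0.0, 0.9, 0.1], 1)]
phase_2 = [([0.0, 0.1, 1.0], 2), ([0.1, 0.0, 0.9], 2)]

clf = AnalyticClassifier(dim=3, gamma=0.1)
for x, y in phase_1:
    clf.update(x, y)
for x, y in phase_2:
    clf.update(x, y)
print([clf.predict(x) for x, _ in phase_1 + phase_2])
```

In this sketch the recursive update reproduces the batch ridge solution exactly, so the weights after both phases match what joint training on all samples would give, and no past sample needs to be retained.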
Introduction to AI Safety, Ethics, and Society
Artificial Intelligence is rapidly embedding itself within militaries, economies, and societies, reshaping their very foundations. Given the depth and breadth of its consequences, it has never been more pressing to understand how to ensure that AI systems are safe, ethical, and have a positive societal impact. This book aims to provide a comprehensive approach to understanding AI risk. Our primary goals include consolidating fragmented knowledge on AI risk, increasing the precision of core ideas, and reducing barriers to entry by making content simpler and more comprehensible. The book has been designed to be accessible to readers from diverse backgrounds. You do not need to have studied AI, philosophy, or other such topics. The content is skimmable and somewhat modular, so that you can choose which chapters to read. We introduce mathematical formulas in a few places to specify claims more precisely, but readers should be able to understand the main points without these.
GraphRPM: Risk Pattern Mining on Industrial Large Attributed Graphs
Tian, Sheng, Zeng, Xintan, Hu, Yifei, Wang, Baokun, Liu, Yongchao, Jin, Yue, Meng, Changhua, Hong, Chuntao, Zhang, Tianyi, Wang, Weiqiang
Graph-based patterns are extensively employed and favored by practitioners within industrial companies due to their capacity to represent the behavioral attributes and topological relationships among users, thereby offering enhanced interpretability in comparison to black-box models commonly utilized for classification and recognition tasks. For instance, within the scenario of transaction risk management, a graph pattern that is characteristic of a particular risk category can be readily employed to discern transactions fraught with risk, delineate networks of criminal activity, or investigate the methodologies employed by fraudsters. Nonetheless, graph data in industrial settings is often characterized by its massive scale, encompassing data sets with millions or even billions of nodes, making the manual extraction of graph patterns not only labor-intensive but also dependent on specialized knowledge in particular domains of risk. Moreover, existing methodologies for mining graph patterns encounter significant obstacles when tasked with analyzing large-scale attributed graphs. In this work, we introduce GraphRPM, an industry-purpose parallel and distributed risk pattern mining framework on large attributed graphs. The framework incorporates a novel edge-involved graph isomorphism network (EGIN) alongside optimized operations for parallel graph computation, which collectively contribute to a considerable reduction in computational complexity and resource expenditure. Moreover, the intelligent filtration of efficacious risky graph patterns is facilitated by the proposed evaluation metrics. Comprehensive experimental evaluations conducted on real-world datasets of varying sizes substantiate the capability of GraphRPM to adeptly address the challenges inherent in mining patterns from large-scale industrial attributed graphs, thereby underscoring its substantial value for industrial deployment.
Keywords: Graph isomorphism network; Graph neural network; Large-scale attributed graphs; Risk pattern mining.
NeuReg: Domain-invariant 3D Image Registration on Human and Mouse Brains
Medical brain imaging relies heavily on image registration to accurately curate structural boundaries of brain features for various healthcare applications. Deep learning models have shown remarkable performance in image registration in recent years. Still, they often struggle to handle the diversity of 3D brain volumes, challenged by their structural and contrastive variations and their imaging domains. In this work, we present NeuReg, a Neuro-inspired 3D image registration architecture with the feature of domain invariance. NeuReg generates domain-agnostic representations of imaging features and incorporates a shifting window-based Swin Transformer block as the encoder. This enables our model to capture the variations across brain imaging modalities and species. We demonstrate a new benchmark in multi-domain publicly available datasets comprising human and mouse 3D brain volumes. Extensive experiments reveal that our model (NeuReg) outperforms the existing baseline deep learning-based image registration models and provides a high-performance boost on cross-domain datasets, where models are trained on a 'source-only' domain and tested on completely 'unseen' target domains. Our work establishes a new state-of-the-art for domain-agnostic 3D brain image registration, underpinned by a Neuro-inspired Transformer-based architecture.