Pattern Recognition
ANSR-DT: An Adaptive Neuro-Symbolic Learning and Reasoning Framework for Digital Twins
Hakim, Safayat Bin, Adil, Muhammad, Velasquez, Alvaro, Song, Houbing Herbert
In this paper, we propose an Adaptive Neuro-Symbolic Learning Framework for digital twin technology called ``ANSR-DT." Our approach combines pattern recognition algorithms with reinforcement learning and symbolic reasoning to enable real-time learning and adaptive intelligence. This integration enhances the understanding of the environment and promotes continuous learning, leading to better and more effective decision-making in real-time for applications that require human-machine collaboration. We evaluated the \textit{ANSR-DT} framework for its ability to learn and adapt to dynamic patterns, observing significant improvements in decision accuracy, reliability, and interpretability when compared to existing state-of-the-art methods. However, challenges still exist in extracting and integrating symbolic rules in complex environments, which limits the full potential of our framework in heterogeneous settings. Moreover, our ongoing research aims to address this issue in the future by ensuring seamless integration of neural models at large. In addition, our open-source implementation promotes reproducibility and encourages future research to build on our foundational work.
Anonymization of Documents for Law Enforcement with Machine Learning
Eberhardinger, Manuel, Takenaka, Patrick, Grieรhaber, Daniel, Maucher, Johannes
The steadily increasing utilization of data-driven methods and approaches in areas that handle sensitive personal information such as in law enforcement mandates an ever increasing effort in these institutions to comply with data protection guidelines. In this work, we present a system for automatically anonymizing images of scanned documents, reducing manual effort while ensuring data protection compliance. Our method considers the viability of further forensic processing after anonymization by minimizing automatically redacted areas by combining automatic detection of sensitive regions with knowledge from a manually anonymized reference document. Using a self-supervised image model for instance retrieval of the reference document, our approach requires only one anonymized example to efficiently redact all documents of the same type, significantly reducing processing time. We show that our approach outperforms both a purely automatic redaction system and also a naive copy-paste scheme of the reference anonymization to other documents on a hand-crafted dataset of ground truth redactions.
InstructOCR: Instruction Boosting Scene Text Spotting
Duan, Chen, Jiang, Qianyi, Fu, Pei, Chen, Jiamin, Li, Shengxi, Wang, Zining, Guo, Shan, Luo, Junfeng
In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.
eKalibr: Dynamic Intrinsic Calibration for Event Cameras From First Principles of Events
Chen, Shuolong, Li, Xingxing, Yuan, Liu, Liu, Ziao
The bio-inspired event camera has garnered extensive research attention in recent years, owing to its significant potential derived from its high dynamic range and low latency characteristics. Similar to the standard camera, the event camera requires precise intrinsic calibration to facilitate further high-level visual applications, such as pose estimation and mapping. While several calibration methods for event cameras have been proposed, most of them are either (i) engineering-driven, heavily relying on conventional image-based calibration pipelines, or (ii) inconvenient, requiring complex instrumentation. To this end, we propose an accurate and convenient intrinsic calibration method for event cameras, named eKalibr, which builds upon a carefully designed event-based circle grid pattern recognition algorithm. To extract target patterns from events, we perform event-based normal flow estimation to identify potential events generated by circle edges, and cluster them spatially. Subsequently, event clusters associated with the same grid circles are matched and grouped using normal flows, for subsequent time-varying ellipse estimation. Fitted ellipse centers are time-synchronized, for final grid pattern recognition. We conducted extensive experiments to evaluate the performance of eKalibr in terms of pattern extraction and intrinsic calibration. The implementation of eKalibr is open-sourced at (https://github.com/Unsigned-Long/eKalibr) to benefit the research community.
Geometry Restoration and Dewarping of Camera-Captured Document Images
Istomin, Valery, Pereziabov, Oleg, Afanasyev, Ilya
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
Deep Learning in Palmprint Recognition-A Comprehensive Survey
Gao, Chengrui, Yang, Ziyuan, Jia, Wei, Leng, Lu, Zhang, Bob, Teoh, Andrew Beng Jin
Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers' prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition-often grounded in traditional methodologies-there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Fu, Ling, Yang, Biao, Kuang, Zhebin, Song, Jiajun, Li, Yuzhe, Zhu, Linghao, Luo, Qidi, Wang, Xinyu, Lu, Hao, Huang, Mingxin, Li, Zhang, Tang, Guozhi, Shan, Bin, Lin, Chunhui, Liu, Qi, Wu, Binghong, Feng, Hao, Liu, Hao, Huang, Can, Tang, Jingqun, Chen, Wei, Jin, Lianwen, Liu, Yuliang, Bai, Xiang
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.
VoxVietnam: a Large-Scale Multi-Genre Dataset for Vietnamese Speaker Recognition
Vu, Hoang Long, Dat, Phuong Tuan, Nhi, Pham Thao, Hao, Nguyen Song, Trang, Nguyen Thi Thu
Recent research in speaker recognition aims to address vulnerabilities due to variations between enrolment and test utterances, particularly in the multi-genre phenomenon where the utterances are in different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving studies in multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition with over 187,000 utterances from 1,406 speakers and an automated pipeline to construct a dataset on a large scale from public sources. Our experiments show the challenges posed by the multi-genre phenomenon to models trained on a single-genre dataset, and demonstrate a significant increase in performance upon incorporating the VoxVietnam into the training process. Our experiments are conducted to study the challenges of the multi-genre phenomenon in speaker recognition and the performance gain when the proposed dataset is used for multi-genre training.
Dual Channel Multi-Attention in ViT for Biometric Authentication using Forehead Subcutaneous Vein Pattern and Periocular Pattern
Sharma, Arun K., Bhattacharya, Shubhobrata, Reza, Motahar
Traditional biometric systems, like face and fingerprint recognition, have encountered significant setbacks due to wearing face masks and hygiene concerns. To meet the challenges of the partially covered face due to face masks and hygiene concerns of fingerprint recognition, this paper proposes a novel dual-channel multi-attention Vision Transformer (ViT) framework for biometric authentication using forehead subcutaneous vein patterns and periocular patterns, offering a promising alternative to traditional methods, capable of performing well even with face masks and without any physical touch. The proposed framework leverages a dual-channel ViT architecture, designed to handle two distinct biometric traits. It can capture long-range dependencies of independent features from the vein and periocular patterns. A custom classifier is then designed to integrate the independently extracted features, producing a final class prediction. The performance of the proposed algorithm was rigorously evaluated using the Forehead Subcutaneous Vein Pattern and Periocular Biometric Pattern (FSVP-PBP) database. The results demonstrated the superiority of the algorithm over state-of-the-art methods, achieving remarkable classification accuracy of $99.3 \pm 0.02\%$ with the combined vein and periocular patterns.
HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis
Hamdan, Mohammed, Rahiche, Abderrahmane, Cheriet, Mohamed
Handwritten document recognition (HDR) is one of the most challenging tasks in the field of computer vision, due to the various writing styles and complex layouts inherent in handwritten texts. Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis, and struggled to integrate the two processes effectively. This paper introduces HAND (Hierarchical Attention Network for Multi-Scale Document), a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks. Our model's key components include an advanced convolutional encoder integrating Gated Depth-wise Separable and Octave Convolutions for robust feature extraction, a Multi-Scale Adaptive Processing (MSAP) framework that dynamically adjusts to document complexity and a hierarchical attention decoder with memory-augmented and sparse attention mechanisms. These components enable our model to scale effectively from single-line to triple-column pages while maintaining computational efficiency. Additionally, HAND adopts curriculum learning across five complexity levels. To improve the recognition accuracy of complex ancient manuscripts, we fine-tune and integrate a Domain-Adaptive Pre-trained mT5 model for post-processing refinement. Extensive evaluations on the READ 2016 dataset demonstrate the superior performance of HAND, achieving up to 59.8% reduction in CER for line-level recognition and 31.2% for page-level recognition compared to state-of-the-art methods. The model also maintains a compact size of 5.60M parameters while establishing new benchmarks in both text recognition and layout analysis. Source code and pre-trained models are available at : https://github.com/MHHamdan/HAND.