Pattern Recognition
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Cao, Haoyu, Bao, Changcun, Liu, Chaohu, Chen, Huang, Yin, Kun, Liu, Hao, Liu, Yinsong, Jiang, Deqiang, Sun, Xing
We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, including document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process of the visual tokens of interest, using a content-aware token merge module. This mechanism enables the model to pay more attention to regions of interest generated by the query decoder, improving the model's effectiveness and speeding up the decoding speed of the generative scheme. We also designed several pre-training tasks to enhance the understanding and local awareness of the model. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards enabling efficient and effective end-to-end document understanding.
The HAPPY HEDGEHOG Project
Bendel, Oliver, Graf, Emanuel, Bollier, Kevin
Semi-autonomous machines, autonomous machines and robots inhabit closed, semi-closed and open environments, more structured environments like the household or more unstructured environments like cultural landscapes or the wilderness. There they encounter domestic animals, farm animals, working animals, and wild animals. These creatures could be disturbed, displaced, injured, or killed by the machines. Within the context of machine ethics and social robotics, the School of Business FHNW developed several design studies and prototypes for animal-friendly machines, which can be understood as moral and social machines in the spirit of these disciplines. In 2019-20, a team led by the main author developed a prototype robot lawnmower that can recognize hedgehogs, interrupt its work for them and thus protect them. Every year many of these animals die worldwide because of traditional service robots. HAPPY HEDGEHOG (HHH), as the invention is called, could be a solution to this problem. This article begins by providing an introduction to the background. Then it focuses on navigation (where the machine comes across certain objects that need to be recognized) and thermal and image recognition (with the help of machine learning) of the machine. It also presents obvious weaknesses and possible improvements. The results could be relevant for an industry that wants to market their products as animal-friendly machines.
Interactive Multi Interest Process Pattern Discovery
Vazifehdoostirani, Mozhgan, Genga, Laura, Lu, Xixi, Verhoeven, Rob, van Laarhoven, Hanneke, Dijkman, Remco
Existing PPDMs typically are unsupervised and focus on a single dimension of interest, such as discovering frequent patterns. We present an interactive multi-interest-driven framework for process pattern discovery aimed at identifying patterns that are optimal according to a multi-dimensional analysis goal. The proposed approach is iterative and interactive, thus taking experts' knowledge into account during the discovery process. The paper focuses on a concrete analysis goal, i.e., deriving process patterns that affect the process outcome. We evaluate the approach on real-world event logs in both interactive and fully automated settings. The approach extracted meaningful patterns validated by expert knowledge in the interactive setting. Patterns extracted in the automated settings consistently led to prediction performance comparable to or better than patterns derived considering single-interest dimensions without requiring user-defined thresholds.
DynamicISP: Dynamically Controlled Image Signal Processor for Image Recognition
Yoshimura, Masakazu, Otsuka, Junji, Irie, Atsushi, Ohashi, Takeshi
Image Signal Processors (ISPs) play important roles in image recognition tasks as well as in the perceptual quality of captured images. In most cases, experts make a lot of effort to manually tune many parameters of ISPs, but the parameters are sub-optimal. In the literature, two types of techniques have been actively studied: a machine learning-based parameter tuning technique and a DNN-based ISP technique. The former is lightweight but lacks expressive power. The latter has expressive power, but the computational cost is too heavy on edge devices. To solve these problems, we propose "DynamicISP," which consists of multiple classical ISP functions and dynamically controls the parameters of each frame according to the recognition result of the previous frame. We show our method successfully controls the parameters of multiple ISP functions and achieves state-of-the-art accuracy with low computational cost in single and multi-category object detection tasks.
Enhancing Bloodstain Analysis Through AI-Based Segmentation: Leveraging Segment Anything Model for Crime Scene Investigation
Bloodstain pattern analysis plays a crucial role in crime scene investigations by providing valuable information through the study of unique blood patterns. Conventional image analysis methods, like Thresholding and Contrast, impose stringent requirements on the image background and is labor-intensive in the context of droplet image segmentation. The Segment Anything Model (SAM), a recently proposed method for extensive image recognition, is yet to be adequately assessed for its accuracy and efficiency on bloodstain image segmentation. This paper explores the application of pre-trained SAM and fine-tuned SAM on bloodstain image segmentation with diverse image backgrounds. Experiment results indicate that both pre-trained and fine-tuned SAM perform the bloodstain image segmentation task with satisfactory accuracy and efficiency, while fine-tuned SAM achieves an overall 2.2\% accuracy improvement than pre-trained SAM and 4.70\% acceleration in terms of speed for image recognition. Analysis of factors that influence bloodstain recognition is carried out. This research demonstrates the potential application of SAM on bloodstain image segmentation, showcasing the effectiveness of Artificial Intelligence application in criminology research. We release all code and demos at \url{https://github.com/Zdong104/Bloodstain_Analysis_Ai_Tool}
Graph Edit Distance Learning via Different Attention
Lv, Jiaxi, Zhang, Liang, Huang, Yi, Huang, Jiancheng, Chen, Shifeng
Recently, more and more research has focused on using Graph Neural Networks (GNN) to solve the Graph Similarity Computation problem (GSC), i.e., computing the Graph Edit Distance (GED) between two graphs. These methods treat GSC as an end-to-end learnable task, and the core of their architecture is the feature fusion modules to interact with the features of two graphs. Existing methods consider that graph-level embedding is difficult to capture the differences in local small structures between two graphs, and thus perform fine-grained feature fusion on node-level embedding can improve the accuracy, but leads to greater time and memory consumption in the training and inference phases. However, this paper proposes a novel graph-level fusion module Different Attention (DiffAtt), and demonstrates that graph-level fusion embeddings can substantially outperform these complex node-level fusion embeddings. We posit that the relative difference structure of the two graphs plays an important role in calculating their GED values. To this end, DiffAtt uses the difference between two graph-level embeddings as an attentional mechanism to capture the graph structural difference of the two graphs. Based on DiffAtt, a new GSC method, named Graph Edit Distance Learning via Different Attention (REDRAFT), is proposed, and experimental results demonstrate that REDRAFT achieves state-of-the-art performance in 23 out of 25 metrics in five benchmark datasets. Especially on MSE, it respectively outperforms the second best by 19.9%, 48.8%, 29.1%, 31.6%, and 2.2%. Moreover, we propose a quantitative test Remaining Subgraph Alignment Test (RESAT) to verify that among all graph-level fusion modules, the fusion embedding generated by DiffAtt can best capture the structural differences between two graphs.
Time-to-Pattern: Information-Theoretic Unsupervised Learning for Scalable Time Series Summarization
Ghods, Alireza, Hoang, Trong Nghia, Cook, Diane
Data summarization is the process of generating interpretable and representative subsets from a dataset. Existing time series summarization approaches often search for recurring subsequences using a set of manually devised similarity functions to summarize the data. However, such approaches are fraught with limitations stemming from an exhaustive search coupled with a heuristic definition of series similarity. Such approaches affect the diversity and comprehensiveness of the generated data summaries. To mitigate these limitations, we introduce an approach to time series summarization, called Time-to-Pattern (T2P), which aims to find a set of diverse patterns that together encode the most salient information, following the notion of minimum description length. T2P is implemented as a deep generative model that learns informative embeddings of the discrete time series on a latent space specifically designed to be interpretable. Our synthetic and real-world experiments reveal that T2P discovers informative patterns, even in noisy and complex settings. Furthermore, our results also showcase the improved performance of T2P over previous work in pattern diversity and processing scalability, which conclusively demonstrate the algorithm's effectiveness for time series summarization.
Nougat: Neural Optical Understanding for Academic Documents
Blecher, Lukas, Cucurull, Guillem, Scialom, Thomas, Stojnic, Robert
The majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format.
UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023
Zheng, Yu, Zhang, Yajun, Niu, Chuanying, Zhan, Yibin, Long, Yanhua, Xu, Dongxing
This report describes the UNISOUND submission for Track1 and Track2 of VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same system on Track 1 and Track 2, which is trained with only VoxCeleb2-dev. Large-scale ResNet and RepVGG architectures are developed for the challenge. We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in similarity score by a Consistency Measure Factor (CMF). CMF brings a huge performance boost in this challenge. Our final system is a fusion of six models and achieves the first place in Track 1 and second place in Track 2 of VoxSRC 2023. The minDCF of our submission is 0.0855 and the EER is 1.5880%.
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents
Rahman, Abdur, Ghosh, Arjun, Arora, Chetan
In this paper, we propose a novel approach to address the challenges of printed Urdu text recognition using high-resolution, multi-scale semantic feature extraction. Our proposed UTRNet architecture, a hybrid CNN-RNN model, demonstrates state-of-the-art performance on benchmark datasets. To address the limitations of previous works, which struggle to generalize to the intricacies of the Urdu script and the lack of sufficient annotated real-world data, we have introduced the UTRSet-Real, a large-scale annotated real-world dataset comprising over 11,000 lines and UTRSet-Synth, a synthetic dataset with 20,000 lines closely resembling real-world and made corrections to the ground truth of the existing IIITH dataset, making it a more reliable resource for future research. We also provide UrduDoc, a benchmark dataset for Urdu text line detection in scanned documents. Additionally, we have developed an online tool for end-to-end Urdu OCR from printed documents by integrating UTRNet with a text detection model. Our work not only addresses the current limitations of Urdu OCR but also paves the way for future research in this area and facilitates the continued advancement of Urdu OCR technology. The project page with source code, datasets, annotations, trained models, and online tool is available at abdur75648.github.io/UTRNet.