Pattern Recognition
GeoLayoutLM: Geometric Pre-training for Visual Information Extraction
Luo, Chuwei, Cheng, Changxu, Zheng, Qi, Yao, Cong
Visual information extraction (VIE) plays an important role in Document Intelligence. Generally, it is divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most of the existing models learn the geometric representation in an implicit way, which has been found insufficient for the RE task since geometric information is especially crucial for RE. Moreover, we reveal another factor that limits the performance of RE lies in the objective gap between the pre-training phase and the fine-tuning phase for RE. To tackle these issues, we propose in this paper a multi-modal framework, named GeoLayoutLM, for VIE. GeoLayoutLM explicitly models the geometric relations in pre-training, which we call geometric pre-training. Geometric pre-training is achieved by three specially designed geometry-related pre-training tasks. Additionally, novel relation heads, which are pre-trained by the geometric pre-training tasks and fine-tuned for RE, are elaborately designed to enrich and enhance the feature representation. According to extensive experiments on standard VIE benchmarks, GeoLayoutLM achieves highly competitive scores in the SER task and significantly outperforms the previous state-of-the-arts for RE (\eg, the F1 score of RE on FUNSD is boosted from 80.35\% to 89.45\%). The code and models are publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/GeoLayoutLM
Enhancing Artificial intelligence Policies with Fusion and Forecasting: Insights from Indian Patents Using Network Analysis
Kuniyil, Akhil, Kshitij, Avinash, Mandal, Kasturi
Abstract-- This paper presents a study of the interconnectivity and interdependence of various Artificial intelligence (AI) technologies through the use of centrality measures, clustering coefficients, and degree of fusion measures. By analyzing the technologies through different time windows and quantifying their importance, we have revealed important insights into the crucial components shaping the AI landscape and the maturity level of the domain. The results of this study have significant implications for future development and advancements in artificial intelligence and provide a clear understanding of key technology areas of fusion. Furthermore, this paper contributes to AI public policy research by offering a data-driven perspective on the current state and future direction of the field. However, it is important to acknowledge the limitations of this research and call for further studies to build on these results. With these findings, we hope to inform and guide future research in the field of AI, contributing to its continued growth and success. AI has the potential to revolutionize a wide range of industries from healthcare and finance to transportation and agriculture [1] last but not least environmental hard and societal changes [2]. With the ability to analyze vast amounts of data and automate tasks that were once exclusively performed by humans, AI is reshaping the way we live and work. Given the potential of AI, it is essential to study and understand its applications, fusion of technologies, changes over the years in the domain as well as societal impacts. This understanding is crucial for policymakers, as they must develop effective policies that keep pace with the rapid advancement of AI technology. Moreover, the study of AI is also relevant for individuals, businesses, and organizations, as they must be prepared to adapt to the changes brought about by AI. The study of AI is crucial in today's era to unlock the full potential of this groundbreaking technology and to address the challenges and opportunities it presents.
Constraining Representations Yields Models That Know What They Don't Know
Monteiro, Joao, Rodriguez, Pau, Noel, Pierre-Andre, Laradji, Issam, Vazquez, David
A well-known failure mode of neural networks is that they may confidently return erroneous predictions. Such unsafe behaviour is particularly frequent when the use case slightly differs from the training context, and/or in the presence of an adversary. This work presents a novel direction to address these issues in a broad, general manner: imposing class-aware constraints on a model's internal activation patterns. Specifically, we assign to each class a unique, fixed, randomly-generated binary vector - hereafter called class code - and train the model so that its cross-depths activation patterns predict the appropriate class code according to the input sample's class. The resulting predictors are dubbed total activation classifiers (TAC), and TACs may either be trained from scratch, or used with negligible cost as a thin add-on on top of a frozen, pre-trained neural network. In the add-on case, the original neural network's inference head is completely unaffected (so its accuracy remains the same) but we now have the option to use TAC's own confidence and prediction when determining which course of action to take in an hypothetical production workflow. In particular, we show that TAC strictly improves the value derived from models allowed to reject/defer. We provide further empirical evidence that TAC works well on multiple types of architectures and data modalities and that it is at least as good as state-of-the-art alternative confidence scores derived from existing models. Recent work has revealed interesting emerging properties for representations learned by neural networks (Papernot & McDaniel, 2018; Kalibhat et al., 2022; Bäuerle et al., 2022). In particular, simple class-dependent patterns were observed after training: there are groups of representations that consistently activate more strongly depending on high-level features of inputs. This behaviour can be useful to define predictors able to reject/defer test data that do not follow common patterns, provided that one can efficiently verify similarities between new data and common patterns. Well known limitations of this model class can then be addressed such as its lack of robustness to natural distribution shifts (Ben-David et al., 2006), or against small but carefully crafted perturbations to its inputs (Szegedy et al., 2013; Goodfellow et al., 2014). These empirical evidences and potential use cases naturally lead to the question: can we enforce simple class-dependent structure rather than hope it emerges? In this work, we address this question and show that one can indeed constrain representations to follow simple, class-dependent, and efficiently verifiable patterns on learned representations. In particular, we turn the label set into a set of hard-coded class-specific binary codes and define models such that activations obtained from different layers match those patterns.
Machine Learning Research Trends in Africa: A 30 Years Overview with Bibliometric Analysis Review
Ezugwu, Absalom E., Oyelade, Olaide N., Ikotun, Abiodun M., Agushaka, Jeffery O., Ho, Yuh-Shan
The machine learning (ML) paradigm has gained much popularity today. Its algorithmic models are employed in every field, such as natural language processing, pattern recognition, object detection, image recognition, earth observation and many other research areas. In fact, machine learning technologies and their inevitable impact suffice in many technological transformation agendas currently being propagated by many nations, for which the already yielded benefits are outstanding. From a regional perspective, several studies have shown that machine learning technology can help address some of Africa's most pervasive problems, such as poverty alleviation, improving education, delivering quality healthcare services, and addressing sustainability challenges like food security and climate change. In this state-of-the-art paper, a critical bibliometric analysis study is conducted, coupled with an extensive literature survey on recent developments and associated applications in machine learning research with a perspective on Africa. The presented bibliometric analysis study consists of 2761 machine learning-related documents, of which 89% were articles with at least 482 citations published in 903 journals during the past three decades. Furthermore, the collated documents were retrieved from the Science Citation Index EXPANDED, comprising research publications from 54 African countries between 1993 and 2021. The bibliometric study shows the visualization of the current landscape and future trends in machine learning research and its application to facilitate future collaborative research and knowledge exchange among authors from different research institutions scattered across the African continent.
Inferring High-level Geographical Concepts via Knowledge Graph and Multi-scale Data Integration: A Case Study of C-shaped Building Pattern Recognition
Wei, Zhiwei, Xiao, Yi, Xu, Wenjia, Shu, Mi, Cheng, Lu, Wang, Yang, Liu, Chunbo
Effective building pattern recognition is critical for understanding urban form, automating map generalization, and visualizing 3D city models. Most existing studies use object-independent methods based on visual perception rules and proximity graph models to extract patterns. However, because human vision is a part-based system, pattern recognition may require decomposing shapes into parts or grouping them into clusters. Existing methods may not recognize all visually aware patterns, and the proximity graph model can be inefficient. To improve efficiency and effectiveness, we integrate multi-scale data using a knowledge graph, focusing on the recognition of C-shaped building patterns. First, we use a property graph to represent the relationships between buildings within and across different scales involved in C-shaped building pattern recognition. Next, we store this knowledge graph in a graph database and convert the rules for C-shaped pattern recognition and enrichment into query conditions. Finally, we recognize and enrich C-shaped building patterns using rule-based reasoning in the built knowledge graph. We verify the effectiveness of our method using multi-scale data with three levels of detail (LODs) collected from the Gaode Map. Our results show that our method achieves a higher recall rate of 26.4% for LOD1, 20.0% for LOD2, and 9.1% for LOD3 compared to existing approaches. We also achieve recognition efficiency improvements of 0.91, 1.37, and 9.35 times, respectively.
What Makes a Good Dataset for Symbol Description Reading?
Lynch, Karol, Ploennigs, Joern, Eck, Bradley
The usage of mathematical formulas as concise representations of a document's key ideas is common practice. Correctly interpreting these formulas, by identifying mathematical symbols and extracting their descriptions, is an important task in document understanding. This paper makes the following contributions to the mathematical identifier description reading (MIDR) task: (i) introduces the Math Formula Question Answering Dataset (MFQuAD) with $7508$ annotated identifier occurrences; (ii) describes novel variations of the noun phrase ranking approach for the MIDR task; (iii) reports experimental results for the SOTA noun phrase ranking approach and our novel variations of the approach, providing problem insights and a performance baseline; (iv) provides a position on the features that make an effective dataset for the MIDR task.
Generalizable Implicit Neural Representations via Instance Pattern Composers
Kim, Chiheon, Lee, Doyup, Kim, Saehoon, Cho, Minsu, Han, Wook-Shin
Despite recent advances in implicit neural representations (INRs), it remains challenging for a coordinate-based multi-layer perceptron (MLP) of INRs to learn a common representation across data instances and generalize it for unseen instances. In this work, we introduce a simple yet effective framework for generalizable INRs that enables a coordinate-based MLP to represent complex data instances by modulating only a small set of weights in an early MLP layer as an instance pattern composer; the remaining MLP weights learn pattern composition rules for common representations across instances. Our generalizable INR framework is fully compatible with existing meta-learning and hypernetworks in learning to predict the modulated weight for unseen instances. Extensive experiments demonstrate that our method achieves high performance on a wide range of domains such as an audio, image, and 3D object, while the ablation study validates our weight modulation.
Learning Generalized Hybrid Proximity Representation for Image Recognition
Recently, deep metric learning techniques received attention, as the learned distance representations are useful to capture the similarity relationship among samples and further improve the performance of various of supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to the previous metric learning methods which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn better distance representation in a hybrid approach. To achieve this, we proposed a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
Ranking Loss and Sequestering Learning for Reducing Image Search Bias in Histopathology
Mazaheri, Pooria, Bidgoli, Azam Asilian, Rahnamayan, Shahryar, Tizhoosh, H. R.
Recently, deep learning has started to play an essential role in healthcare applications, including image search in digital pathology. Despite the recent progress in computer vision, significant issues remain for image searching in histopathology archives. A well-known problem is AI bias and lack of generalization. A more particular shortcoming of deep models is the ignorance toward search functionality. The former affects every model, the latter only search and matching. Due to the lack of ranking-based learning, researchers must train models based on the classification error and then use the resultant embedding for image search purposes. Moreover, deep models appear to be prone to internal bias even if using a large image repository of various hospitals. This paper proposes two novel ideas to improve image search performance. First, we use a ranking loss function to guide feature extraction toward the matching-oriented nature of the search. By forcing the model to learn the ranking of matched outputs, the representation learning is customized toward image search instead of learning a class label. Second, we introduce the concept of sequestering learning to enhance the generalization of feature extraction. By excluding the images of the input hospital from the matched outputs, i.e., sequestering the input domain, the institutional bias is reduced. The proposed ideas are implemented and validated through the largest public dataset of whole slide images. The experiments demonstrate superior results compare to the-state-of-art.
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
Jiang, Huaizu, Ma, Xiaojian, Nie, Weili, Yu, Zhiding, Zhu, Yuke, Zhu, Song-Chun, Anandkumar, Anima
A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts. We introduce Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images. It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. We carefully curate the few-shot instances with hard negatives, where positive and negative images only disagree on action labels, making mere recognition of object categories insufficient to complete our benchmarks. We also design multiple test sets to systematically study the generalization of visual learning models, where we vary the overlap of the HOI concepts between the training and test sets of few-shot instances, from partial to no overlaps. Bongard-HOI presents a substantial challenge to today's visual recognition models. The state-of-the-art HOI detection model achieves only 62% accuracy on few-shot binary prediction while even amateur human testers on MTurk have 91% accuracy. With the Bongard-HOI benchmark, we hope to further advance research efforts in visual reasoning, especially in holistic perception-reasoning systems and better representation learning.