Text Recognition
Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library
Tarride, Solène, Schneider, Yoann, Generali-Lince, Marie, Boillet, Mélodie, Abadie, Bastien, Kermorvant, Christopher
PyLaia is one of the most popular open-source software for Automatic Text Recognition (ATR), delivering strong performance in terms of speed and accuracy. In this paper, we outline our recent contributions to the PyLaia library, focusing on the incorporation of reliable confidence scores and the integration of statistical language modeling during decoding. Our implementation provides an easy way to combine PyLaia with n-grams language models at different levels. One of the highlights of this work is that language models are completely auto-tuned: they can be built and used easily without any expert knowledge, and without requiring any additional data. To demonstrate the significance of our contribution, we evaluate PyLaia's performance on twelve datasets, both with and without language modelling. The results show that decoding with small language models improves the Word Error Rate by 13% and the Character Error Rate by 12% in average. Additionally, we conduct an analysis of confidence scores and highlight the importance of calibration techniques.
- North America > United States > New York > New York County > New York City (0.14)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.63)
GatedLexiconNet: A Comprehensive End-to-End Handwritten Paragraph Text Recognition System
Kumari, Lalita, Singh, Sukhdeep, Rathore, Vaibhav Varish Singh, Sharma, Anuj
The Handwritten Text Recognition problem has been a challenge for researchers for the last few decades, especially in the domain of computer vision, a subdomain of pattern recognition. Variability of texts amongst writers, cursiveness, and different font styles of handwritten texts with degradation of historical text images make it a challenging problem. Recognizing scanned document images in neural network-based systems typically involves a two-step approach: segmentation and recognition. However, this method has several drawbacks. These shortcomings encompass challenges in identifying text regions, analyzing layout diversity within pages, and establishing accurate ground truth segmentation. Consequently, these processes are prone to errors, leading to bottlenecks in achieving high recognition accuracies. Thus, in this study, we present an end-to-end paragraph recognition system that incorporates internal line segmentation and gated convolutional layers based encoder. The gating is a mechanism that controls the flow of information and allows to adaptively selection of the more relevant features in handwritten text recognition models. The attention module plays an important role in performing internal line segmentation, allowing the page to be processed line-by-line. During the decoding step, we have integrated a connectionist temporal classification-based word beam search decoder as a post-processing step. In this work, we have extended existing LexiconNet by carefully applying and utilizing gated convolutional layers in the existing deep neural network. Our results at line and page levels also favour our new GatedLexiconNet. This study reported character error rates of 2.27% on IAM, 0.9% on RIMES, and 2.13% on READ-16, and word error rates of 5.73% on IAM, 2.76% on RIMES, and 6.52% on READ-2016 datasets.
- North America > United States (0.04)
- Europe > Spain (0.04)
- Asia > India > Chandigarh (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.83)
JSTR: Judgment Improves Scene Text Recognition
In this paper, we present a method for enhancing the accuracy of scene text recognition tasks by judging whether the image and text match each other. While previous studies focused on generating the recognition results from input images, our approach also considers the model's misrecognition results to understand its error tendencies, thus improving the text recognition pipeline. This method boosts text recognition accuracy by providing explicit feedback on the data that the model is likely to misrecognize by predicting correct or incorrect between the image and text. The experimental results on publicly available datasets demonstrate that our proposed method outperforms the baseline and state-of-the-art methods in scene text recognition.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Japan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Generative Shape Models: Joint Text Recognition and Segmentation with Very Little Training Data
Abstract: We demonstrate that a generative model for object shapes can achieve state of the art results on challenging scene text recognition tasks, and with orders of magnitude fewer training images than required for competing discriminative methods. In addition to transcribing text from challenging images, our method performs fine-grained instance segmentation of characters. We show that our model is more robust to both affine transformations and non-affine deformations compared to previous approaches.
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Lumos : Empowering Multimodal LLMs with Scene Text Recognition
Shenoy, Ashish, Lu, Yichao, Jayakumar, Srihari, Chatterjee, Debojeet, Moslehpour, Mohsen, Chuang, Pierce, Harpale, Abhay, Bhardwaj, Vikas, Xu, Di, Zhao, Shicong, Zhao, Longfang, Ramchandani, Ankit, Dong, Xin Luna, Kumar, Anuj
We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges, and discuss the system architecture, design choices, and modeling techniques employed to overcome these obstacles. We also provide a comprehensive evaluation for each component, showcasing high quality and efficiency.
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Greece (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.63)
Finally! Windows 11's Snipping Tool will let you copy text from screenshots
Well now this will be useful! Microsoft is adding a text recognition function (OCR) to the Windows 11 Snipping Tool. The new feature will let you to copy text from screenshots and paste it into word processing programs, for example. Currently, only Windows Insider testers from the Canary and Dev channels can try the new text copying feature in the Snipping Tool, though if all goes well you can expect to see it hit all Windows 11 machines at some point in the future. The new function, called "Text Actions," is available in Snipping Tool version 11.2308.33.0.
Towards Large-scale Building Attribute Mapping using Crowdsourced Images: Scene Text Recognition on Flickr and Problems to be Solved
Sun, Yao, Kruspe, Anna, Meng, Liqiu, Tian, Yifan, Hoffmann, Eike J, Auer, Stefan, Zhu, Xiao Xiang
Crowdsourced platforms provide huge amounts of street-view images that contain valuable building information. This work addresses the challenges in applying Scene Text Recognition (STR) in crowdsourced street-view images for building attribute mapping. We use Flickr images, particularly examining texts on building facades. A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition. Manual checking on a subset of STR-recognized images demonstrates high accuracy. We examined the correlation between STR results and building functions, and analysed instances where texts were recognized on residential buildings but not on commercial ones. Further investigation revealed significant challenges associated with this task, including small text regions in street-view images, the absence of ground truth labels, and mismatches in buildings in Flickr images and building footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban hotspot locations, we suggest differentiating the scenarios where STR proves effective while developing appropriate algorithms or bringing in additional data for handling other cases. Furthermore, interdisciplinary collaboration should be undertaken to understand the motivation behind building photography and labeling. The STR-on-Flickr results are publicly available at https://github.com/ya0-sun/STR-Berlin.
- Information Technology > Communications > Social Media > Crowdsourcing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.62)
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents
Rahman, Abdur, Ghosh, Arjun, Arora, Chetan
In this paper, we propose a novel approach to address the challenges of printed Urdu text recognition using high-resolution, multi-scale semantic feature extraction. Our proposed UTRNet architecture, a hybrid CNN-RNN model, demonstrates state-of-the-art performance on benchmark datasets. To address the limitations of previous works, which struggle to generalize to the intricacies of the Urdu script and the lack of sufficient annotated real-world data, we have introduced the UTRSet-Real, a large-scale annotated real-world dataset comprising over 11,000 lines and UTRSet-Synth, a synthetic dataset with 20,000 lines closely resembling real-world and made corrections to the ground truth of the existing IIITH dataset, making it a more reliable resource for future research. We also provide UrduDoc, a benchmark dataset for Urdu text line detection in scanned documents. Additionally, we have developed an online tool for end-to-end Urdu OCR from printed documents by integrating UTRNet with a text detection model. Our work not only addresses the current limitations of Urdu OCR but also paves the way for future research in this area and facilitates the continued advancement of Urdu OCR technology. The project page with source code, datasets, annotations, trained models, and online tool is available at abdur75648.github.io/UTRNet.
- Asia > India (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.65)
CLIPTER: Looking at the Bigger Picture in Scene Text Recognition
Aberdam, Aviad, Bensaïd, David, Golts, Alona, Ganz, Roy, Nuriel, Oren, Tichauer, Royee, Mazor, Shai, Litman, Ron
Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.
- Europe > Austria (0.04)
- Asia > Middle East > Israel (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.84)
Handwritten Text Recognition from Crowdsourced Annotations
Tarride, Solène, Faine, Tristan, Boillet, Mélodie, Mouchère, Harold, Kermorvant, Christopher
In this paper, we explore different ways of training a model for handwritten text recognition when multiple imperfect or noisy transcriptions are available. We consider various training configurations, such as selecting a single transcription, retaining all transcriptions, or computing an aggregated transcription from all available annotations. In addition, we evaluate the impact of quality-based data selection, where samples with low agreement are removed from the training set. Our experiments are carried out on municipal registers of the city of Belfort (France) written between 1790 and 1946. % results The results show that computing a consensus transcription or training on multiple transcriptions are good alternatives. However, selecting training samples based on the degree of agreement between annotators introduces a bias in the training data and does not improve the results. Our dataset is publicly available on Zenodo: https://zenodo.org/record/8041668.
- North America > United States > California > Santa Clara County > San Jose (0.05)
- Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.05)
- Europe > France > Île-de-France > Paris > Paris (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.62)
- Information Technology > Communications > Social Media > Crowdsourcing (0.53)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)