AITopics | Optical Character Recognition

Collaborating Authors

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound wherecontinuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE . In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

News Overviews Instructional Materials AI-Alerts Classics

MathReader : Text-to-Speech for Mathematical Documents

Hyeon, Sieun, Jung, Kyudan, Kim, Nam-Joon, Ryu, Hyun Gon, Do, Jaeyoung

arXiv.org Artificial IntelligenceJan-13-2025

TTS (Text-to-Speech) document reader from Microsoft, Adobe, Apple, and OpenAI have been serviced worldwide. They provide relatively good TTS results for general plain text, but sometimes skip contents or provide unsatisfactory results for mathematical expressions. This is because most modern academic papers are written in LaTeX, and when LaTeX formulas are compiled, they are rendered as distinctive text forms within the document. However, traditional TTS document readers output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS. MathReader demonstrated a lower Word Error Rate (WER) than existing TTS document readers, such as Microsoft Edge and Adobe Acrobat, when processing documents containing mathematical formulas. MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat. This will significantly contribute to alleviating the inconvenience faced by users who want to listen to documents, especially those who are visually impaired. The code is available at https://github.com/hyeonsieun/MathReader.

formula, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.07088

Country: North America > United States (0.15)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.37)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Add feedback

MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Baas, Matthew, Scholtz, Pieter, Mehta, Arnav, Dyson, Elliott, Prakash, Akshat, Kamper, Herman

arXiv.org Artificial IntelligenceJan-10-2025

Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: https://camb-ai.github.io/mars6-turbo/

artificial intelligence, machine learning, mars6, (18 more...)

arXiv.org Artificial Intelligence

2501.05787

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.73)

Add feedback

Geometry Restoration and Dewarping of Camera-Captured Document Images

Istomin, Valery, Pereziabov, Oleg, Afanasyev, Ilya

arXiv.org Artificial IntelligenceJan-9-2025

This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI

algorithm, document image, geometry restoration, (15 more...)

arXiv.org Artificial Intelligence

2501.03145

Country:

Asia > Russia (0.04)
Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
Asia > India (0.04)
(8 more...)

Genre: Research Report (0.81)

Industry:

Health & Medicine (0.67)
Media > Photography (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.89)

Add feedback

Now your phone can do homework and help with your taxes

Popular ScienceJan-8-2025, 12:00:00 GMT

Every app claims to save you time, but here's one that actually does. It turns your phone into a personal assistant capable of digitizing receipts, doing your math homework, and editing documents. College students, anyone dreading tax season, and remote workers will get a kick out of iScanner. You can try the app for free, but unlocking the full, ad-free experience normally costs 19.99 per year. Instead, you could grab an iScanner lifetime subscription for 27.99 with code FESTIVE30 at checkout and save a fortune (reg.

code festive30, homework and help, iscanner, (2 more...)

Popular Science

Industry: Marketing (0.40)

Technology:

Information Technology > Communications (0.81)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)

Add feedback

Efficient License Plate Recognition in Videos Using Visual Rhythm and Accumulative Line Analysis

Ribeiro, Victor Nascimento, Hirata, Nina S. T.

arXiv.org Artificial IntelligenceJan-8-2025

Video-based Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate text information from video captures. Traditional systems typically rely heavily on high-end computing resources and utilize multiple frames to recognize license plates, leading to increased computational overhead. In this paper, we propose two methods capable of efficiently extracting exactly one frame per vehicle and recognizing its license plate characters from this single image, thus significantly reducing computational demands. The first method uses Visual Rhythm (VR) to generate time-spatial images from videos, while the second employs Accumulative Line Analysis (ALA), a novel algorithm based on single-line video processing for real-time operation. Both methods leverage YOLO for license plate detection within the frame and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to extract textual information. Experiments on real videos demonstrate that the proposed methods achieve results comparable to traditional frame-by-frame approaches, with processing speeds three times faster.

artificial intelligence, machine learning, recognition, (16 more...)

arXiv.org Artificial Intelligence

2501.0475

Country: South America > Brazil (0.70)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

How to read text from images on Windows

Popular ScienceJan-4-2025, 13:00:00 GMT

There are all kinds of reasons why you might want to extract text out of images. Maybe you've taken photos of restaurant bills and you want to make a record of what you've eaten; or perhaps you've got a bunch of screenshots that you need to get product names out of; or you could have scanned in some important documents that need sorting. Whatever the reason, Windows comes with built-in tools for picking out text from image files (technically known as OCR, or Optical Character Recognition)--in fact, there are several different ways, so you can pick the one that suits you best. Here's how to get started, assuming you already have your images saved somewhere. The Snipping Tool is the easiest way to extract text from images on Windows.

clipboard, screenshot, toolbar, (8 more...)

Popular Science

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.56)

Add feedback

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Fu, Ling, Yang, Biao, Kuang, Zhebin, Song, Jiajun, Li, Yuzhe, Zhu, Linghao, Luo, Qidi, Wang, Xinyu, Lu, Hao, Huang, Mingxin, Li, Zhang, Tang, Guozhi, Shan, Bin, Lin, Chunhui, Liu, Qi, Wu, Binghong, Feng, Hao, Liu, Hao, Huang, Can, Tang, Jingqun, Chen, Wei, Jin, Lianwen, Liu, Yuliang, Bai, Xiang

arXiv.org Artificial IntelligenceDec-31-2024

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.

large language model, machine learning, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

2501.00321

Country:

South America (0.04)
Africa (0.04)
North America > United States > Wisconsin > Marathon County > Wausau (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Education (0.67)
Banking & Finance (0.45)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.91)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.68)

Add feedback

VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry Extraction from First-Person View Flight Data

Gallagher, James E., Oughton, Edward J.

arXiv.org Artificial IntelligenceDec-24-2024

This paper presents the Visual Optical Recognition Telemetry EXtraction (VORTEX) system for extracting and analyzing drone telemetry data from First Person View (FPV) Uncrewed Aerial System (UAS) footage. VORTEX employs MMOCR, a PyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry variables from drone Heads Up Display (HUD) recordings, utilizing advanced image preprocessing techniques, including CLAHE enhancement and adaptive thresholding. The study optimizes spatial accuracy and computational efficiency through systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s, 20s) and coordinate processing methods. Results demonstrate that the 5-second sampling rate, utilizing 4.07% of available frames, provides the optimal balance with a point retention rate of 64% and mean speed accuracy within 4.2% of the 1-second baseline while reducing computational overhead by 80.5%. Comparative analysis of coordinate processing methods reveals that while UTM Zone 33N projection and Haversine calculations provide consistently similar results (within 0.1% difference), raw WGS84 coordinates underestimate distances by 15-30% and speeds by 20-35%. Altitude measurements showed unexpected resilience to sampling rate variations, with only 2.1% variation across all intervals. This research is the first of its kind, providing quantitative benchmarks for establishing a robust framework for drone telemetry extraction and analysis using open-source tools and spatial libraries.

artificial intelligence, calculation, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2412.18505

Country:

North America > United States (0.68)
Asia > Middle East > Iraq (0.28)

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology (0.93)
Government > Military (0.68)
Transportation > Air (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR

Abdellaif, Osama Hosam, Nader, Abdelrahman, Hamdi, Ali

arXiv.org Artificial IntelligenceDec-23-2024

This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA allows the integration of Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text structures.Extensive benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results are that LMRPA achieves superior performance, cutting the processing times by up to 52\%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, where UiPath finished in 18.1 seconds and Automation Anywhere finished in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA outperformed other automation tools conducting the same process by completing tasks in 12.7 seconds, while competitors took over 20 seconds to do the same. These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative solution to the existing state-of-the-art RPA models.

large language model, machine learning, pattern recognition, (17 more...)

arXiv.org Artificial Intelligence

2412.18063

Genre:

Research Report (1.00)
Overview (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.68)

Add feedback

AI-assisted summary of suicide risk Formulation

Rana, Rajib, Higgins, Niall, Haque, Kazi N., Reilly, John, Burke, Kylie, Turner, Kathryn, Pisani, Anthony R., Stedman, Terry

arXiv.org Artificial IntelligenceDec-19-2024

Background: Formulation, associated with suicide risk assessment, is an individualised process that seeks to understand the idiosyncratic nature and development of an individual's problems. Auditing clinical documentation on an electronic health record (EHR) is challenging as it requires resource-intensive manual efforts to identify keywords in relevant sections of specific forms. Furthermore, clinicians and healthcare professionals often do not use keywords; their clinical language can vary greatly and may contain various jargon and acronyms. Also, the relevant information may be recorded elsewhere. This study describes how we developed advanced Natural Language Processing (NLP) algorithms, a branch of Artificial Intelligence (AI), to analyse EHR data automatically. Method: Advanced Optical Character Recognition techniques were used to process unstructured data sets, such as portable document format (pdf) files. Free text data was cleaned and pre-processed using Normalisation of Free Text techniques. We developed algorithms and tools to unify the free text. Finally, the formulation was checked for the presence of each concept based on similarity using NLP-powered semantic matching techniques. Results: We extracted information indicative of formulation and assessed it to cover the relevant concepts. This was achieved using a Weighted Score to obtain a Confidence Level. Conclusion: The rigour to which formulation is completed is crucial to effectively using EHRs, ensuring correct and timely identification, engagement and interventions that may potentially avoid many suicide attempts and suicides.

artificial intelligence, optical character recognition, suicide risk formulation, (2 more...)

arXiv.org Artificial Intelligence

2412.10388

Genre: Research Report (0.66)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.87)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.53)

Add feedback