Optical Character Recognition
ProtoSnap: Prototype Alignment for Cuneiform Signs
Mikulinsky, Rachel, Alper, Morris, Gordin, Shai, Jiménez, Enrique, Cohen, Yoram, Averbuch-Elor, Hadar
The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. Cuneiform signs have complex internal structures which varied significantly across the eras, cultures, and geographic regions among which cuneiform writing was used. The study of these variations is part of a field called paleography, which is crucial for understanding the historical context of attested writing (Biggs, 1973; Homburg, 2021). 
However, while computational methods show promise for aiding experts in analyzing cuneiform texts (Bogacz and Mara, 2022), they are challenged by the vast variety of complex sign variants and their visual nature: Represented as wedge-shaped imprints in clay tablets which have often sustained physical damage, cuneiform appears as shadows on a non-uniform clay surface which may even be difficult for human experts to identify under non-optimal lighting conditions (Taylor, 2015).
Review for NeurIPS paper: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Weaknesses: I was a little confused about how the grouped 1x1 convolutions interact with the coupling layers. If the standard (half-and-half) partitioning is used for the coupling layers and the grouped 1x1 convolutions never mix channels outside of their group of 4, then half of the channels will never be transformed by any coupling layer. I'm assuming the authors deal with this issue somehow (since the results are good), but I only briefly scanned the code and didn't want to work through all of the index gymnastics. I could see readers being confused by these missing details. Update: In their response, the authors said they will explain more of the details of the grouped 1x1 convolutions in their revised version.
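The reviewer's concern can be checked with a small reachability sketch. It is not taken from the Glow-TTS code: the function name, the channel count of 8, and both grouping layouts are assumptions chosen for illustration. The sketch tracks which channel positions can ever be affected by a coupling layer, given that each block applies a grouped 1x1 convolution (mixing only within a group) followed by a half-and-half coupling that transforms the second half of the channels.

```python
def reachable_channels(num_channels, groups, num_blocks=4):
    """Return the channel positions that a coupling layer can ever affect.

    groups: list of lists of channel indices; the grouped 1x1 convolution
    is assumed to mix channels only inside each group.
    """
    half = num_channels // 2
    touched = set()
    for _ in range(num_blocks):
        # Grouped 1x1 convolution: a group that already contains an affected
        # channel spreads that influence to every channel in the group.
        for group in groups:
            if any(c in touched for c in group):
                touched.update(group)
        # Half-and-half coupling: the first half passes through unchanged,
        # the second half is transformed.
        touched.update(range(half, num_channels))
    return touched


# Hypothetical layout 1: contiguous groups of 4 that never cross the half boundary.
contiguous = [[0, 1, 2, 3], [4, 5, 6, 7]]
# Hypothetical layout 2: each group takes two channels from each half.
interleaved = [[0, 1, 4, 5], [2, 3, 6, 7]]

print(sorted(reachable_channels(8, contiguous)))
print(sorted(reachable_channels(8, interleaved)))
```

Under these assumptions the contiguous layout leaves the first four channels untouched no matter how many blocks are stacked, which is exactly the failure mode described above, while a layout whose groups straddle both halves reaches every channel after two blocks.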
LoCoML: A Framework for Real-World ML Inference Pipelines
Maddireddy, Kritin, Methukula, Santhosh Kotekal, Sridhar, Chandrasekar, Vaidhyanathan, Karthik
The widespread adoption of machine learning (ML) has brought forth diverse models with varying architectures and data requirements, introducing new challenges in integrating these systems into real-world applications. Traditional solutions often struggle to manage the complexities of connecting heterogeneous models, especially when dealing with varied technical specifications. These limitations are amplified in large-scale, collaborative projects where stakeholders contribute models with different technical specifications. To address these challenges, we developed LoCoML, a low-code framework designed to simplify the integration of diverse ML models within the context of the Bhashini Project - a large-scale initiative aimed at integrating AI-driven language technologies such as automatic speech recognition, machine translation, text-to-speech, and optical character recognition to support seamless communication across more than 20 languages. Initial evaluations show that LoCoML adds only a small amount of computational load, making it efficient and effective for large-scale ML integration. Our experience suggests that a low-code approach is a practical way to connect multiple ML models in a collaborative environment.
Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
Enstad, Tita, Trosterud, Trond, Røsok, Marie Iversdatter, Beyer, Yngvil, Roald, Marie
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
MathReader : Text-to-Speech for Mathematical Documents
Hyeon, Sieun, Jung, Kyudan, Kim, Nam-Joon, Ryu, Hyun Gon, Do, Jaeyoung
Text-to-Speech (TTS) document readers from Microsoft, Adobe, Apple, and OpenAI are in use worldwide. They provide relatively good TTS results for general plain text, but sometimes skip content or give unsatisfactory results for mathematical expressions. This is because most modern academic papers are written in LaTeX, and when LaTeX formulas are compiled, they are rendered as distinctive text forms within the document. However, traditional TTS document readers output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS. MathReader demonstrated a lower Word Error Rate (WER) than existing TTS document readers, such as Microsoft Edge and Adobe Acrobat, when processing documents containing mathematical formulas. MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat. This will significantly contribute to alleviating the inconvenience faced by users who want to listen to documents, especially those who are visually impaired. The code is available at https://github.com/hyeonsieun/MathReader.
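For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a generic implementation of that definition, not MathReader's evaluation script; the function names and example sentences are hypothetical. It also computes the relative reductions implied by the reported figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Hypothetical spoken-form example: one substituted word out of four.
print(word_error_rate("x squared plus one", "x two plus one"))

# Relative WER reductions implied by the figures reported in the abstract.
print(round(relative_reduction(0.510, 0.281), 3))  # versus Microsoft Edge
print(round(relative_reduction(0.617, 0.281), 3))  # versus Adobe Acrobat
```

In relative terms, the reported numbers correspond to roughly a 45% reduction against Microsoft Edge and roughly a 54% reduction against Adobe Acrobat.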
MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model
Baas, Matthew, Scholtz, Pieter, Mehta, Arnav, Dyson, Elliott, Prakash, Akshat, Kamper, Herman
Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: https://camb-ai.github.io/mars6-turbo/
Geometry Restoration and Dewarping of Camera-Captured Document Images
Istomin, Valery, Pereziabov, Oleg, Afanasyev, Ilya
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
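The grid-and-remap idea can be illustrated with a deliberately small sketch. It is not the authors' pipeline: it assumes the detected outline has already been reduced to four sample points on the top edge and four on the bottom edge, fits one cubic through each set, builds the grid by blending the two curves linearly, and samples with nearest-neighbour lookup. All function and parameter names are hypothetical.

```python
def cubic_through(points):
    """Return f(x) for the unique cubic through four (x, y) points (Lagrange form)."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


def dewarp(image, top_pts, bottom_pts, out_h, out_w):
    """Remap the region between two cubic edge curves onto an out_h x out_w rectangle.

    image is a list of rows; out_h and out_w must both be at least 2.
    """
    top = cubic_through(top_pts)
    bottom = cubic_through(bottom_pts)
    src_h, src_w = len(image), len(image[0])
    x_start, x_end = top_pts[0][0], top_pts[-1][0]
    out = []
    for r in range(out_h):
        t = r / (out_h - 1)  # 0 on the top edge, 1 on the bottom edge
        row = []
        for c in range(out_w):
            x = x_start + (x_end - x_start) * c / (out_w - 1)
            y = (1 - t) * top(x) + t * bottom(x)
            # Nearest-neighbour sampling, clamped to the source image bounds.
            sx = min(max(int(round(x)), 0), src_w - 1)
            sy = min(max(int(round(y)), 0), src_h - 1)
            row.append(image[sy][sx])
        out.append(row)
    return out


# Synthetic page that sags by one pixel in its two middle columns.
sag = [0, 1, 1, 0]
image = [[row - sag[col] for col in range(4)] for row in range(8)]
flat = dewarp(image,
              top_pts=[(0, 1), (1, 2), (2, 2), (3, 1)],
              bottom_pts=[(0, 4), (1, 5), (2, 5), (3, 4)],
              out_h=4, out_w=4)
for row in flat:
    print(row)
```

After remapping, each output row of the synthetic page holds a single constant value, meaning the sag has been removed. A practical implementation would typically use a denser grid and interpolated sampling instead of nearest-neighbour lookup.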
Now your phone can do homework and help with your taxes
Every app claims to save you time, but here's one that actually does. It turns your phone into a personal assistant capable of digitizing receipts, doing your math homework, and editing documents. College students, anyone dreading tax season, and remote workers will get a kick out of iScanner. You can try the app for free, but unlocking the full, ad-free experience normally costs 19.99 per year. Instead, you could grab an iScanner lifetime subscription for 27.99 with code FESTIVE30 at checkout and save a fortune (reg.
Efficient License Plate Recognition in Videos Using Visual Rhythm and Accumulative Line Analysis
Ribeiro, Victor Nascimento, Hirata, Nina S. T.
Video-based Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate text information from video captures. Traditional systems typically rely heavily on high-end computing resources and utilize multiple frames to recognize license plates, leading to increased computational overhead. In this paper, we propose two methods capable of efficiently extracting exactly one frame per vehicle and recognizing its license plate characters from this single image, thus significantly reducing computational demands. The first method uses Visual Rhythm (VR) to generate time-spatial images from videos, while the second employs Accumulative Line Analysis (ALA), a novel algorithm based on single-line video processing for real-time operation. Both methods leverage YOLO for license plate detection within the frame and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to extract textual information. Experiments on real videos demonstrate that the proposed methods achieve results comparable to traditional frame-by-frame approaches, with processing speeds three times faster.
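A visual rhythm image is built by taking the same scan line from every frame and stacking those lines over time, so a vehicle crossing the line shows up as a contiguous band of rows. The sketch below illustrates that construction on plain nested lists. The frame-selection rule (mean absolute difference from a known background line, then the middle frame of each band) is an assumption for illustration, not the detection step used in the paper, and all names are hypothetical.

```python
def visual_rhythm(frames, line_row):
    """Stack one scan line per frame: row t of the result is frame t's line."""
    return [list(frame[line_row]) for frame in frames]


def crossing_frames(rhythm, background_line, threshold):
    """Pick one frame index per run of frames whose line differs from the background."""
    picks = []
    run_start = None
    for t, line in enumerate(rhythm):
        diff = sum(abs(a - b) for a, b in zip(line, background_line)) / len(line)
        active = diff > threshold
        if active and run_start is None:
            run_start = t  # a vehicle starts crossing the line
        elif not active and run_start is not None:
            picks.append((run_start + t - 1) // 2)  # middle frame of the run
            run_start = None
    if run_start is not None:
        # The video ended while a vehicle was still on the line.
        picks.append((run_start + len(rhythm) - 1) // 2)
    return picks


# Toy video: six 3x4 grayscale frames; a bright object covers row 1 in frames 2 to 4.
empty = [[0] * 4 for _ in range(3)]
vehicle = [[0] * 4, [9] * 4, [0] * 4]
frames = [empty, empty, vehicle, vehicle, vehicle, empty]

rhythm = visual_rhythm(frames, line_row=1)
print(crossing_frames(rhythm, background_line=[0] * 4, threshold=1.0))
```

Only the selected frame per vehicle would then be passed to the plate detector and OCR model, which is where the computational saving over frame-by-frame processing comes from.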