 Optical Character Recognition


EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

arXiv.org Artificial Intelligence

Neural models are known to be over-parameterized, and recent work has shown that sparse text-to-speech (TTS) models can outperform dense models. Although a plethora of sparse methods has been proposed for other domains, such methods have rarely been applied in TTS. In this work, we seek to answer the question: how do selected sparse techniques affect performance and model complexity? We compare a Tacotron2 baseline and the results of applying five techniques. We then evaluate the performance via the factors of naturalness, intelligibility and prosody, while reporting model size and training time. Complementary to prior research, we find that pruning before or during training can achieve similar performance to pruning after training and can be trained much faster, while removing entire neurons degrades performance much more than removing parameters. To the best of our knowledge, this is the first work that compares sparsity paradigms in text-to-speech synthesis.


PowerToys update adds OCR and two more free tools

PCWorld

If you use Windows, you want PowerToys. This collection of open-source goodies, guided and published by Microsoft itself, is one of the best free software packages out there, and we can't recommend it enough. That only becomes more true today, as the company publishes an updated version with three brand-new tools: the previously spotted Text Extractor (an Optical Character Recognition tool), a ruler for measuring pixels on your screen, and a tool for quickly inserting little-used accents into text. Text Extractor is probably the most universally applicable addition here. It's an open-source version of Joseph Finney's paid Text Grab app, now integrated into PowerToys and free for Windows users.


Rabobank Australia and New Zealand Inks Deal with nCino

#artificialintelligence

This partnership will benefit the bank's Australian and New Zealand employees and customers, representing a multi-currency, cross-country commitment to provide a better banking experience. "By partnering with nCino, we will optimise our financial spreading analysis," said Alexa Glynn, Chief Operating Officer at RANZ. "This relationship will provide an excellent opportunity for RANZ to support our growing customer base and modernise our systems. We're delighted that nCino's technology will enable us to offer our customers and employees a better banking experience." The world's leading specialist food and agribusiness bank, Rabobank is one of Australia and New Zealand's largest agricultural lenders and a major provider of business and corporate banking services to the country's food and agribusiness sector. By adopting the nCino Bank Operating System, RANZ gains a digital solution that intelligently transforms the process of spreading financials by leveraging machine learning and optical character recognition (OCR).


A Black-Box Attack on Optical Character Recognition Systems

arXiv.org Artificial Intelligence

Adversarial machine learning is an emerging area showing the vulnerability of deep learning models. Exploring attack methods to challenge state-of-the-art artificial intelligence (A.I.) models is an area of critical concern. The reliability and robustness of such A.I. models are a major concern as the number of effective adversarial attack methods increases. Classification tasks are a major vulnerable area for adversarial attacks. The majority of attack strategies are developed for colored or gray-scaled images. Consequently, adversarial attacks on binary image recognition systems have not been sufficiently studied. Binary images are simple single-channel signals whose pixels take one of two possible values. The simplicity of binary images has a significant advantage compared to colored and gray-scaled images, namely computational efficiency. Moreover, most optical character recognition systems (O.C.R.s), such as handwritten character recognition, plate number identification, and bank check recognition systems, use binary images or binarization in their processing steps. In this paper, we propose a simple yet efficient attack method, Efficient Combinatorial Black-box Adversarial Attack, on binary image classifiers. We validate the efficiency of the attack technique on two different data sets and three classification networks, demonstrating its performance. Furthermore, we compare our proposed method with state-of-the-art methods regarding advantages and disadvantages as well as applicability.
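The binarization step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a fixed global threshold on 0-255 grayscale values; the function name `binarize`, the threshold of 128, and the sample pixels are hypothetical and do not come from the paper, which does not specify a binarization method.

```python
def binarize(gray, threshold=128):
    """Map each 0-255 grayscale pixel to 1 (ink) if darker than the threshold, else 0.

    The fixed global threshold is an illustrative assumption, not the paper's method.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# Hypothetical 3x3 grayscale patch: dark pixels (low values) form the stroke.
gray = [
    [250, 30, 245],
    [240, 25, 250],
    [235, 40, 20],
]
binary = binarize(gray)
print(binary)
```

The result is a single-channel image with two possible pixel values, the kind of input the abstract says most OCR pipelines operate on.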


OCR is getting super cool for Businesses

#artificialintelligence

A few months back, a student in class photographed the notes of the student in front of him and used iOS 15's new text-recognition feature to highlight the text, then copied and pasted it into his own notes. The moment was tweeted by @juanbuis, who shared a video of the student making the most of iOS 15's Live Text OCR feature. OCR, or Optical Character Recognition, is generally used to extract text from images or documents and convert it into machine-readable form. Recently, the popular app developer Alessandro Paluzzi also spotted that Twitter is working on an OCR feature for alt-text descriptions. In his tweet, Paluzzi shared a short video demonstrating how the Twitter feature will work. At Dwarf AI, we too want to make this super cool technology easily accessible to other businesses.


Optical Character Recognition • OCR • AI Terminology • AI Blog

#artificialintelligence

Optical character recognition (OCR) is a type of technology that allows computers to read text from images and convert it into digital data. This can be extremely useful for digitizing old documents or converting scanned images into editable text. OCR technology relies on pattern recognition to identify the shapes of letters and numbers in an image. It then compares these patterns to a database of known characters, in order to determine what the text says. OCR technology has come a long way in recent years, and can now reliably recognize a wide range of fonts and languages. However, it is not perfect, and can sometimes struggle with words that are poorly lit, blurry, or in a particularly ornate font.
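The pattern-matching idea described above can be sketched in a few lines. This is a toy illustration, assuming tiny 3x3 binary glyphs and a nearest-template rule based on the number of differing pixels; the `TEMPLATES` table and the function names are hypothetical, and real OCR engines use far richer features and models.

```python
# Hypothetical "database of known characters": 3x3 binary glyphs (1 = ink).
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized glyph grids."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Return the label of the template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


# A "T" with one stroke pixel missing, standing in for a blurry scan.
noisy_t = ((1, 1, 1), (0, 1, 0), (0, 0, 0))
print(recognize(noisy_t))
```

Because the match is by closest pattern rather than exact equality, a slightly degraded glyph can still be recognized, while heavier damage can push it closer to the wrong template.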


An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

arXiv.org Artificial Intelligence

This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) system for Arabic historical documents and examining how different modeling procedures interact with the problem. The first research studied the effect of Transformers on our custom-built Arabic dataset. One of the downsides of the first research was the size of the training data, a mere 15,000 images out of our 30 million images, due to lack of resources. Also, we add an image enhancement layer, time and space optimization, and a Post-Correction layer to aid the model in predicting the correct word for the correct context. Notably, we propose an end-to-end text recognition approach using Vision Transformers as an encoder, namely BEIT, and a vanilla Transformer as a decoder, eliminating CNNs for feature extraction and reducing the model's complexity. The experiments show that our end-to-end model outperforms convolutional backbones. The model attained a CER of 4.46%.
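For readers unfamiliar with the metric, character error rate (CER) is commonly defined as the character-level edit distance between the recognized text and the reference, divided by the reference length. The sketch below assumes that common definition; the abstract does not state exactly how its 4.46% figure was computed, and the function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        curr = [i]
        for j, hc in enumerate(hyp, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate, assuming the common edit-distance definition."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abcf"))
```

Under this definition, a CER of 4.46% corresponds to roughly 4 to 5 character edits per 100 reference characters.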


SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

arXiv.org Artificial Intelligence

In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset which is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems including vanilla neural acoustic models as well as models which allow prosodic variations. An LPCNet vocoder is used for all systems, so that the samples' variation depends only on the acoustic models. The synthesized utterances provide balanced and adequate domain and length coverage. We collect MOS naturalness evaluations on 3 English Amazon Mechanical Turk locales and share practices leading to reliable crowdsourced annotations for this task. We provide baseline results of state-of-the-art MOS prediction models on the SOMOS dataset and show the limitations that such models face when assigned to evaluate TTS utterances.


To show or not to show: Redacting sensitive text from videos of electronic displays

arXiv.org Artificial Intelligence

This, combined with major developments in computer vision and machine learning technology, has created enormous opportunities to make life better through the collection and utilization of this video data. Potential applications here range from improved security to interactive entertainment. However, the collection and utilization of this data also entails ethical privacy concerns and the potential for unwanted intrusion into people's lives without their permission. One way to attempt to achieve the benefits of more omnipresent video collection while mitigating the intrusion on privacy is through the automatic redaction of personally identifiable information (PII). This means automatically removing or obscuring content from video data that can be used to identify an individual while maintaining as much other video data as possible. A relatively new context generating a significant amount of video data is the cabins of automobiles.


The next PowerToy will give your PC easy OCR powers

PCWorld

We love us some PowerToys here at PCWorld, and it seems the semi-official add-on for Windows power users is only getting better. A recent update to the GitHub project for PowerToys indicates that "PowerOCR" is in the latter stages of approval and should land in the official app before too long. The tool will add an easy way for Windows users to activate Optical Character Recognition (OCR) via a quick screenshot interface. NeoWin spotted the changes to the PowerToys GitHub, with extensive documentation of the new PowerOCR tool and an apparent approval nod from a Microsoft manager. The tool is mostly the work of independent developer Joseph Finney, contributing code that works similarly to his paid Text Grab app.