Optical Character Recognition
A Beginner's Guide to Language Models
A language model uses machine learning to estimate a probability distribution over words, which it then uses to predict the most likely next word in a sentence based on the preceding words. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition, and handwriting recognition.
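To make the idea of a probability distribution over the next word concrete, here is a minimal sketch of a bigram model, which conditions only on the single preceding word. The article does not name a specific model; the bigram choice, the toy corpus, and the function names train_bigram, next_word_distribution, and predict_next are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word follows each preceding word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Normalise the follower counts of `prev` into probabilities."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in followers.items()}


def predict_next(counts, prev):
    """Return the most likely next word, or None if `prev` was never seen."""
    dist = next_word_distribution(counts, prev)
    return max(dist, key=dist.get) if dist else None


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)

dist = next_word_distribution(model, "the")
print(dist)                        # probabilities of each word seen after "the"
print(predict_next(model, "the"))  # the single most likely follower
```

Modern language models replace the count table with a neural network and condition on far more context, but the output has the same shape: a distribution over candidate next words.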
AI + OCR - A Key Ingredient To Digital
Countless human hours are required to manually extract the data into a machine-readable format. This process is known as ETL (extract, transform, and load). Insurers that can maximize their ETL capabilities have a powerful competitive advantage. Optical character recognition (OCR), also known as text recognition, which converts text from scanned paper documents, photos, books, and PDF files into a machine-readable format, isn't new. What is new is coupling OCR with AI and machine-learning algorithms to reliably generate text that can be processed, indexed, and retrieved.
The Digital Insider
While there are all kinds of tips and tools to help you multitask, sometimes the best solutions are hiding in plain sight. A text-to-speech converter is one of those simple things that can help you listen to documents you have to read while working on something else, or add quality narration to videos and seminars to save you time from recording voices yourself. There are myriad applications, and Notevibes is one of the best solutions on the market.
Making your own document scanner in 40 lines of code
One of the benefits of being proficient with machine learning is having a good understanding of the algorithms that run some of the wonderful features we see on our devices. When Apple released iOS 16, one of the new functionalities was the ability to use the default Notes app as a digital scanner; think of it as a "scanner in your palm", to borrow a similar phrase from the legendary Steve Jobs. Before it was introduced, I had to scan documents with my phone using other services, usually apps downloaded from the App Store. Some were paid and some were free, and some of the free apps come with the disadvantage of a watermark, which somewhat defeats the purpose unless you subscribe to a paid version. Having worked on a number of computer vision projects, I wondered whether there is some computer vision library or ML algorithm one could use to replicate what my phone does. In this article, we will be using a very popular library familiar to most MLEs who work with deep learning, particularly computer vision: OpenCV.
PromptTTS: Controllable Text-to-Speech with Text Descriptions
Guo, Zhifang, Leng, Yichong, Wu, Yihan, Zhao, Sheng, Tan, Xu
Using a text description as prompt to guide the generation of text or images (e.g., GPT-3 or DALLE-2) has drawn wide attention recently. Beyond text and image generation, in this work, we explore the possibility of utilizing text descriptions to guide speech synthesis. Thus, we develop a text-to-speech (TTS) system (dubbed as PromptTTS) that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder to extract the corresponding representations from the prompt, and a speech decoder to synthesize speech according to the extracted style and content representations. Compared with previous works in controllable TTS that require users to have acoustic knowledge to understand style factors such as prosody and pitch, PromptTTS is more user-friendly since text descriptions are a more natural way to express speech style (e.g., "A lady whispers to her friend slowly"). Given that there is no TTS dataset with prompts, to benchmark the task of PromptTTS, we construct and release a dataset containing prompts with style and content information and the corresponding speech. Experiments show that PromptTTS can generate speech with precise style control and high speech quality. Audio samples and our dataset are publicly available.
This realistic text-to-speech tool is 98% off today
Video and audio have become a necessity in our everyday lives, especially when it comes to marketing a product or brand. When you need to create video and audio content to promote your business, text-to-speech tools can be very useful. Unfortunately, most of these apps have really robotic voices. If you want something that sounds more natural, Speechnow is worth your attention. This AI-powered app lets you turn text into audio in seconds, with 800 different languages and realistic voices to choose from.
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
Hsieh, Cheng-Ping, Ghosh, Subhankar, Ginsburg, Boris
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However, this approach has some challenges. Usually, fine-tuning requires several hours of high-quality speech per speaker. There is also a risk that fine-tuning will negatively affect the quality of speech synthesis for previously learnt speakers. In this paper we propose an alternative approach for TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few small adapter modules are added to the original network. The original weights are frozen, and only the adapters are fine-tuned on speech for a new speaker. The parameter-efficient fine-tuning approach produces a new model with a high level of parameter sharing with the original model. Our experiments on the LibriTTS, HiFi-TTS and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
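The frozen-weights-plus-adapter idea can be illustrated with a small sketch in plain Python. The abstract does not specify the adapter design, so the bottleneck shape with a skip connection, the zero-initialised up-projection, the dimensions, and the class names FrozenLayer and Adapter are all assumptions; this is not the authors' implementation.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


class FrozenLayer:
    """Stands in for one pretrained layer; its weights are never updated."""

    def __init__(self, dim, rng):
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.weights, x)


class Adapter:
    """Hypothetical bottleneck adapter: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Assumption: the up-projection starts at zero, so the adapter
        # initially passes its input through unchanged.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        delta = matvec(self.up, relu(matvec(self.down, x)))
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


rng = random.Random(0)
dim, bottleneck = 64, 4          # illustrative sizes, not from the paper
layer = FrozenLayer(dim, rng)
adapter = Adapter(dim, bottleneck, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
base_out = layer(x)
adapted_out = adapter(base_out)

frozen = dim * dim
trainable = adapter.num_params()
print(f"frozen parameters: {frozen}, trainable adapter parameters: {trainable}")
print("adapter is identity before training:", adapted_out == base_out)
```

Only the adapter's two small matrices would receive gradient updates for a new speaker, which is why the adapted model shares almost all of its parameters with the original.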
Efficient few-shot learning for pixel-precise handwritten document layout analysis
De Nardin, Axel, Zottin, Silvia, Paier, Matteo, Foresti, Gian Luca, Colombi, Emanuela, Piciarelli, Claudio
Layout analysis is a task of utmost importance in ancient handwritten document analysis and represents a fundamental step toward the simplification of subsequent tasks such as optical character recognition and automatic transcription. However, many of the approaches adopted to solve this problem rely on a fully supervised learning paradigm. While these systems achieve very good performance on this task, the drawback is that pixel-precise text labeling of the entire training set is a very time-consuming process, which makes this type of information rarely available in a real-world scenario. In the present paper, we address this problem by proposing an efficient few-shot learning framework that achieves performance comparable to current state-of-the-art fully supervised methods on the publicly available DIVA-HisDB dataset.
Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation
Morioka, Nobuyuki, Zen, Heiga, Chen, Nanxin, Zhang, Yu, Ding, Yifan
Adapting a neural text-to-speech (TTS) model to a target speaker typically involves fine-tuning most if not all of the parameters of a pretrained multi-speaker backbone model. However, serving hundreds of fine-tuned neural TTS models is expensive as each of them requires significant footprint and separate computational resources (e.g., accelerators, memory). To scale speaker adapted neural TTS voices to hundreds of speakers while preserving the naturalness and speaker similarity, this paper proposes a parameter-efficient few-shot speaker adaptation, where the backbone model is augmented with trainable lightweight modules called residual adapters. This architecture allows the backbone model to be shared across different target speakers. Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches, while requiring only ~0.1% of the backbone model parameters for each speaker.
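A quick back-of-the-envelope calculation shows why the per-speaker fraction matters when serving many voices. Only the roughly 0.1% figure comes from the abstract; the backbone size of 100 million parameters and the count of 500 speakers are hypothetical values chosen for illustration.

```python
BACKBONE_PARAMS = 100_000_000  # hypothetical backbone size
ADAPTER_FRACTION = 0.001       # ~0.1% per speaker, as stated in the abstract
NUM_SPEAKERS = 500             # hypothetical number of served voices


def full_finetune_params(backbone, speakers):
    """Every speaker gets a complete fine-tuned copy of the backbone."""
    return backbone * speakers


def adapter_params(backbone, speakers, fraction):
    """One shared backbone plus a small adapter set per speaker."""
    per_speaker = round(backbone * fraction)
    return backbone + per_speaker * speakers


full_total = full_finetune_params(BACKBONE_PARAMS, NUM_SPEAKERS)
adapter_total = adapter_params(BACKBONE_PARAMS, NUM_SPEAKERS, ADAPTER_FRACTION)

print(f"full fine-tuning: {full_total:,} parameters")
print(f"shared backbone + adapters: {adapter_total:,} parameters")
print(f"reduction factor: {full_total / adapter_total:.1f}x")
```

Under these assumed numbers, the stored parameter count grows with the adapter size rather than the backbone size as speakers are added, which is the scaling property the abstract highlights.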
Image Analysis 4.0 with new API endpoint and OCR model in preview
Enterprises and hobbyists alike have been using Azure Computer Vision's Image Analysis API to garner various insights from their images. These insights help power scenarios such as digital asset management, search engine optimization (SEO), image content moderation, and alt text for accessibility among others. We are thrilled to announce the preview release of Computer Vision Image Analysis 4.0 which combines existing and new visual features such as read optical character recognition (OCR), captioning, image classification and tagging, object detection, people detection, and smart cropping into one API. One call is all it takes to run all these features on an image. The OCR feature integrates more deeply with the Computer Vision service and includes performance improvements that are optimized for image scenarios that make OCR easy to use for user interfaces and near real-time experiences.