AITopics | Optical Character Recognition

Collaborating Authors

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound wherecontinuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE . In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

News Overviews Instructional Materials AI-Alerts Classics

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech

Nguyen, Linh The, Pham, Thinh, Nguyen, Dat Quoc

arXiv.org Artificial IntelligenceMay-31-2023

We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task. Our XPhoneBERT has the same model architecture as BERT-base, trained using the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that employing XPhoneBERT as an input phoneme encoder significantly boosts the performance of a strong neural TTS model in terms of naturalness and prosody and also helps produce fairly high-quality speech with limited training data. We publicly release our pre-trained XPhoneBERT with the hope that it would facilitate future research and downstream TTS applications for multiple languages. Our XPhoneBERT model is available at https://github.com/VinAIResearch/XPhoneBERT

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2305.19709

Country: Asia > Vietnam (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

Add feedback

Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages

Do, Phat, Coler, Matt, Dijkstra, Jelske, Klabbers, Esther

arXiv.org Artificial IntelligenceMay-30-2023

We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS. Our test with neural TTS data in the low-resource language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning on SOMOS leads to the best accuracy for both fine-tuned and zero-shot prediction. Further fine-tuning experiments show that using more than 30 percent of the total data does not lead to significant improvements. In addition, fine-tuning with data from a single listener shows promising system-level accuracy, supporting the viability of one-participant pilot tests. These findings can all assist the resource-conscious development of TTS for LRLs by progressing towards better zero-shot MOS prediction and informing the design of listening tests, especially in early-stage evaluation.

large language model, machine learning, somo, (22 more...)

arXiv.org Artificial Intelligence

2305.19396

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Netherlands (0.05)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.53)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.41)

Add feedback

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Saeki, Takaaki, Maiti, Soumi, Li, Xinjian, Watanabe, Shinji, Takamichi, Shinnosuke, Saruwatari, Hiroshi

arXiv.org Artificial IntelligenceMay-27-2023

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with a paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2301.12596

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.72)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.71)

Add feedback

EfficientSpeech: An On-Device Text to Speech Model

Atienza, Rowel

arXiv.org Artificial IntelligenceMay-23-2023

State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices. These models are characterized by large memory footprints and substantial number of operations due to the long-standing focus on speech quality with cloud inference in mind. Neural TTS models are generally not designed to perform standalone speech syntheses on resource-constrained and no Internet access edge devices. In this work, an efficient neural TTS called EfficientSpeech that synthesizes speech on an ARM CPU in real-time is proposed. EfficientSpeech uses a shallow non-autoregressive pyramid-structure transformer forming a U-Network. EfficientSpeech has 266k parameters and consumes 90 MFLOPS only or about 1% of the size and amount of computation in modern compact models such as Mixer-TTS. EfficientSpeech achieves an average mel generation real-time factor of 104.3 on an RPi4. Human evaluation shows only a slight degradation in audio quality as compared to FastSpeech2.

artificial intelligence, efficientspeech, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2305.13905

Country:

North America > Canada > Quebec > Montreal (0.05)
Asia > Philippines (0.04)

Genre: Research Report (0.40)

Industry: Information Technology (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Add feedback

ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

Wang, Yuyue, Xiao, Huan, Wu, Yihan, Song, Ruihua

arXiv.org Artificial IntelligenceMay-20-2023

Text to Speech (TTS) models can generate natural and high-quality speech, but it is not expressive enough when synthesizing speech with dramatic expressiveness, such as stand-up comedies. Considering comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, it requires real-world datasets and strong speech style modeling capabilities, which brings challenges. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for the stand-up comedy synthesis in low-resource scenarios. First, we extract prosody representation by the prosody encoder and condition it to the TTS model in a flexible way. Second, we enhance the personal rhythm modeling by a conditional duration predictor. Third, we model the personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten-minute training data for each comedian. The audio samples are available at https://xh621.github.io/stand-up-comedy-demo/

artificial intelligence, comedian, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2305.122

Country:

Asia > South Korea > Incheon > Incheon (0.05)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Austria (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

Add feedback

FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

Jang, Won, Lim, Dan, Park, Heayoung

arXiv.org Artificial IntelligenceMay-18-2023

This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation rates without sacrificing sample quality. We replaced each encoder block with an STFT, with parameters equal to the temporal resolution of each decoder block, leading to the skip connection. FastFit reduces the number of parameters and the generation time of the model by almost half while maintaining high fidelity. Through objective and subjective evaluations, we demonstrated that the proposed model achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality. We further showed that FastFit produces sound qualities similar to those of other baselines in text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2305.10823

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.56)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.56)

Add feedback

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

Yang, Li-Jen, Yang, Chao-Han Huck, Chien, Jen-Tzung

arXiv.org Artificial IntelligenceMay-18-2023

This paper presents a parameter-efficient learning (PEL) to develop a low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2\% to 0.8\% of original trainable parameters to achieve competitive performance in voice synthesis. Motivated by a theoretical foundation of optimal transport (OT), this study carries out PEL for TTS where an auxiliary unsupervised loss based on OT is introduced to maximize a difference between the pre-trained source domain and the (unseen) target domain, in addition to its supervised training loss. Further, we leverage upon this unsupervised loss refinement to boost system performance via either sliced Wasserstein distance or maximum mean discrepancy. The merit of this work is demonstrated by fulfilling PEL solutions based on residual adapter learning, and model reprogramming when evaluating the Mandarin accent adaptation. Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning, and the auxiliary unsupervised loss improves model performance empirically.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2023-1212

2305.1132

Country:

North America > United States (0.14)
Oceania > Australia (0.04)
Europe > United Kingdom (0.04)
(4 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.72)
(2 more...)

Add feedback

Multimodal Short Video Rumor Detection System Based on Contrastive Learning

Yang, Yuxing, Zhao, Junhao, Wang, Siyi, Min, Xiangyu, Wang, Pengchao, Wang, Haizhou

arXiv.org Artificial IntelligenceMay-17-2023

With the rise of short video platforms as prominent channels for news dissemination, major platforms in China have gradually evolved into fertile grounds for the proliferation of fake news. However, distinguishing short video rumors poses a significant challenge due to the substantial amount of information and shared features among videos, resulting in homogeneity. To address the dissemination of short video rumors effectively, our research group proposes a methodology encompassing multimodal feature fusion and the integration of external knowledge, considering the merits and drawbacks of each algorithm. The proposed detection approach entails the following steps: (1) creation of a comprehensive dataset comprising multiple features extracted from short videos; (2) development of a multimodal rumor detection model: first, we employ the Temporal Segment Networks (TSN) video coding model to extract video features, followed by the utilization of Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) to extract textual features. Subsequently, the BERT model is employed to fuse textual and video features; (3) distinction is achieved through contrast learning: we acquire external knowledge by crawling relevant sources and leverage a vector database to incorporate this knowledge into the classification output. Our research process is driven by practical considerations, and the knowledge derived from this study will hold significant value in practical scenarios, such as short video rumor identification and the management of social opinions.

information, machine learning, pattern recognition, (21 more...)

arXiv.org Artificial Intelligence

2304.08401

Country: Asia > China > Sichuan Province > Chengdu (0.04)

Genre: Research Report (0.50)

Industry: Media > News (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(3 more...)

Add feedback

A Late Multi-Modal Fusion Model for Detecting Hybrid Spam E-mail

Zhang, Zhibo, Damiani, Ernesto, Hamadi, Hussam Al, Yeun, Chan Yeob, Taher, Fatma

arXiv.org Artificial IntelligenceMay-15-2023

In recent years, spammers are now trying to obfuscate their intents by introducing hybrid spam e-mail combining both image and text parts, which is more challenging to detect in comparison to e-mails containing text or image only. The motivation behind this research is to design an effective approach filtering out hybrid spam e-mails to avoid situations where traditional text-based or image-baesd only filters fail to detect hybrid spam e-mails. To the best of our knowledge, a few studies have been conducted with the goal of detecting hybrid spam e-mails. Ordinarily, Optical Character Recognition (OCR) technology is used to eliminate the image parts of spam by transforming images into text. However, the research questions are that although OCR scanning is a very successful technique in processing text-and-image hybrid spam, it is not an effective solution for dealing with huge quantities due to the CPU power required and the execution time it takes to scan e-mail files. And the OCR techniques are not always reliable in the transformation processes. To address such problems, we propose new late multi-modal fusion training frameworks for a text-and-image hybrid spam e-mail filtering system compared to the classical early fusion detection frameworks based on the OCR method. Convolutional Neural Network (CNN) and Continuous Bag of Words were implemented to extract features from image and text parts of hybrid spam respectively, whereas generated features were fed to sigmoid layer and Machine Learning based classifiers including Random Forest (RF), Decision Tree (DT), Naive Bayes (NB) and Support Vector Machine (SVM) to determine the e-mail ham or spam.

artificial intelligence, machine learning, optical character recognition, (4 more...)

arXiv.org Artificial Intelligence

2210.14616

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.53)
(2 more...)

Add feedback

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Xue, Jinlong, Deng, Yayue, Wang, Fengping, Li, Ya, Gao, Yingming, Tao, Jianhua, Sun, Jianqing, Liang, Jiaen

arXiv.org Artificial IntelligenceMay-3-2023

Conversational text-to-speech (TTS) aims to synthesize speech with proper prosody of reply based on the historical conversation. However, it is still a challenge to comprehensively model the conversation, and a majority of conversational TTS systems only focus on extracting global information and omit local prosody features, which contain important fine-grained information like keywords and emphasis. Moreover, it is insufficient to only consider the textual features, and acoustic features also contain various prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system, aiming to comprehensively utilize historical conversation and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model mixed with fine-grained context information and additionally considering acoustic features achieves better prosody performance and naturalness in CMOS tests.

artificial intelligence, information, optical character recognition, (11 more...)

arXiv.org Artificial Intelligence

2305.02269

Country:

Asia > China > Beijing > Beijing (0.05)
North America > Canada > Quebec > Montreal (0.05)
North America > United States (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.82)

Add feedback