AITopics

Country: Asia > South Korea > Seoul > Seoul (0.24)

Industry: Banking & Finance > Insurance (1.00)

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)

#artificialintelligence Jan-22-2023, 10:45:06 GMT

💣Notes to Self: Optical Character Recognition or Optical Character Reader (OCR)

Optical character recognition (OCR) is a technology that allows computers to recognize and extract text from images such as scanned documents, photographs, and bills. The process analyzes the image, identifies the individual characters within it, and converts those characters into machine-readable text. OCR software can automate tasks such as document scanning and business workflows, and it supports accessibility technology. It relies on pattern recognition techniques to identify and extract text. OCR technology has evolved over time and can now recognize text in multiple languages and fonts.
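The three steps described above (analyze the image, identify individual characters, emit machine-readable text) can be illustrated with a toy template matcher. This is only a sketch under strong assumptions: the 3x5 glyph bitmaps, the fixed-pitch layout, and every function name here are invented for illustration, and production OCR engines use far more elaborate segmentation and learned models.

```python
# Toy OCR by template matching. The 3x5 glyph bitmaps and the fixed-pitch
# layout are made-up assumptions for illustration, not a real OCR pipeline.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}
GLYPH_WIDTH = 3
GLYPH_HEIGHT = 5


def render(text):
    """Draw text as a bitmap: glyphs side by side, one blank column between them."""
    return [".".join(TEMPLATES[ch][row] for ch in text) for row in range(GLYPH_HEIGHT)]


def split_glyphs(image):
    """Step 1: segment the image into glyph-sized bitmaps."""
    step = GLYPH_WIDTH + 1  # glyph plus the blank separator column
    return [
        [row[x:x + GLYPH_WIDTH] for row in image]
        for x in range(0, len(image[0]), step)
    ]


def distance(a, b):
    """Count the pixels that differ between two bitmaps of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(image):
    """Steps 2 and 3: match each glyph to its nearest template and emit text."""
    return "".join(
        min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))
        for glyph in split_glyphs(image)
    )


image = render("TOIL")
# Flip one pixel to simulate scanner noise; the nearest template is unchanged.
noisy = ["." + image[0][1:]] + image[1:]
print(recognize(noisy))  # prints TOIL
```

Nearest-template matching tolerates a small amount of noise because a single flipped pixel rarely moves a glyph closer to a different template; real systems replace the hand-made templates with learned features.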

artificial intelligence, machine learning, optical character recognition, (13 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Jan-20-2023

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Li, Yinghao Aaron, Han, Cong, Jiang, Xilin, Mesgarani, Nima

Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sub-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.

artificial intelligence, machine learning, natural language, (17 more...)

2301.0881

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.69)
Research Report > Experimental Study (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.72)

arXiv.org Artificial Intelligence Jan-13-2023

On the feasibility of attacking Thai LPR systems with adversarial examples

Jiamsuchon, Chissanupong, Suaboot, Jakapan, Rattanavipanon, Norrathep

Recent advances in deep neural networks (DNNs) have significantly enhanced the capabilities of optical character recognition (OCR) technology, enabling its adoption in a wide range of real-world applications. Despite this success, DNN-based OCR has been shown to be vulnerable to adversarial attacks, in which the adversary can influence the DNN model's prediction by carefully manipulating the input to the model. Prior work has demonstrated the security impacts of adversarial attacks on OCR for various languages. However, to date, no studies have been conducted and evaluated on an OCR system tailored specifically for the Thai language. To bridge this gap, this work presents a feasibility study of performing adversarial attacks on a specific Thai OCR application: Thai License Plate Recognition (LPR). Moreover, we propose a new type of adversarial attack based on the semi-targeted scenario and show that this scenario is highly realistic in LPR applications. Our experimental results show the feasibility of our attacks, as they can be performed on a commodity desktop computer with over 90% attack success rate.
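To make "carefully manipulating input" concrete, here is a minimal sketch of a gradient-sign perturbation (in the style of the fast gradient sign method) against a tiny linear classifier. It is not the attack from this paper and does not model the semi-targeted scenario; the weights, bias, input values, and function names are all hypothetical, chosen only to show how a small bounded change to each pixel can flip a prediction.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def predict(weights, bias, x):
    """Binary linear classifier: class 1 if the score is positive, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def gradient_sign_attack(weights, x, label, eps):
    """Shift every input value by at most eps to push the score away from label.

    For a linear score the gradient with respect to the input is the weight
    vector itself, so the sign of each weight gives the attack direction.
    """
    direction = -1 if label == 1 else 1
    perturbed = [xi + direction * eps * sign(w) for w, xi in zip(weights, x)]
    # Keep pixel intensities inside the valid [0, 1] range.
    return [min(1.0, max(0.0, value)) for value in perturbed]


# Hypothetical model and input, for illustration only.
weights = [0.9, -0.4, 0.7, -0.8]
bias = -0.1
clean = [0.6, 0.4, 0.5, 0.3]

label = predict(weights, bias, clean)
adversarial = gradient_sign_attack(weights, clean, label, eps=0.2)
print(label, predict(weights, bias, adversarial))  # prints 1 0
```

Each input value moves by no more than 0.2, yet the score drops from 0.39 to about -0.17, which is enough to change the predicted class.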

adversarial example, artificial intelligence, machine learning, (19 more...)

2301.05506

Country:

Asia > Thailand > Phuket > Phuket (0.06)
North America > United States > Nevada > Clark County > Las Vegas (0.05)
South America > Brazil > Paraná > Curitiba (0.04)
(3 more...)

Genre: Research Report (0.70)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial Intelligence Jan-10-2023

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

Liu, Haogeng, Wang, Tao, Fu, Ruibo, Yi, Jiangyan, Wen, Zhengqi, Tao, Jianhua

Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as mining these three parts of information from the input and completing the reconstruction of speech. For TTS, the speech content information is derived from the text, while in VC it is derived from the source speech, so all units except the speech content extraction module are shared between the two tasks. We applied vector quantization and domain constraints to bridge the gap between the content domains of TTS and VC. Objective and subjective evaluations show that by combining the two tasks, TTS obtains better speaker modeling ability while VC gains an impressive speech content decoupling capability.

artificial intelligence, information, machine learning, (19 more...)

2301.03801

Country:

Asia > China > Beijing > Beijing (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

#artificialintelligence Jan-5-2023, 10:31:32 GMT

Apple Books quietly launches AI-narrated audiobooks - The Verge

Apple's website says the feature is initially only available for romance and fiction books, where it lists two available digital voices: Madison and Jackson. The service is only available in English at present, and Apple is oddly specific about the genres of books its digital narrators are able to tackle. "Primary category must be romance or fiction (literary, historical, and women's fiction are eligible; mysteries and thrillers, and science fiction and fantasy are not currently supported)," its website reads.

ai-narrated audiobook, science fiction, speech synthesis, (4 more...)

Industry: Media > Publishing (0.40)

Technology:

Information Technology > Artificial Intelligence > Science Fiction (0.75)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)

#artificialintelligence Jan-4-2023, 15:44:57 GMT

Convert Text to Speech in Python - DataFlair

Text to speech is the process of converting text into voice. A text-to-speech project takes words on a digital device and converts them into audio with a button click or finger touch. This Python text-to-speech project is very helpful for people who struggle with reading. To implement it, we will use basic Python concepts along with the Tkinter, gTTS, and playsound libraries. The objective of the project is to convert text into voice with the click of a button.
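A minimal sketch of the conversion step, assuming the third-party `gTTS` and `playsound` packages named above are installed and that network access is available (gTTS calls an online service). The function name `text_to_speech`, its parameters, and the default output path are hypothetical and are not taken from the DataFlair tutorial; the Tkinter interface is left out.

```python
def text_to_speech(text, out_path="speech.mp3", lang="en", play=False):
    """Convert text to an MP3 file with gTTS and optionally play it back."""
    if not text.strip():
        raise ValueError("text must not be empty")

    # Imported lazily so the validation above works without the packages.
    from gtts import gTTS  # third-party: pip install gTTS

    # gTTS sends the text to an online service and writes the audio to disk.
    gTTS(text=text, lang=lang).save(out_path)

    if play:
        from playsound import playsound  # third-party: pip install playsound
        playsound(out_path)

    return out_path
```

Calling `text_to_speech("Hello from Python", play=True)` would save `speech.mp3` and play it. In a GUI version, a Tkinter button's `command` callback could read the text from an entry widget and pass it to a function like this one.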

artificial intelligence, library, optical character recognition, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Assistive Technologies (1.00)

#artificialintelligence Jan-2-2023, 22:00:42 GMT

GitHub - jaketae/storyteller: Multimodal AI Story Teller, built with Stable Diffusion, GPT, and neural text-to-speech

A multimodal AI story teller, built with Stable Diffusion, GPT, and neural text-to-speech (TTS). Given a prompt as an opening line of a story, GPT writes the rest of the plot; Stable Diffusion draws an image for each sentence; a TTS model narrates each line, resulting in a fully animated video of a short story, replete with audio and visuals.

multimodal ai story teller, neural text-to-speech, stable diffusion, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.75)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.75)
Information Technology > Artificial Intelligence > Assistive Technologies (0.75)

#artificialintelligence Dec-29-2022, 04:15:09 GMT

Convert text to speech quickly with this intuitive platform

Some operations and tasks don't require painstaking attention to detail. With sensitive salary and wage information, bank and direct deposit accounts, social security numbers, and other personal information in play, the stakes are high. When preparing a payroll run or supporting payroll operations, it's important to follow a ...

information, intuitive platform

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

arXiv.org Artificial Intelligence Dec-29-2022

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Chen, Zehua, Wu, Yihan, Leng, Yichong, Chen, Jiawei, Liu, Haohe, Tan, Xu, Cui, Yang, Wang, Ke, He, Lei, Zhao, Sheng, Bian, Jiang, Mandic, Danilo

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental results show that in comparison with other speed-up methods of DDPMs: 1) ResGrad achieves better sample quality with the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech faster than baseline methods by more than 10 times. Audio samples are available at https://resgrad1.github.io/.
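The two quantities the abstract leans on, the residual target and the real-time factor, reduce to simple arithmetic. The sketch below is not the ResGrad model: it stands in a perfect residual predictor for the diffusion model, uses made-up 2x2 "spectrograms", and all function names are hypothetical. The real-time factor is taken in its usual sense of synthesis time divided by the duration of the generated audio.

```python
def residual_target(ground_truth, base_output):
    """Residual the refiner is trained to predict: ground truth minus base output."""
    return [
        [g - b for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(ground_truth, base_output)
    ]


def refine(base_output, predicted_residual):
    """Plug-and-play refinement: add the predicted residual to the base output."""
    return [
        [b + r for b, r in zip(b_row, r_row)]
        for b_row, r_row in zip(base_output, predicted_residual)
    ]


def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute per second of generated audio; lower is faster."""
    return synthesis_seconds / audio_seconds


# Made-up mel-spectrogram values (frames x bins), for illustration only.
ground_truth = [[1.0, 2.0], [3.0, 4.0]]
base_output = [[0.5, 2.5], [3.0, 3.0]]

target = residual_target(ground_truth, base_output)
# With a perfect residual prediction the refined output equals the ground truth.
refined = refine(base_output, target)
print(refined == ground_truth)      # prints True
print(real_time_factor(0.5, 2.0))   # prints 0.25
```

Because the base model already lands close to the ground truth, the residual is a smaller target than the full spectrogram, which is the intuition the abstract gives for why a lighter refiner can suffice.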

machine learning, real time system, residual denoising diffusion probabilistic model, (4 more...)

2212.14518

Genre: Research Report (0.69)

Technology:

Information Technology > Architecture > Real Time Systems (0.93)
Information Technology > Artificial Intelligence > Machine Learning (0.89)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.60)
(2 more...)