Optical Character Recognition
Windows Photos adds fancy editing features from other Microsoft apps
Microsoft is adding features that make the Windows Photos app much more powerful, positioning it as a centerpiece for visual editing. The company is bringing the optical character recognition capabilities it developed several years ago into Photos, while pulling in design elements from the elegant Microsoft Designer app. It is also beefing up File Explorer with a more robust visual search capability. Unfortunately, it's adding a Copilot button as well, which for now doesn't do much. Microsoft's Windows Photos app languished for years, but it began enjoying a renaissance about two years ago with new AI-powered editing features.
One of the most frustrating problems at work: solved
It's 2025, and converting files from one format to another should only take a few clicks. But it often becomes a lengthy process requiring uploads to unsecured online conversion apps that can put your personal information at risk. Usually, this PDF conversion license is $99.99, but right now it's down to $23.99 when you use code SAVE20 at checkout. PDF Converter Pro works with Microsoft Word, Excel, PowerPoint, Text, HTML, PNG, and JPG files. It maintains your original layouts, images, and hyperlinks after conversion without losing quality.
This new text-to-speech AI model understands what it's saying - how to try it for free
Text-to-speech AI models are a great tool for tasks where human voice actors are typically used, such as audiobooks, dubbing, and commercials. However, because these models are not human and are unaware of what they are saying, they can sometimes sound noticeably robotic. Hume's new AI model seeks to tackle this issue. On Wednesday, Hume launched Octave, a text-to-speech large language model (LLM) with contextual awareness. The LLM can use this awareness to adjust the tune, rhythm, and timbre of its speech to match the meaning of the words it is reading, according to the company.
Calibrated Structured Prediction
Volodymyr Kuleshov, Percy S. Liang
In user-facing applications, displaying calibrated confidence measures (probabilities that correspond to true frequency) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
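To make the recalibration idea concrete, here is a minimal sketch of mapping a model's raw confidence for an event of interest (e.g., "this token of the MAP output is correct") to an empirical probability. The single raw-marginal feature, the isotonic-regression recalibrator, and the toy data are illustrative assumptions, not the paper's exact recipe, which uses richer structured features.

```python
# Sketch: recalibrate marginal confidences with a simple regressor fit on held-out data.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# raw_marginals[i]: model's marginal probability that token i of the MAP output is correct
# is_correct[i]:    1 if that token matched the ground truth on held-out data, else 0
raw_marginals = np.array([0.95, 0.80, 0.99, 0.60, 0.90, 0.40])
is_correct    = np.array([1,    1,    1,    0,    1,    0])

# Fit the recalibrator on the held-out events of interest.
recalibrator = IsotonicRegression(out_of_bounds="clip")
recalibrator.fit(raw_marginals, is_correct)

# At test time, report recalibrated probabilities instead of raw marginals.
test_marginals = np.array([0.85, 0.55])
print(recalibrator.predict(test_marginals))
```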
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single- and multi-speaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
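To illustrate what "styles as a latent random variable sampled by diffusion" can look like in code, here is a toy sketch of drawing a style vector by reverse diffusion conditioned on a text embedding. The DDPM-style schedule, the dummy denoiser, and all dimensions are placeholder assumptions, not StyleTTS 2's actual sampler.

```python
# Toy sketch: ancestral reverse-diffusion sampling of a style latent conditioned on text.
import torch

def sample_style(denoiser, text_emb, style_dim=128, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    s = torch.randn(1, style_dim)                 # start from pure noise
    for t in reversed(range(steps)):
        eps_hat = denoiser(s, torch.tensor([t]), text_emb)   # predict the noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (s - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(s) if t > 0 else torch.zeros_like(s)
        s = mean + torch.sqrt(betas[t]) * noise
    return s                                       # style vector fed to the TTS decoder

class DummyDenoiser(torch.nn.Module):
    """Placeholder noise predictor; a real model conditions far more carefully."""
    def __init__(self, style_dim=128, text_dim=64):
        super().__init__()
        self.net = torch.nn.Linear(style_dim + text_dim + 1, style_dim)
    def forward(self, s, t, text_emb):
        t_feat = t.float().view(1, 1).expand(s.shape[0], 1)
        return self.net(torch.cat([s, text_emb, t_feat], dim=-1))

style = sample_style(DummyDenoiser(), text_emb=torch.randn(1, 64))
print(style.shape)  # torch.Size([1, 128])
```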
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 [24] and Glow-TTS [8] can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that VAE is good at capturing long-range semantic features (e.g., prosody) even with a small model size but suffers from blurry and unnatural results, while normalizing flow is good at reconstructing frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture.
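The following is a minimal, self-contained sketch of the combination described above: a small VAE captures coarse structure while a flow-style coupling step, conditioned on the VAE output, refines details. The layer sizes and the single affine coupling step are illustrative assumptions, not PortaSpeech's actual architecture.

```python
# Sketch: tiny VAE for coarse mel structure plus a coupling-style post-net for detail.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, mel_dim=80, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(mel_dim, 2 * z_dim)   # -> (mu, logvar)
        self.dec = nn.Linear(z_dim, mel_dim)       # coarse mel reconstruction

    def forward(self, mel):
        mu, logvar = self.enc(mel).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

class CouplingPostNet(nn.Module):
    """One affine coupling step refining the mel, conditioned on the coarse output."""
    def __init__(self, mel_dim=80):
        super().__init__()
        half = mel_dim // 2
        self.net = nn.Linear(half + mel_dim, 2 * half)  # predicts (log_scale, shift)

    def forward(self, x, cond):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([xa, cond], dim=-1)).chunk(2, dim=-1)
        return torch.cat([xa, xb * torch.exp(log_s) + t], dim=-1)

mel = torch.randn(4, 80)                 # a batch of mel frames
vae, postnet = TinyVAE(), CouplingPostNet()
coarse, mu, logvar = vae(mel)            # VAE captures coarse, prosody-like structure
refined = postnet(mel, cond=coarse)      # coupling step adds fine-grained detail
print(refined.shape)                     # torch.Size([4, 80])
```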
ProtoSnap: Prototype Alignment for Cuneiform Signs
Rachel Mikulinsky, Morris Alper, Shai Gordin, Enrique Jiménez, Yoram Cohen, Hadar Averbuch-Elor
The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs.

Cuneiform signs have complex internal structures which varied significantly across the eras, cultures, and geographic regions among which cuneiform writing was used. The study of these variations is part of a field called paleography, which is crucial for understanding the historical context of attested writing (Biggs, 1973; Homburg, 2021). However, while computational methods show promise for aiding experts in analyzing cuneiform texts (Bogacz and Mara, 2022), they are challenged by the vast variety of complex sign variants and their visual nature: Represented as wedge-shaped imprints in clay tablets which have often sustained physical damage, cuneiform appears as shadows on a non-uniform clay surface which may even be difficult for human experts to identify under non-optimal lighting conditions (Taylor, 2015).
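A rough sketch of the matching step at the heart of this idea: compare deep features of a prototype sign image and a photographed sign, and move each skeleton keypoint of the prototype to its best-matching location in the photo. The generic feature extractor (random features below) and the plain cosine-similarity matching are simplifying assumptions; ProtoSnap additionally enforces structural consistency on these matches.

```python
# Sketch: snap prototype skeleton keypoints to a photo via deep-feature similarity.
import torch
import torch.nn.functional as F

def snap_keypoints(proto_feats, photo_feats, keypoints):
    """
    proto_feats, photo_feats: (C, H, W) deep feature maps of prototype and photo.
    keypoints: list of (y, x) skeleton points on the prototype grid.
    Returns the best-matching (y, x) in the photo for each keypoint.
    """
    C, H, W = photo_feats.shape
    photo_flat = F.normalize(photo_feats.reshape(C, -1), dim=0)   # (C, H*W)
    snapped = []
    for y, x in keypoints:
        q = F.normalize(proto_feats[:, y, x], dim=0)              # (C,)
        sim = q @ photo_flat                                      # cosine similarity map
        idx = int(sim.argmax())
        snapped.append((idx // W, idx % W))
    return snapped

# Toy usage with random features standing in for a real extractor.
proto = torch.randn(64, 32, 32)
photo = torch.randn(64, 32, 32)
print(snap_keypoints(proto, photo, [(5, 7), (20, 14)]))
```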
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and the countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantics-aware generative text-to-speech model that uses an online dictionary as prior information already present in natural language. Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary, and to obtain the corresponding pronunciations. The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective.
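A minimal sketch of the semantics-to-pronunciation attention idea: attend from a (possibly polyphonic) character's contextual embedding over the embeddings of its dictionary senses, and take the pronunciation of the highest-weighted sense. The dictionary format, the scaled dot-product attention, and the toy data are illustrative assumptions rather than the exact S2PA module.

```python
# Sketch: pick a pronunciation by attending over a character's dictionary senses.
import torch
import torch.nn.functional as F

def s2pa(char_emb, sense_embs, sense_prons):
    """
    char_emb:    (d,) contextual embedding of the character in the sentence.
    sense_embs:  (n, d) embeddings of the character's dictionary senses/glosses.
    sense_prons: list of n candidate pronunciations (one per sense).
    """
    attn = F.softmax(sense_embs @ char_emb / char_emb.shape[0] ** 0.5, dim=0)
    return sense_prons[int(attn.argmax())], attn

d = 32
char_emb = torch.randn(d)
sense_embs = torch.randn(3, d)
pron, attn = s2pa(char_emb, sense_embs, ["zhōng", "zhòng", "zhōng"])
print(pron, attn)
```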
Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Details of the Model Architecture
The detailed encoder architecture is depicted in Figure 7. The decoder architecture, along with some implementation details we use in the decoder, is depicted in Figure 8. We design the grouped 1x1 convolutions to be able to mix channels: for each group, the same number of channels are extracted from one half of the feature map separated by the coupling layers and from the other half, respectively. Figure 8c shows an example.
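A small sketch of the channel-mixing arrangement described above: each group of a grouped 1x1 convolution is built from an equal number of channels taken from the two halves that the coupling layers keep separate. The specific interleaving permutation and the toy sizes below are assumptions based on this description, not the reference implementation.

```python
# Sketch: build grouped 1x1 conv inputs from equal channel counts of both coupling halves.
import torch
import torch.nn as nn

channels, groups = 8, 2
per_half = channels // 2
per_group_half = per_half // groups      # channels drawn from each half per group

x = torch.randn(1, channels, 10)         # (batch, channels, time)
first_half, second_half = x[:, :per_half], x[:, per_half:]

# Interleave: each group gets per_group_half channels from each half.
grouped = []
for g in range(groups):
    a = first_half[:, g * per_group_half:(g + 1) * per_group_half]
    b = second_half[:, g * per_group_half:(g + 1) * per_group_half]
    grouped.append(torch.cat([a, b], dim=1))
mixed_input = torch.cat(grouped, dim=1)

# Grouped 1x1 convolution over the re-arranged channels.
conv = nn.Conv1d(channels, channels, kernel_size=1, groups=groups)
out = conv(mixed_input)
print(out.shape)                          # torch.Size([1, 8, 10])
```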
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis.
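The monotonic alignment search at the core of this method is a dynamic program over a matrix of log-likelihoods between text tokens and speech frames; a compact sketch follows. It follows the published algorithm in spirit, but the variable names and the toy data are our own, and the real implementation vectorizes and batches this computation.

```python
# Sketch: dynamic-programming search for the most probable monotonic alignment.
import numpy as np

def monotonic_alignment_search(value):
    """value[i, j] = log-likelihood of speech frame j under text token i."""
    T_text, T_mel = value.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = value[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):          # token i needs at least i+1 frames
            stay = Q[i, j - 1]                        # keep aligning frames to token i
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance to the next token
            Q[i, j] = value[i, j] + max(stay, move)

    # Backtrack from the last token at the last frame to recover the alignment path.
    path = np.zeros((T_text, T_mel), dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, 0, -1):
        path[i, j] = 1
        if i > 0 and (i == j or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    path[i, 0] = 1
    return path

value = np.log(np.random.rand(3, 7))  # toy log-likelihoods: 3 text tokens, 7 mel frames
print(monotonic_alignment_search(value))
```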