Optical Character Recognition
The best part of the future is finally having a permanent replacement for this annoying technology
We don't have flying cars, jetpacks haven't replaced walking, and I have not seen a single sign that we're all pivoting to wearing matching silver jumpsuits. The future is kind of lame. SwiftScan VIP is a scanner tool that basically replaces half of your old office equipment with an app that works on iOS and Android devices. It's also a lot cheaper than some desktop scanners, and you don't need to replace it every few years. During this limited-time sale, you can get a SwiftScan VIP Lifetime Subscription for only $41.99 (it's usually $199.99).
Text-to-speech with feeling - this new AI model does everything but shed a tear
Not so long ago, generative AI could only communicate with human users via text. Now it's increasingly being given the power of speech -- and this ability is improving by the day. On Thursday, AI voice platform ElevenLabs introduced v3, described on the company's website as "the most expressive text-to-speech model ever." The new model can exhibit a wide range of emotions and subtle communicative quirks -- like sighs, laughter, and whispering -- making its speech more humanlike than the company's previous models.
Meta-Album: Multi-domain Meta-Dataset for Few-Shot Image Classification
We introduce Meta-Album, an image classification meta-dataset designed to facilitate few-shot learning, transfer learning, and meta-learning, among other tasks. It includes 40 open datasets, each having at least 20 classes with 40 examples per class, with verified licences. They stem from diverse domains, such as ecology (fauna and flora), manufacturing (textures, vehicles), human actions, and optical character recognition, featuring various image scales (microscopic, human scale, remote sensing). All datasets are preprocessed, annotated, and formatted uniformly, and come in 3 versions (Micro, Mini, Extended) to match users' computational resources.
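For illustration, here is a minimal sketch of sampling a few-shot episode from one such uniformly formatted dataset. The images-plus-labels-CSV layout and the FILE_NAME/CATEGORY column names are assumptions made for this example, not Meta-Album's documented schema.

```python
# A minimal sketch of sampling an N-way K-shot episode from one dataset.
# The labels.csv path and its column names are illustrative assumptions,
# not Meta-Album's documented schema.
import csv
import random
from collections import defaultdict

def sample_episode(labels_csv, n_way=5, k_shot=5):
    """Sample an N-way K-shot support set from a labels CSV."""
    by_class = defaultdict(list)
    with open(labels_csv, newline="") as f:
        for row in csv.DictReader(f):
            by_class[row["CATEGORY"]].append(row["FILE_NAME"])  # assumed columns
    classes = random.sample(sorted(by_class), n_way)
    return {c: random.sample(by_class[c], k_shot) for c in classes}

# Example: a 5-way 5-shot episode. Since every class has at least 40
# examples per the dataset spec, any k_shot up to 40 is satisfiable.
# episode = sample_episode("mini/labels.csv")
```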
SHDocs: A dataset, benchmark, and method to efficiently generate high-quality, real-world specular highlight data with near-perfect alignment
A frequent problem in vision-based reasoning tasks such as object detection and optical character recognition (OCR) is the persistence of specular highlights. Specular highlights appear as bright spots of glare caused by the concentrated reflection of light; these spots manifest as image artifacts that occlude the underlying content from computer vision models and are challenging to reconstruct. Despite this, specular highlight removal receives relatively little attention due to the difficulty of acquiring high-quality, real-world data. We introduce a method to generate specular highlight data with near-perfect alignment and present SHDocs -- a dataset of specular highlights on document images created using our method. Through our benchmark, we demonstrate that our dataset enables us to surpass the performance of state-of-the-art specular highlight removal models and improve downstream OCR performance.
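As a rough illustration of the artifacts in question, a classical baseline flags glare pixels as very bright but weakly saturated regions. This is a generic heuristic sketch, not the SHDocs method, and the thresholds are arbitrary.

```python
# A classical glare-detection heuristic: specular highlights appear as
# near-white pixels (high value, low saturation in HSV). Generic baseline,
# not the SHDocs method; thresholds are arbitrary.
import cv2
import numpy as np

def specular_mask(bgr_image, sat_max=40, val_min=220):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    sat, val = hsv[..., 1], hsv[..., 2]
    return ((sat <= sat_max) & (val >= val_min)).astype(np.uint8) * 255

# img = cv2.imread("document.png")
# mask = specular_mask(img)
# Naive reconstruction of the occluded content via inpainting:
# restored = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
```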
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 [24] and Glow-TTS [8] can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing long-range semantic features (e.g., prosody) even with a small model size, but suffers from blurry and unnatural results; normalizing flow is good at reconstructing frequency bin-wise details, but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, to model both prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture.
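To make the division of labor concrete, here is a toy sketch of the pipeline shape: a VAE producing coarse output, followed by an invertible additive coupling layer standing in for the flow-based post-net. The layer sizes and modules are placeholders, not PortaSpeech's actual architecture.

```python
# A toy sketch of the VAE -> flow-based post-net pipeline shape described
# above. Sizes and modules are placeholders, not PortaSpeech's architecture:
# the VAE handles coarse structure (prosody), the post-net refines detail.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, mel_dim=80, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(mel_dim, 2 * z_dim)  # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, mel_dim)

    def forward(self, mel):
        mu, logvar = self.enc(mel).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

class ToyFlowPostNet(nn.Module):
    """Additive coupling layer: invertible, refines bin-wise detail."""
    def __init__(self, mel_dim=80):
        super().__init__()
        self.shift = nn.Linear(mel_dim // 2, mel_dim // 2)

    def forward(self, x):
        a, b = x.chunk(2, dim=-1)
        return torch.cat([a, b + self.shift(a)], dim=-1)

vae, postnet = ToyVAE(), ToyFlowPostNet()
coarse, mu, logvar = vae(torch.randn(4, 100, 80))  # (batch, frames, mel bins)
refined = postnet(coarse)
```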
Windows Photos adds fancy editing features from other Microsoft apps
Microsoft is adding ways to make the Windows Photos app much more powerful, combining elements of the elegant Designer app and making Photos more of a centerpiece for visual editing. Microsoft is taking optical character recognition capabilities that it developed several years ago and adding them to Photos, while pulling in design elements from Microsoft Designer, too. Finally, the company is beefing up File Explorer a bit as well, giving it a more robust visual search capability. Unfortunately, it's also adding a Copilot button, which for now doesn't really do much. Microsoft's Windows Photos app languished for years, but it started enjoying a renaissance about two years ago with new AI-powered editing features.
Appendices for the Paper: pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning
We provide more details and experimental results for pFL-Bench in the appendices. Sec.A covers the details of the adopted datasets and models (e.g., tasks, heterogeneous partitions, and model architectures), as well as extensions to other datasets and models with pFL-Bench. In addition, to demonstrate the potential and ease of extensibility of pFL-Bench, we also conducted experiments in the heterogeneous device resource scenario based on FedScale [38] (Sec.D.4), as well as experiments incorporating privacy-preserving techniques (Sec.D.5). We present detailed descriptions of the 12 publicly available dataset variants used in pFL-Bench. These datasets are popular in their corresponding fields and cover a wide range of domains, scales, partition manners, and Non-IID degrees. The Federated Extended MNIST (FEMNIST) dataset is widely used in FL for 62-class handwritten character recognition [32]. The original FEMNIST dataset contains 3,550 clients, each corresponding to a character writer from EMNIST [91]. Following [13], we adopt the sub-sampled version in pFL-Bench, which contains 200 clients and 43,400 images in total at a resolution of 28x28 pixels; the dataset is randomly split into train/valid/test sets with a 3:1:1 ratio.
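For illustration, here is a minimal sketch of the per-client 3:1:1 train/valid/test split described above. The client contents are synthetic stand-ins; this is not pFL-Bench's own splitting code.

```python
# A minimal sketch of a per-client 3:1:1 train/valid/test split.
# The client examples below are synthetic stand-ins, not FEMNIST data.
import random

def split_client(examples, ratios=(3, 1, 1), seed=0):
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    total = sum(ratios)
    n_train = len(examples) * ratios[0] // total
    n_valid = len(examples) * ratios[1] // total
    return (examples[:n_train],
            examples[n_train:n_train + n_valid],
            examples[n_train + n_valid:])

# 200 clients hold 43,400 images in the sub-sample, i.e. ~217 per client
# on average; the split below yields roughly 130 / 43 / 44 examples.
client = list(range(217))
train, valid, test = split_client(client)
```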
Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Details of the Model Architecture
The detailed encoder architecture is depicted in Figure 7. The decoder architecture, along with some implementation details we use in the decoder, is depicted in Figure 8. We design the grouped 1x1 convolutions to be able to mix channels. For each group, the same number of channels is extracted from each of the two halves of the feature map separated by the coupling layers. Figure 8c shows an example.
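For illustration, here is a minimal sketch of an invertible grouped 1x1 convolution that mixes channels within each group. The contiguous grouping and orthogonal initialization are simplifying assumptions for this example, not the Glow-TTS implementation.

```python
# A minimal sketch of an invertible grouped 1x1 convolution over a
# (batch, channels, time) tensor. Contiguous grouping and orthogonal
# weights are simplifications, not the Glow-TTS implementation.
import torch

def grouped_inv_1x1(x, weights):
    """Apply one (g, g) mixing matrix to each contiguous channel group."""
    groups = x.chunk(len(weights), dim=1)
    mixed = [torch.einsum("oc,bct->bot", w, g) for w, g in zip(weights, groups)]
    return torch.cat(mixed, dim=1)

channels, group_size = 8, 4
# Orthogonal init makes the transform trivially invertible (inverse = transpose).
ws = [torch.linalg.qr(torch.randn(group_size, group_size))[0]
      for _ in range(channels // group_size)]
x = torch.randn(2, channels, 50)
y = grouped_inv_1x1(x, ws)
x_rec = grouped_inv_1x1(y, [w.t() for w in ws])  # invert via transposed weights
assert torch.allclose(x, x_rec, atol=1e-5)
```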