Speech
AI comes to Reddit's main search bar - who needs Google now?
It's getting a little easier to use Reddit as a search engine. Last year, Reddit rolled out a new feature called Reddit Answers. Since so many people use Reddit as a Google replacement to tap into the community's immense knowledge, the site introduced AI-curated answers related to the topic you were searching for. At the time, you had to head to a dedicated Reddit Answers tab to use the feature, but now it's coming to the main search bar. In the company's first-quarter earnings call yesterday, Reddit CEO Steve Huffman said that the site was "working to integrate it [Reddit Answers] into Reddit's core search experience to further streamline the path from question to answer on Reddit."
Step aside, Siri: Perplexity's new AI voice assistant for iPhone can take it from here
There's a new AI in town threatening to take over your territory. The latest version of Perplexity's iPhone app introduces a new voice assistant designed to perform a variety of tasks. Many of these tasks are typically reserved for Siri, as they are not only interactive but can also access key information on your phone. Just like Siri, you can ask Perplexity's voice assistant to set a reminder, schedule a calendar event, play a song from Apple Music, open a podcast, and get directions via Apple Maps. Simply tell it to perform any of these tasks, and Perplexity will interact with the appropriate app or feature and display the results.
Subtitling Your Life
A little over thirty years ago, when he was in his mid-forties, my friend David Howorth lost all hearing in his left ear, a calamity known as single-sided deafness. "It happened literally overnight," he said. "My doctor told me, 'We really don't understand why.' " At the time, he was working as a litigator in the Portland, Oregon, office of a large law firm. His hearing loss had no impact on his job--"In a courtroom, you can get along fine with one ear"--but other parts of his life were upended. The brain pinpoints sound sources in part by analyzing minute differences between left-ear and right-ear arrival times, the same process that helps bats and owls find prey they can't see.
Rejected by 16 colleges, hired by Google. Now he's suing some of the schools for anti-Asian discrimination
Stanley Zhong had a 4.42 grade point average, a nearly perfect SAT score, had bested adults in competitive coding competitions, and had started his own electronic signing service, all while still in high school. When it came time to apply to colleges, Zhong's family wasn't overly concerned about his prospects, even amid an increasingly competitive admissions environment. But by the end of his senior year in Palo Alto in 2023, Zhong had received rejection letters from 16 of the 18 colleges to which he applied, including five University of California campuses that his father had figured would be safety schools. "It was surprise upon surprise upon surprise, and then it turned into frustration and, eventually, anger," his father, Nan Zhong, told The Times in a recent interview. "And I think both Stanley and I felt the same way, that something is really funky here."
TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
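The first strategy, privacy-preserved sample generation, can be sketched as a small pipeline: transcribe a call with ASR, scrub personally identifying details from the transcript, then regenerate audio with TTS so the released recording never contains the original speaker's voice. The helper names, the toy regex rules, and the stand-in TTS function below are all hypothetical illustrations, not the paper's actual implementation.

```python
import re


def anonymize_transcript(text: str) -> str:
    """Mask phone numbers and titled names in an ASR transcript (toy rules)."""
    text = re.sub(r"\b\d{3}[-\s]?\d{3,4}[-\s]?\d{4}\b", "[PHONE]", text)
    text = re.sub(r"\b(Mr|Ms|Mrs)\.\s+\w+", "[NAME]", text)
    return text


def build_sample(asr_transcript: str, tts=lambda t: f"<audio:{t}>"):
    """ASR transcript -> anonymize -> TTS regeneration.

    The regenerated audio is synthesized from the cleaned text, so neither
    the original voice nor the masked details survive into the dataset.
    """
    clean = anonymize_transcript(asr_transcript)
    return {"text": clean, "audio": tts(clean)}


sample = build_sample("Call Mr. Smith at 555-123-4567")
# sample["text"] == "Call [NAME] at [PHONE]"
```

A real pipeline would of course use a proper named-entity recognizer rather than regexes, and a neural TTS model in place of the placeholder lambda.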
On the Role of Priors in Bayesian Causal Learning
Bernhard C. Geiger, Roman Kern
In this work, we investigate causal learning of independent causal mechanisms from a Bayesian perspective. Confirming previous claims from the literature, we show in a didactically accessible manner that unlabeled data (i.e., cause realizations) do not improve the estimation of the parameters defining the mechanism. Furthermore, we observe the importance of choosing appropriate priors for the cause and mechanism parameters, respectively. Specifically, we show that a factorized prior results in a factorized posterior, which resonates with Janzing and Schölkopf's definition of independent causal mechanisms via the Kolmogorov complexity of the involved distributions, and with the concept of parameter independence of Heckerman et al.

Impact Statement: Learning the effect from a given cause is an important problem in many engineering disciplines, specifically in the field of surrogate modeling, which aims to reduce the computational cost of numerical simulations. Causal learning, however, cannot make use of unlabeled data (i.e., cause realizations) if the mechanism that produces the effect is independent of the cause. In this work, we recover this well-known fact from a Bayesian perspective.
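The factorization claim can be written out in one line. Assuming a cause parameter $\theta$ governing $p(x \mid \theta)$ and a mechanism parameter $\psi$ governing $p(y \mid x, \psi)$ (the symbols here are illustrative, not necessarily the paper's notation):

```latex
% Factorized prior => factorized posterior for labeled data D = {(x_i, y_i)}:
p(\theta, \psi \mid D)
  \propto \underbrace{p(\theta) \prod_i p(x_i \mid \theta)}_{\text{cause only}}
  \cdot   \underbrace{p(\psi) \prod_i p(y_i \mid x_i, \psi)}_{\text{mechanism only}}
```

Unlabeled cause realizations enter only the first factor, so they leave the posterior over the mechanism parameter $\psi$ unchanged, which is exactly the abstract's point that such data do not improve estimation of the mechanism.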
SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System
Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: https://supertonictts.github.io/.
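The text-to-latent module is trained with flow matching. As a generic illustration of that objective (not the paper's exact formulation), linear-path conditional flow matching interpolates between a noise sample x0 and a data latent x1 at a random time t, and trains the model to regress the constant velocity that carries x0 to x1:

```python
def cfm_training_example(x0, x1, t):
    """Linear-path conditional flow matching, one training pair.

    Returns the interpolated point x_t and the regression target: the model
    is trained so that v(x_t, t) ~= x1 - x0 along the straight path.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


x_t, v = cfm_training_example([0.0, 0.0], [2.0, 4.0], 0.5)
# x_t == [1.0, 2.0], v == [2.0, 4.0]
```

At inference time, integrating the learned velocity field from noise toward data recovers a latent, which the speech autoencoder then decodes to audio.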
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pretraining method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/
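The greedy pseudo-labelling idea can be sketched in a few lines of generic semi-supervised training logic: run the current model over unlabelled samples, keep only predictions above a confidence threshold, and treat those as labels for the next round. The function and threshold below are an illustrative sketch, not the paper's implementation.

```python
def greedy_pseudo_label(unlabelled, predict, threshold=0.9):
    """Select confident model predictions as pseudo-labels.

    `predict` maps a sample to a (label, confidence) pair; only samples
    whose confidence clears the threshold are kept for retraining.
    """
    pseudo = []
    for sample in unlabelled:
        label, confidence = predict(sample)
        if confidence >= threshold:
            pseudo.append((sample, label))
    return pseudo
```

In the unified setting, a single model can pseudo-label across the auditory, visual, and audiovisual streams, which is where sharing one model pays off over three disjoint pipelines.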
Disentangling Voice and Content with Self-Supervision for Speaker Recognition
Kong Aik Lee
For speaker recognition, it is difficult to extract an accurate speaker representation from speech because speech mixes speaker traits with content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each consisting of a learnable transition model that extracts a distinct speech component. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without using any labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since it requires neither additional model training nor extra data, it is easily applicable in practice.