human voice
Xania Monet's music is the stuff of nightmares. Thankfully her AI 'clankers' will be limited to this cultural moment Van Badham
Xania Monet is 'a photorealistic digital avatar accompanied by a sound that computers have generated to resemble that of a human voice singing words', writes Van Badham. Xania Monet is the latest digital nightmare to emerge from a hellscape of AI content production. The music iteration of AI "actor" Tilly Norwood, Xania is a composite product manufactured from digital tools: in this case, a photorealistic avatar accompanied by a sound that computers have generated to resemble that of a human voice singing words.
- North America > United States (0.17)
- Oceania > Australia (0.06)
- Europe > Ukraine (0.06)
- Europe > Denmark (0.05)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
Chen, Shunian, Xie, Xinyuan, Chen, Zheshu, Zhao, Liyan, Lee, Owen, Su, Zhan, Sun, Qilin, Wang, Benyou
High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found at https://github.com/satsuki2486441738/FusionAudio.
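The two-stage pipeline the abstract describes can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the extractor and synthesis steps, which in the real pipeline would be pretrained models (ASR, a music tagger, a sound-event detector, a video captioner) and an LLM prompt, are stubbed out here, and all function names are invented.

```python
# Hypothetical sketch of a two-stage multimodal captioning pipeline.
# Stage 1: per-modality extractors produce contextual cues.
# Stage 2: a synthesis step fuses the cues into one detailed caption.

def extract_cues(clip):
    # Stage 1 stub: each entry stands in for a specialized pretrained
    # model run on the clip (speech, music, sound events, visuals).
    return {
        "speech": f"transcript of {clip}",
        "music": f"music tags for {clip}",
        "sound": f"sound events in {clip}",
        "visual": f"visual context of {clip}",
    }

def synthesize_caption(cues):
    # Stage 2 stub: a real system would prompt an LLM with these cues;
    # here they are simply joined in a fixed order.
    ordered = [cues[k] for k in ("speech", "music", "sound", "visual")]
    return "; ".join(ordered)

caption = synthesize_caption(extract_cues("clip_001.mp4"))
```

The point of the structure is that caption quality is bounded by cue quality: each extractor can be swapped or upgraded independently of the fusion step.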
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.05)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (2 more...)
- Leisure & Entertainment (0.93)
- Media > Music (0.46)
ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis
Toyin, Hawau Olamide, Marew, Rufael, Alblooshi, Humaid, Magdy, Samar M., Aldarmaki, Hanan
We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions, intended for multi-speaker speech synthesis and useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus consists of a total of 83.52 hours of speech across 11 voices; around 10 hours consist of human voices from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.
- Africa > Middle East > Egypt (0.05)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (6 more...)
Is a Chat with a Bot a Conversation?
You are at the Princess's ball, and she is telling you a secret, but her orchestra of bears is making such a fearful lot of noise you cannot hear what she is saying. What do you say, dear? I'd lean in closer and say, "Could you repeat that? The bear-itone section is a bit too enthusiastic tonight!" In 1958, the year the illustrated children's book "What Do You Say, Dear?" appeared, the leaders of a field newly dubbed "artificial intelligence" spoke at a conference in Teddington, England, on "The Mechanisation of Thought Processes." Marvin Minsky, of M.I.T., talked about heuristic programming; Alan Turing gave a paper called "Learning Machines"; Grace Hopper assessed the state of computer languages; and scientists from Bell Labs débuted a computer that could synthesize human speech by having it sing "Daisy Bell" ("Daisy, Daisy, give me your answer, do . .
- Europe > United Kingdom > England (0.24)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Europe > France (0.04)
Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments
Shi, Zhonghao, Chen, Han, Velentza, Anna-Maria, Liu, Siqi, Dennler, Nathaniel, O'Connell, Allison, Matarić, Maja
Mindfulness-based therapies have been shown to be effective in improving mental health, and technology-based methods have the potential to expand the accessibility of these therapies. To enable real-time personalized content generation for mindfulness practice in these methods, high-quality computer-synthesized text-to-speech (TTS) voices are needed to provide verbal guidance and respond to user performance and preferences. However, the user-perceived quality of state-of-the-art TTS voices has not yet been evaluated for administering mindfulness meditation, which requires emotional expressiveness. In addition, work has not yet been done to study the effect of physical embodiment and personalization on the user-perceived quality of TTS voices for mindfulness. To that end, we designed a two-phase human subject study. In Phase 1, an online Mechanical Turk between-subject study (N=471) evaluated 3 (feminine, masculine, child-like) state-of-the-art TTS voices with 2 (feminine, masculine) human therapists' voices in 3 different physical embodiment settings (no agent, conversational agent, socially assistive robot) with remote participants. Building on findings from Phase 1, in Phase 2, an in-person within-subject study (N=94), we used a novel framework we developed for personalizing TTS voices based on user preferences, and evaluated user-perceived quality compared to best-rated non-personalized voices from Phase 1. We found that the best-rated human voice was perceived better than all TTS voices; the emotional expressiveness and naturalness of TTS voices were poorly rated, while users were satisfied with the clarity of TTS voices. Surprisingly, by allowing users to fine-tune TTS voice features, the user-personalized TTS voices could perform almost as well as human voices, suggesting user personalization could be a simple and very effective tool to improve user-perceived quality of TTS voice.
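The Phase 2 idea of letting users fine-tune TTS voice features until the result suits them can be sketched as a search over candidate feature settings scored by a user rating. Everything below is illustrative rather than the authors' framework: the feature names, their ranges, and the rating function (here a simulated listener who prefers a lower, slower, warmer voice) are all invented for the sketch.

```python
from itertools import product

# Candidate values for tunable TTS voice features (illustrative ranges).
FEATURES = {
    "pitch_shift": [-2, 0, 2],         # semitones relative to default
    "speaking_rate": [0.8, 1.0, 1.2],  # multiple of default rate
    "warmth": [0.3, 0.6, 0.9],         # arbitrary timbre control
}

def simulated_user_rating(settings):
    # Stand-in for a real listener's preference score: this fictitious
    # user likes a slightly lower, slower, warmer voice for meditation.
    target = {"pitch_shift": -2, "speaking_rate": 0.8, "warmth": 0.9}
    return -sum(abs(settings[k] - target[k]) for k in target)

def personalize():
    # Try every combination of feature values and keep the best-rated.
    best, best_score = None, float("-inf")
    for values in product(*FEATURES.values()):
        settings = dict(zip(FEATURES.keys(), values))
        score = simulated_user_rating(settings)
        if score > best_score:
            best, best_score = settings, score
    return best

preferred = personalize()
```

In a real study the rating would come from the participant listening to each rendered voice, and the search would likely be interactive (sliders) rather than exhaustive, but the loop above captures why personalization can close most of the gap to a human voice: the objective is the individual user's preference, not an average.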
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- Europe > Sweden > Stockholm > Stockholm (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Strength High (0.68)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.86)
Top 5 AI Voice Generators: Enhancing Your Business With Next-Gen Voice Solutions
Artificial intelligence (AI) has been advancing rapidly in recent years, and one area where it has made significant progress is in the generation of human-like voices. AI voice generators have emerged as game-changers. An AI voice generator is a type of software or technology that uses artificial intelligence (AI) algorithms to produce synthesized speech that sounds like a human voice. These generators, also known as text-to-speech (TTS) engines, convert written text into spoken words. These tools save time and money and offer a diverse range of realistic and human-like voices for various applications.
- Leisure & Entertainment (0.55)
- Media (0.34)
Microsoft's new VALL-E AI can capture your voice in 3 seconds
Microsoft researchers have presented an impressive new text-to-speech AI model, called Vall-E, which can listen to a voice for just a few seconds, then mimic that voice – including the emotional tone and acoustics – to say whatever you like. It's the latest of many AI algorithms that can harness a recording of a person's voice and make it say words and sentences that person never spoke – and it's remarkable for just how small a scrap of audio it needs in order to extrapolate an entire human voice. Where 2017's Lyrebird algorithm from the University of Montreal, for example, needed a full minute of speech to analyze, Vall-E needs just a three-second audio snippet. The AI has been trained on some 60,000 hours of English speech – read mainly, it seems, by audiobook narrators – and the researchers have presented a swag of samples in which Vall-E attempts to puppeteer a range of human voices. Some do a pretty extraordinary job of capturing the essence of the voice and building new sentences that sound natural – you'd struggle to tell which was the real voice and which was the synthesis. In others, the only giveaway is when the AI puts the emphasis in strange places in the sentence.
James Earl Jones done as Darth Vader, but his voice will live on because of AI
"Luke, I am your father" are five of the most famous words ever spoken on screen. When Darth Vader shattered Luke Skywalker's world in "The Empire Strikes Back," he sent shivers down the spines of audiences everywhere--in large part because of actor James Earl Jones' famous baritone. Now, Jones, 91, has announced he is hanging up the mask and retiring as the voice of one of the most infamous cinematic villains. But don't despair: Although Jones will no longer record new lines for Star Wars projects, the character--and Jones' voice--will live on thanks to artificial intelligence. As first reported by Vanity Fair, Respeecher, a Ukrainian voice synthesis company, will use a combination of archival recordings, voice acting and AI technology to continue bringing Darth Vader to the screen.
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
Parkinson's and CANCER can be picked up in your VOICE with new app under development
A mobile app may soon be able to diagnose you with chronic health conditions using the sound of your voice. Scientists are building an artificial intelligence that analyzes vibrations in speech and breathing patterns to look for clues for illness. The National Institutes of Health is funding a mammoth research project to collect voice data that will build the AI. Experts already know that speech is altered by conditions like Parkinson's or stroke, while breathing is affected by lung diseases. But the hope is that the computer program will be able to diagnose a wide range of conditions - including cancer and depression.
- Health & Medicine > Therapeutic Area > Oncology (0.71)
- Health & Medicine > Health Care Technology > Telehealth (0.57)
Horses and pigs can distinguish between negative and positive sounds in human speech
From 'Babe' to 'Black Beauty', popular culture is constantly telling us that speaking to animals gently and 'politely' is the best way to get them to do our bidding. Now a new study has shown the same is true in the real world, as domesticated animals like pigs and horses can tell the difference between negative and positive sounds in human speech. Researchers from the University of Copenhagen's Department of Biology and ETH Zurich found that the animals reacted more strongly to 'negatively charged' human voices. In some cases they even seemed to mirror the emotion expressed in the human voice, according to the researchers. Researchers concluded that it is most likely that horses may be able to perceive and interpret each other's sounds by virtue of their common biology.
- Europe > Denmark > Capital Region > Copenhagen (0.26)
- Europe > Switzerland > Zürich > Zürich (0.25)
- Europe > France (0.05)