
MelNet: A Real-Time Deep Learning Algorithm for Object Detection

Azadvatan, Yashar, Kurt, Murat

arXiv.org Artificial Intelligence

In this study, a novel deep learning algorithm for object detection, named MelNet, was introduced. MelNet was trained on the KITTI dataset for object detection. After 300 training epochs, MelNet attained an mAP (mean average precision) score of 0.732. Additionally, three alternative models - YOLOv5, EfficientDet, and Faster-RCNN-MobileNetv3 - were trained on the KITTI dataset and compared with MelNet for object detection. The results demonstrate the effectiveness of transfer learning in certain cases: pre-existing models trained on prominent datasets (e.g., ImageNet, COCO, and Pascal VOC) yield superior results. Another finding demonstrates the viability of creating a new model tailored to a specific scenario and training it on a specific dataset. This investigation shows that MelNet, trained exclusively on the KITTI dataset, surpasses EfficientDet after 150 epochs. Consequently, post-training, MelNet's performance closely aligns with that of the other pre-trained models.


Facebook's AI system can speak with Bill Gates's voice

#artificialintelligence

The slow progress on realistic text-to-speech systems is not for lack of trying. Numerous teams have attempted to train deep-learning algorithms to reproduce real speech patterns using large databases of audio. The problem with this approach, say Vasquez and Lewis, is with the type of data. Until now, most work has focused on audio waveform recordings. These show how the amplitude of sound changes over time, with each second of recorded audio consisting of tens of thousands of time steps.
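A rough back-of-the-envelope illustration of this point (the sample rate and hop size below are assumed for illustration, not taken from the article): a single second of raw audio contains tens of thousands of amplitude samples, while a spectrogram collapses the same second into on the order of a hundred frames.

```python
# Illustrative only: CD-quality audio is sampled at 44.1 kHz,
# so one second of waveform holds 44,100 amplitude values.
sample_rate = 44100   # assumed sample rate (Hz)
hop_length = 441      # assumed spectrogram hop size (samples per frame)

samples_per_second = sample_rate                # time steps in the raw waveform
frames_per_second = sample_rate // hop_length   # time steps in a spectrogram

print(samples_per_second)  # 44100 waveform time steps
print(frames_per_second)   # 100 spectrogram frames
```

This gap between tens of thousands of waveform steps and roughly a hundred spectrogram frames per second is what makes long-range structure easier to model in the frequency domain.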


Listen to this AI voice clone of Bill Gates created by Facebook's engineers

#artificialintelligence

We're headed for a revolution in computer-generated speech, and a voice clone of Microsoft co-founder Bill Gates demonstrates exactly why. In the clips embedded below, you can listen to what seems to be Gates reeling off a series of innocuous phrases. "A cramp is no small danger on a swim," he cautions. "Write a fond note to the friend you cherish," he advises. But each voice clip has been generated by a machine learning system named MelNet, designed and created by engineers at Facebook.


6 Ways Speech Synthesis Is Being Powered By Deep Learning

#artificialintelligence

This model was open sourced back in June 2019 as an implementation of the paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. This service is being offered by Resemble.ai. With this product, one can clone any voice and create dynamic, iterable, and unique voice content. Users input a short voice sample and the model -- trained only during playback time -- can immediately deliver text-to-speech utterances in the style of the sampled voice. Bengaluru's Deepsync offers an Augmented Intelligence that learns the way you speak.


Bill Gates, Stephen Hawking get AI voice clones, thanks to Facebook engineers

#artificialintelligence

Using Artificial Intelligence, two Facebook engineers have now successfully cloned the voices of famous personalities including Microsoft co-founder Bill Gates, late theoretical physicist Stephen Hawking, and American actor George Takei, among others. Mike Lewis and Sean Vasquez, the two Facebook engineers, developed a computer-generated speech system called MelNet using Artificial Intelligence. Beyond the voices of famous personalities, they have also created voice and music samples using AI. In a recently published research paper, they describe relying on machine learning to produce the convincing AI-generated voice clips. Apart from Bill Gates, Stephen Hawking, and George Takei, others whose voices have been cloned include primatologist Jane Goodall, professors Daphne Koller and Fei-Fei Li, scientist Stephen Wolfram, and Khan Academy founder Sal Khan.


MelNet: A Generative Model for Audio in the Frequency Domain

Vasquez, Sean, Lewis, Mike

arXiv.org Machine Learning

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis---showing improvements over previous approaches in both density estimates and human judgments.
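The two-dimensional time-frequency representation the abstract describes can be sketched with a plain short-time Fourier transform; the sample rate, window, and hop size below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Minimal sketch: a one-second waveform of tens of thousands of samples
# becomes a 2-D spectrogram with far fewer time steps.
sr = 44100                          # assumed sample rate (Hz)
t = np.arange(sr) / sr              # one second of sample times
wave = np.sin(2 * np.pi * 440 * t)  # a 440 Hz test tone

n_fft, hop = 1024, 441              # assumed window and hop sizes
window = np.hanning(n_fft)
frames = [wave[i:i + n_fft] * window
          for i in range(0, len(wave) - n_fft + 1, hop)]
spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2  # power spectrogram

print(wave.shape)  # (44100,) -- time-domain steps
print(spec.shape)  # (98, 513) -- far fewer time steps, one row per ~23 ms frame
```

Each row of `spec` summarizes roughly 23 ms of audio, so dependencies that span tens of thousands of waveform samples span only a handful of spectrogram rows, which is the representational advantage the paper exploits.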