AITopics | Le, Matthew

Collaborating Authors

Le, Matthew

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

FlowDec: A flow-based full-band general audio codec with high perceptual quality

Welker, Simon, Le, Matthew, Chen, Ricky T. Q., Hsu, Wei-Ning, Gerkmann, Timo, Richard, Alexander, Wu, Yi-Chiao

arXiv.org Artificial IntelligenceMar-3-2025

We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.

artificial intelligence, deep learning, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2503.01485

Country:

North America > United States (0.14)
Europe > Germany (0.14)
Europe > France (0.14)
Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Bespoke Non-Stationary Solvers for Fast Sampling of Diffusion and Flow Models

Shaul, Neta, Singer, Uriel, Chen, Ricky T. Q., Le, Matthew, Thabet, Ali, Pumarola, Albert, Lipman, Yaron

arXiv.org Artificial IntelligenceMar-2-2024

This paper introduces Bespoke Non-Stationary (BNS) Solvers, a solver distillation approach to improve sample efficiency of Diffusion and Flow models. BNS solvers are based on a family of non-stationary solvers that provably subsumes existing numerical ODE solvers and consequently demonstrate considerable improvement in sample approximation (PSNR) over these baselines. Compared to model distillation, BNS solvers benefit from a tiny parameter space ($<$200 parameters), fast optimization (two orders of magnitude faster), maintain diversity of samples, and in contrast to previous solver distillation approaches nearly close the gap from standard distillation methods such as Progressive Distillation in the low-medium NFE regime. For example, BNS solver achieves 45 PSNR / 1.76 FID using 16 NFE in class-conditional ImageNet-64. We experimented with BNS solvers for conditional image generation, text-to-image generation, and text-2-audio generation showing significant improvement in sample approximation (PSNR) in all.

artificial intelligence, machine learning, solver, (16 more...)

arXiv.org Artificial Intelligence

2403.01329

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Audiobox: Unified Audio Generation with Natural Language Prompts

Vyas, Apoorv, Shi, Bowen, Le, Matthew, Tjandra, Andros, Wu, Yi-Chiao, Guo, Baishan, Zhang, Jiemin, Zhang, Xinyue, Adkins, Robert, Ngan, William, Wang, Jeff, Cruz, Ivan, Akula, Bapi, Akinyemi, Akinniyi, Ellis, Brian, Moritz, Rashel, Yungster, Yael, Rakotoarison, Alice, Tan, Liang, Summers, Chris, Wood, Carleigh, Lane, Joshua, Williamson, Mary, Hsu, Wei-Ning

arXiv.org Artificial IntelligenceDec-25-2023

Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speeds up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2312.15821

Country:

North America > Canada (0.14)
Europe > Germany (0.14)

Genre: Research Report (1.00)

Industry:

Media (0.68)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
(2 more...)

Add feedback

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, Williamson, Mary, Manohar, Vimal, Adi, Yossi, Mahadeokar, Jay, Hsu, Wei-Ning

arXiv.org Artificial IntelligenceOct-19-2023

Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found in \url{https://voicebox.metademolab.com}.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2306.15687

Country:

Europe > Germany (0.14)
Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Latent Discretization for Continuous-time Sequence Compression

Chen, Ricky T. Q., Le, Matthew, Muckley, Matthew, Nickel, Maximilian, Ullrich, Karen

arXiv.org Artificial IntelligenceDec-27-2022

Neural compression offers a domain-agnostic approach to creating codecs for lossy or lossless compression via deep generative models. For sequence compression, however, most deep sequence models have costs that scale with the sequence length rather than the sequence complexity. In this work, we instead treat data sequences as observations from an underlying continuous-time process and learn how to efficiently discretize while retaining information about the full sequence. As a consequence of decoupling sequential information from its temporal discretization, our approach allows for greater compression rates and smaller computational complexity. Moreover, the continuous-time approach naturally allows us to decode at different time intervals. We empirically verify our approach on multiple domains involving compression of video and motion capture sequences, showing that our approaches can automatically achieve reductions in bit rates by learning how to discretize.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2212.13659

Genre:

Research Report (0.50)
Instructional Material > Course Syllabus & Notes (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Spatio-Temporal Consistency as a Means to Identify Unlabeled Objects in a Continuous Data Field

Faghmous, James (University of Minnesota) | Nguyen, Hung (University of Minnesota) | Le, Matthew (Rochester Institute of Technology) | Kumar, Vipin (University of Minnesota)

AAAI ConferencesJul-14-2014

Mesoscale ocean eddies are a critical component of the Earth System as they dominate the ocean's kinetic energy and impact the global distribution of oceanic heat, salinity, momentum, and nutrients. Therefore, accurately representing these dynamic features is critical for our planet's sustainability. The majority of methods that identify eddies from satellite observations analyze the data in a frame-by-frame basis despite the fact that eddies are dynamic objects that propagate across space and time. We introduce the notion of spatio-temporal consistency to identify eddies in a continuous spatio-temporal field, to simultaneously ensure that the features detected are both spatially and temporally consistent. Our spatio-temporal consistency approach allows us to remove most of the expert criteria used in traditional methods to reduce false negatives. The removal of arbitrary heuristics enables us to render more complete eddy dynamics by identifying smaller and longer lived eddies compared to existing methods.

artificial intelligence, eddy, machine learning, (18 more...)

AAAI Conferences

Twenty-Eighth AAAI Conference on Artificial Intelligence

Country:

North America > United States (0.14)
Atlantic Ocean (0.14)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.68)

Add feedback