AITopics | Tjandra, Andros

Tjandra, Andros

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Tjandra, Andros, Wu, Yi-Chiao, Guo, Baishan, Hoffman, John, Ellis, Brian, Vyas, Apoorv, Shi, Bowen, Chen, Sanyuan, Le, Matt, Zacharov, Nick, Wood, Carleigh, Lee, Ann, Hsu, Wei-Ning

arXiv.org Artificial Intelligence, Feb-7-2025

The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics
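
The abstract says the models are evaluated against human mean opinion scores but does not name the agreement metric. As a minimal sketch, assuming a Pearson correlation between per-item predictions and MOS is one reasonable choice, the comparison could look like this. The function name and all score values are hypothetical and are not taken from the paper or its released code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only, not results from the paper.
human_mos = [2.1, 3.4, 4.0, 4.6, 1.8]   # averaged listener ratings per item
predicted = [2.4, 3.1, 4.2, 4.4, 2.0]   # no-reference model output per item

print(f"per-item Pearson r = {pearson(human_mos, predicted):.3f}")
```

A rank correlation could be swapped in when only the ordering of items matters.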

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.05139

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (1.00)

Industry:

Media > Music (0.68)
Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

Prajwal, K R, Shi, Bowen, Lee, Matthew, Vyas, Apoorv, Tjandra, Andros, Luthra, Mahi, Guo, Baishan, Wang, Huiyu, Afouras, Triantafyllos, Kant, David, Hsu, Wei-Ning

arXiv.org Artificial Intelligence, Oct-27-2024

We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over $2\sim5$ times smaller and requiring $5$ times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2410.20478

Country: Europe > Austria (0.28)

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Movie Gen: A Cast of Media Foundation Models

Polyak, Adam, Zohar, Amit, Brown, Andrew, Tjandra, Andros, Sinha, Animesh, Lee, Ann, Vyas, Apoorv, Shi, Bowen, Ma, Chih-Yao, Chuang, Ching-Yao, Yan, David, Choudhary, Dhruv, Wang, Dingkang, Sethi, Geet, Pang, Guan, Ma, Haoyu, Misra, Ishan, Hou, Ji, Wang, Jialiang, Jagadeesh, Kiran, Li, Kunpeng, Zhang, Luxin, Singh, Mannat, Williamson, Mary, Le, Matt, Yu, Matthew, Singh, Mitesh Kumar, Zhang, Peizhao, Vajda, Peter, Duval, Quentin, Girdhar, Rohit, Sumbaly, Roshan, Rambhatla, Sai Saketh, Tsai, Sam, Azadi, Samaneh, Datta, Samyak, Chen, Sanyuan, Bell, Sean, Ramaswamy, Sharadh, Sheynin, Shelly, Bhattacharya, Siddharth, Motwani, Simran, Xu, Tao, Li, Tianhe, Hou, Tingbo, Hsu, Wei-Ning, Yin, Xi, Dai, Xiaoliang, Taigman, Yaniv, Luo, Yaqiao, Liu, Yen-Cheng, Wu, Yi-Chiao, Zhao, Yue, Kirstain, Yuval, He, Zecheng, He, Zijian, Pumarola, Albert, Thabet, Ali, Sanakoyeu, Artsiom, Mallya, Arun, Guo, Baishan, Araya, Boris, Kerr, Breena, Wood, Carleigh, Liu, Ce, Peng, Cen, Vengertsev, Dimitry, Schonfeld, Edgar, Blanchard, Elliot, Juefei-Xu, Felix, Nord, Fraylie, Liang, Jeff, Hoffman, John, Kohler, Jonas, Fire, Kaolin, Sivakumar, Karthik, Chen, Lawrence, Yu, Licheng, Gao, Luya, Georgopoulos, Markos, Moritz, Rashel, Sampson, Sara K., Li, Shikai, Parmeggiani, Simone, Fine, Steve, Fowler, Tara, Petrovic, Vladan, Du, Yuming

arXiv.org Artificial Intelligence, Oct-17-2024

We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.

large language model, machine learning, natural language, (24 more...)

arXiv.org Artificial Intelligence

2410.13720

Country: Asia (0.45)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)

Industry:

Media > Music (1.00)
Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(6 more...)

Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Chien, Chung-Ming, Tjandra, Andros, Vyas, Apoorv, Le, Matt, Shi, Bowen, Hsu, Wei-Ning

arXiv.org Artificial Intelligence, Jun-10-2024

In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.

Our contributions are as follows: (1) we propose Voicebox Adapter, which augments Voicebox, a pre-trained speech generation model, with fine-grained controllability; (2) we explore different efficient fine-tuning methods to bridge the gap between pre-trained parameters and new fine-grained conditioning modules; (3) we show that Voicebox Adapter can generalize across various fine-grained conditions, attaining performance comparable to that achieved by fine-tuning the entire model with significantly fewer fine-tuned parameters; (4) we conduct experiments using varying amounts of fine-tuning data and different hidden dimension sizes, analyzing the performance of …
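
To give a rough sense of why LoRA with bias-tuning needs far fewer fine-tuned parameters than full fine-tuning, the sketch below counts parameters for a single linear layer. It assumes the standard LoRA formulation, in which the frozen weight matrix is augmented by a trainable low-rank product, and treats the layer's bias as additionally trainable. The layer sizes and rank are hypothetical and are not the paper's configuration.

```python
def lora_bias_param_counts(d_in, d_out, rank):
    """Count frozen and trainable parameters for one linear layer.

    Assumes a frozen weight of shape (d_out, d_in), trainable low-rank
    factors of shape (d_out, rank) and (rank, d_in), and a trainable bias.
    """
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out) + d_out  # low-rank factors plus bias
    return frozen, trainable


# Hypothetical layer size and rank, chosen only for illustration.
frozen, trainable = lora_bias_param_counts(d_in=1024, d_out=1024, rank=8)
print(f"frozen: {frozen}, trainable: {trainable}")
print(f"trainable share of the layer: {trainable / (frozen + trainable):.2%}")
```

The trainable count grows linearly with the layer width, while the frozen weight grows quadratically, which is where the saving comes from.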

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2406.06251

Country:

North America > United States (0.14)
Europe (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Xie, Jiamin, Li, Ke, Guo, Jinxi, Tjandra, Andros, Shangguan, Yuan, Sari, Leda, Wu, Chunyang, Jia, Junteng, Mahadeokar, Jay, Kalinli, Ozlem

arXiv.org Artificial Intelligence, Jan-11-2024

Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it requires several rounds of pruning and re-training for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting in either sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.

machine learning, natural language, pruning, (14 more...)

arXiv.org Artificial Intelligence

2309.13018

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.90)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.57)

Audiobox: Unified Audio Generation with Natural Language Prompts

Vyas, Apoorv, Shi, Bowen, Le, Matthew, Tjandra, Andros, Wu, Yi-Chiao, Guo, Baishan, Zhang, Jiemin, Zhang, Xinyue, Adkins, Robert, Ngan, William, Wang, Jeff, Cruz, Ivan, Akula, Bapi, Akinyemi, Akinniyi, Ellis, Brian, Moritz, Rashel, Yungster, Yael, Rakotoarison, Alice, Tan, Liang, Summers, Chris, Wood, Carleigh, Lane, Joshua, Williamson, Mary, Hsu, Wei-Ning

arXiv.org Artificial Intelligence, Dec-25-2023

Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speeds up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2312.15821

Country:

North America > Canada (0.14)
Europe > Germany (0.14)

Genre: Research Report (1.00)

Industry:

Media (0.68)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
(2 more...)

Generative Pre-training for Speech with Flow Matching

Liu, Alexander H., Le, Matt, Vyas, Apoorv, Shi, Bowen, Tjandra, Andros, Hsu, Wei-Ning

arXiv.org Artificial Intelligence, Oct-24-2023

Generative models have gained more and more attention in recent years for their remarkable success in tasks that require estimating and sampling a data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundational model for generation tasks in speech can be built with generative pre-training.
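
The abstract does not spell out the Flow Matching objective. As a minimal sketch, assuming the common straight-line probability path between a noise sample and a data sample, the regression target for the velocity model can be built as below. The function names and the toy vectors are hypothetical, and the masked conditioning mentioned in the abstract is not shown.

```python
def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and its target velocity.

    x0 is a noise sample, x1 a data sample, and t a time in [0, 1].
    On a straight-line path, x_t = (1 - t) * x0 + t * x1 and the
    velocity along the path is the constant x1 - x0.
    """
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def mean_squared_error(predicted, target):
    """Regression loss between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Toy two-dimensional example; real inputs would be speech feature frames.
noise = [0.0, 2.0]
data = [1.0, 4.0]
xt, target = flow_matching_pair(noise, data, t=0.5)

# A trained network would predict the velocity from (xt, t); a zero
# prediction stands in here just to show how the loss is evaluated.
loss = mean_squared_error([0.0, 0.0], target)
print(xt, target, loss)
```

At generation time the learned velocity field would be integrated from t = 0 to t = 1 with an ODE solver, starting from noise.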

artificial intelligence, generative pre-training, speech synthesis, (1 more...)

arXiv.org Artificial Intelligence

2310.16338

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.60)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.53)

Voice-preserving Zero-shot Multiple Accent Conversion

Jin, Mumin, Serai, Prashant, Wu, Jilong, Tjandra, Andros, Manohar, Vimal, He, Qing

arXiv.org Artificial Intelligence, Oct-14-2023

Most people who have tried to learn a foreign language would have experienced difficulties understanding or speaking with a native speaker's accent. For native speakers, understanding or speaking a new accent is likewise a difficult task. An accent conversion system that changes a speaker's accent but preserves that speaker's voice identity, such as timbre and pitch, has the potential for a range of applications, such as communication, language learning, and entertainment. Existing accent conversion models tend to change the speaker identity and accent at the same time. Here, we use adversarial learning to disentangle accent-dependent features while retaining other acoustic characteristics. What sets our work apart from existing accent conversion models is the capability to convert an unseen speaker's utterance to multiple accents while preserving its original voice identity. Subjective evaluations show that our model generates audio that sounds closer to the target accent and like the original speaker.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2211.13282

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.42)

Learning ASR pathways: A sparse multilingual ASR model

Yang, Mu, Tjandra, Andros, Liu, Chunxi, Zhang, David, Le, Duc, Kalinli, Ozlem

arXiv.org Artificial Intelligence, Sep-28-2023

Neural network pruning compresses automatic speech recognition (ASR) models effectively. However, in multilingual ASR, language-agnostic pruning may lead to severe performance drops on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters. In this work, we present ASR pathways, a sparse multilingual ASR model that activates language-specific sub-networks ("pathways"), such that the parameters for each language are learned explicitly. With the overlapping sub-networks, the shared parameters can also enable knowledge transfer for lower-resource languages via joint multilingual training. We propose a novel algorithm to learn ASR pathways, and evaluate the proposed method on 4 languages with a streaming RNN-T model. Our proposed ASR pathways outperform both dense models and a language-agnostically pruned model, and provide better performance on low-resource languages compared to the monolingual sparse models.
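
The idea of language-specific sub-networks over shared parameters can be illustrated with binary masks. The sketch below is an assumption-laden toy, not the paper's algorithm: it shows only how a per-language mask selects a pathway from one shared weight vector and which positions overlap across languages. The weights, masks, and language codes are hypothetical, and how the masks are learned is not shown.

```python
def apply_mask(weights, mask):
    """Zero out the weights that a language's pathway does not use."""
    return [w * m for w, m in zip(weights, mask)]


def shared_positions(masks):
    """Indices that are active in every language's mask."""
    return [i for i, bits in enumerate(zip(*masks.values())) if all(bits)]


# One shared parameter vector and two hypothetical language masks.
shared_weights = [0.5, -1.2, 0.8, 0.3, -0.7, 1.1]
language_masks = {
    "en": [1, 1, 1, 0, 0, 1],
    "fr": [1, 0, 1, 1, 0, 1],
}

for language, mask in language_masks.items():
    print(language, apply_mask(shared_weights, mask))

# Parameters at these positions receive updates from both languages,
# which is where cross-lingual knowledge transfer can happen.
print("shared:", shared_positions(language_masks))
```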

artificial intelligence, language-specific mask, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2209.05735

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.59)

Scaling Speech Technology to 1,000+ Languages

Pratap, Vineel, Tjandra, Andros, Shi, Bowen, Tomasello, Paden, Babu, Arun, Kundu, Sayani, Elkahky, Ali, Ni, Zhaoheng, Vyas, Apoorv, Fazel-Zarandi, Maryam, Baevski, Alexei, Adi, Yossi, Zhang, Xiaohui, Hsu, Wei-Ning, Conneau, Alexis, Auli, Michael

arXiv.org Artificial Intelligence, May-22-2023

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
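
The "10-40x" figure can be checked against the counts quoted in the abstract. The short calculation below takes "about one hundred languages" as a baseline of exactly 100, which is an approximation, and divides each task's language count by it.

```python
# Baseline taken from the abstract's "about one hundred languages".
BASELINE_LANGUAGES = 100

# Per-task language counts as stated in the abstract.
MMS_COVERAGE = {
    "wav2vec 2.0 pre-training": 1406,
    "speech recognition": 1107,
    "speech synthesis": 1107,
    "language identification": 4017,
}


def coverage_multipliers(coverage, baseline):
    """Return how many times each task's language count exceeds the baseline."""
    return {task: count / baseline for task, count in coverage.items()}


multipliers = coverage_multipliers(MMS_COVERAGE, BASELINE_LANGUAGES)
for task, factor in multipliers.items():
    print(f"{task}: {factor:.1f}x")
```

The smallest and largest ratios, for speech recognition and language identification, land at roughly 11x and 40x, consistent with the stated range.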

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2305.13516

Country: Europe (0.67)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
