McAuley, Julian
Presto! Distilling Steps and Layers for Accelerating Music Generation
Novack, Zachary, Zhu, Ge, Casebeer, Jonah, McAuley, Julian, Berg-Kirkpatrick, Taylor, Bryan, Nicholas J.
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32-second mono/stereo 44.1kHz audio, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge.

We have seen a renaissance of audio-domain generative media (Chen et al., 2024; Agostinelli et al., 2023; Liu et al., 2023; Copet et al., 2023), with increasing capabilities for both Text-to-Audio (TTA) and Text-to-Music (TTM) generation. This work has been driven in part by audio-domain diffusion models (Song et al., 2020; Ho et al., 2020; Song et al., 2021), enabling considerably better audio modeling than generative adversarial network (GAN) or variational autoencoder (VAE) methods (Dhariwal & Nichol, 2021). Diffusion models, however, suffer from long inference times due to their iterative denoising process, which requires a substantial number of function evaluations (NFE) during inference.
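To make the step-distillation idea above concrete, here is a minimal sketch of a generic distribution matching distillation (DMD) style update for a one-step generator, where the learning signal is the gap between a frozen teacher denoiser and a critic denoiser trained on the student's own outputs. The functions `generator`, `teacher_denoise`, `critic_denoise`, `add_noise`, and `sample_t` are hypothetical placeholders, not the Presto! implementation.

```python
import torch
import torch.nn.functional as F

def dmd_style_generator_loss(generator, teacher_denoise, critic_denoise,
                             z, cond, add_noise, sample_t):
    """Sketch of a DMD-style generator update (all model/helper names are hypothetical)."""
    x = generator(z, cond)                        # one-step sample from the student
    t = sample_t(x.size(0), device=x.device)      # random diffusion time
    x_t = add_noise(x, t)                         # re-noise the generated sample
    with torch.no_grad():
        pred_real = teacher_denoise(x_t, t, cond)  # frozen pre-trained teacher
        pred_fake = critic_denoise(x_t, t, cond)   # critic fit to the generator's distribution
        grad = pred_fake - pred_real               # distribution-matching direction
    # Surrogate loss whose gradient w.r.t. x equals `grad`
    return 0.5 * F.mse_loss(x, (x - grad).detach(), reduction="sum") / x.size(0)
```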
CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation
Wu, Junda, Li, Warren, Novack, Zachary, Namburi, Amit, Chen, Carol, McAuley, Julian
Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent performance improvement in retrieval accuracy compared with baselines. We also show the pretrained CoLLAP models can be transferred to various music information retrieval tasks, with heterogeneous long-form multimodal contexts.
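As a rough illustration of the clip-level fusion described above, the sketch below attention-weights per-clip text-audio similarities into a single song-level score that could feed a contrastive loss. Shapes, the `temperature` value, and the function name are assumptions for illustration, not the CoLLAP architecture itself.

```python
import torch
import torch.nn.functional as F

def fused_alignment_score(text_emb, clip_embs, temperature=0.07):
    """Hypothetical sketch: attention-weighted fusion of clip-level similarities.

    text_emb:  (d,) embedding of the long-form caption
    clip_embs: (n_clips, d) embeddings of the song's segments
    """
    text_emb = F.normalize(text_emb, dim=-1)
    clip_embs = F.normalize(clip_embs, dim=-1)
    sims = clip_embs @ text_emb                      # per-clip text-audio similarity
    weights = torch.softmax(sims / temperature, 0)   # attend to the most relevant clips
    return (weights * sims).sum()                    # fused song-level alignment score
```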
Self-Updatable Large Language Models with Parameter Integration
Wang, Yu, Liu, Xinshuang, Chen, Xiusi, O'Brien, Sean, Wu, Junda, McAuley, Julian
Despite significant advancements in large language models (LLMs), the rapid and frequent integration of small-scale experiences, such as interactions with surrounding objects, remains a substantial challenge. Two critical factors in assimilating these experiences are (1) Efficacy: the ability to accurately remember recent events; (2) Retention: the capacity to recall long-past experiences. Current methods either embed experiences within model parameters using continual learning, model editing, or knowledge distillation techniques, which often struggle with rapid updates and complex interactions, or rely on external storage to achieve long-term retention, thereby increasing storage requirements. In this paper, we propose SELF-PARAM (Self-Updatable Large Language Models with Parameter Integration). SELF-PARAM requires no extra parameters while ensuring near-optimal efficacy and long-term retention. Our method employs a training objective that minimizes the Kullback-Leibler (KL) divergence between the predictions of an original model (with access to contextual information) and a target model (without such access). By generating diverse question-answer pairs related to the knowledge and minimizing the KL divergence across this dataset, we update the target model to internalize the knowledge seamlessly within its parameters. Evaluations on question-answering and conversational recommendation tasks demonstrate that SELF-PARAM significantly outperforms existing methods, even when accounting for non-zero storage requirements. This advancement paves the way for more efficient and scalable integration of experiences in large language models by embedding knowledge directly into model parameters.
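The KL-matching objective described above can be illustrated with a short sketch: a frozen copy of the model answers generated question-answer pairs while seeing the contextual experience, and the target model is trained to match those predictions without the context. The HuggingFace-style `.logits` interface and the function name are assumptions, not the SELF-PARAM code.

```python
import torch
import torch.nn.functional as F

def context_distillation_kl(teacher, student, context_ids, qa_ids):
    """Hypothetical sketch: match the context-free student to the context-aware teacher.

    context_ids: (B, Lc) tokens of the experience/knowledge (teacher input only)
    qa_ids:      (B, Lq) tokens of a generated question-answer pair (both models)
    """
    with torch.no_grad():
        full = torch.cat([context_ids, qa_ids], dim=1)
        t_logits = teacher(full).logits[:, -qa_ids.size(1):]   # teacher sees the context
    s_logits = student(qa_ids).logits                          # student must answer without it
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```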
Federated Large Language Models: Current Progress and Future Directions
Yao, Yuhang, Zhang, Jianyi, Wu, Junda, Huang, Chengkai, Xia, Yu, Yu, Tong, Zhang, Ruiyi, Kim, Sungchul, Rossi, Ryan, Li, Ang, Yao, Lina, McAuley, Julian, Chen, Yiran, Joe-Wong, Carlee
Large language models (LLMs) are rapidly gaining popularity and have been widely adopted in real-world applications. While the quality of training data is essential, privacy concerns arise during data collection. Federated learning (FL) offers a solution by allowing multiple clients to collaboratively train LLMs without sharing local data. However, FL introduces new challenges, such as model convergence issues due to heterogeneous data and high communication costs. A comprehensive study is required to address these challenges and guide future research. This paper surveys federated learning for LLMs (FedLLM), highlighting recent advances and future directions. We focus on two key aspects: fine-tuning and prompt learning in a federated setting, discussing existing work and associated research challenges. We finally propose potential research directions for federated LLMs, including pre-training and how LLMs can further enhance federated learning.
Towards LifeSpan Cognitive Systems
Wang, Yu, Han, Chi, Wu, Tongtong, He, Xiaoxin, Zhou, Wangchunshu, Sadeq, Nafis, Chen, Xiusi, He, Zexue, Wang, Wei, Haffari, Gholamreza, Ji, Heng, McAuley, Julian
Building a human-like system that continuously interacts with complex environments -- whether simulated digital worlds or human society -- presents several key challenges. Central to this is enabling continuous, high-frequency interactions, where the interactions are termed experiences. We refer to this envisioned system as the LifeSpan Cognitive System (LSCS). A critical feature of LSCS is its ability to engage in incremental and rapid updates while retaining and accurately recalling past experiences. We identify two major challenges in achieving this: (1) Abstraction and Experience Merging, and (2) Long-term Retention with Accurate Recall. These properties are essential for storing new experiences, organizing past experiences, and responding to the environment in ways that leverage relevant historical data. Unlike language models with continual learning, which typically rely on large corpora for fine-tuning and focus on improving performance within specific domains or tasks, LSCS must rapidly and incrementally update with new information from its environment at a high frequency. Existing technologies with the potential of solving the above two major challenges can be classified into four classes based on a conceptual metric called Storage Complexity, which measures the relative space required to store past experiences. Each of these four classes of technologies has its own strengths and limitations. Given that none of the existing technologies can achieve LSCS alone, we propose a novel paradigm for LSCS that integrates all four classes of technologies. The new paradigm operates through two core processes: Absorbing Experiences and Generating Responses.
Visual Prompting in Multimodal Large Language Models: A Survey
Wu, Junda, Zhang, Zhehao, Xia, Yu, Li, Xintong, Xia, Zhaoyang, Chang, Aaron, Yu, Tong, Kim, Sungchul, Rossi, Ryan A., Zhang, Ruiyi, Mitra, Subrata, Metaxas, Dimitris N., Yao, Lina, Shang, Jingbo, McAuley, Julian
Multimodal large language models (MLLMs) equip pre-trained large-language models (LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied, visual prompting has emerged for more fine-grained and free-form visual instructions. This paper presents the first comprehensive survey on visual prompting methods in MLLMs, focusing on visual prompting, prompt generation, compositional reasoning, and prompt learning. We categorize existing visual prompts and discuss generative methods for automatic prompt annotations on the images. We also examine visual prompting methods that enable better alignment between visual encoders and backbone LLMs, concerning MLLM's visual grounding, object referring, and compositional reasoning abilities. In addition, we provide a summary of model training and in-context learning methods to improve MLLM's perception and understanding of visual prompts. This paper examines visual prompting methods developed in MLLMs and provides a vision of the future of these methods.
A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys)
Deldjoo, Yashar, He, Zhankui, McAuley, Julian, Korikov, Anton, Sanner, Scott, Ramisa, Arnau, Vidal, René, Sathiamoorthy, Maheswaran, Kasirzadeh, Atoosa, Milano, Silvia
Traditional recommender systems (RS) typically use user-item rating histories as their main data source. However, deep generative models now have the capability to model and sample from complex data distributions, including user-item interactions, text, images, and videos, enabling novel recommendation tasks. This comprehensive, multidisciplinary survey connects key advancements in RS using Generative Models (Gen-RecSys), covering: interaction-driven generative models; the use of large language models (LLM) and textual data for natural language recommendation; and the integration of multimodal models for generating and processing images/videos in RS. Our work highlights necessary paradigms for evaluating the impact and harm of Gen-RecSys and identifies open challenges. This survey accompanies a tutorial presented at ACM KDD'24, with supporting materials provided at: https://encr.pw/vDhLq.
Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding
Liu, Guangyi, Wang, Yu, Feng, Zeyu, Wu, Qiyu, Tang, Liping, Gao, Yuan, Li, Zhen, Cui, Shuguang, McAuley, Julian, Yang, Zichao, Xing, Eric P., Hu, Zhiting
The vast applications of deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations -- across various data types, such as discrete text/protein sequences and continuous images. Existing model families, like variational autoencoders (VAEs), generative adversarial networks (GANs), autoregressive models, and (latent) diffusion models, generally excel in specific capabilities and data types but fall short in others. We introduce Generalized Encoding-Decoding Diffusion Probabilistic Models (EDDPMs) which integrate the core capabilities for broad applicability and enhanced performance. EDDPMs generalize the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding. Crucially, EDDPMs are compatible with the well-established diffusion model objective and training recipes, allowing effective learning of the encoder-decoder parameters jointly with diffusion. By choosing appropriate encoder/decoder (e.g., large language models), EDDPMs naturally apply to different data types. Extensive experiments on text, proteins, and images demonstrate the flexibility to handle diverse data and tasks and the strong improvement over various existing models.
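A minimal sketch of the training idea described above, under the assumption that the standard diffusion loss is applied to a learned encoding and paired with a reconstruction term for the decoder; `encoder`, `decoder`, `denoiser`, and `alpha_bar` are hypothetical components, not the EDDPM implementation.

```python
import torch
import torch.nn.functional as F

def eddpm_style_step(encoder, decoder, denoiser, alpha_bar, x, T=1000):
    """Hypothetical sketch: diffusion on a learned encoding plus decoder reconstruction."""
    z = encoder(x)                                    # parameterized encoding (identity in vanilla diffusion)
    t = torch.randint(0, T, (z.size(0),), device=z.device)
    noise = torch.randn_like(z)
    a = alpha_bar(t).view(-1, *[1] * (z.dim() - 1))   # cumulative noise schedule
    z_t = a.sqrt() * z + (1 - a).sqrt() * noise       # standard Gaussian noising of the latent
    diffusion_loss = F.mse_loss(denoiser(z_t, t), noise)
    recon_loss = F.mse_loss(decoder(z), x)            # decoder learns to invert the encoder
    return diffusion_loss + recon_loss
```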
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Novack, Zachary, McAuley, Julian, Berg-Kirkpatrick, Taylor, Bryan, Nicholas
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation over 10-20x, but simultaneously improves control adherence and generation quality all at once. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
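The three-stage recipe above can be sketched as a short loop: optimize the initial noise latents through a fast one-step distilled sampler as a surrogate, then decode the optimized latents with multi-step sampling. The sampler, loss, and argument names are hypothetical stand-ins, not the DITTO-2 code.

```python
import torch

def optimize_then_decode(one_step_sample, multi_step_sample, control_loss,
                         latent_shape, prompt, n_iters=50, lr=1e-2):
    """Hypothetical sketch: surrogate noise-latent optimization, then full decoding."""
    latent = torch.randn(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(n_iters):
        audio = one_step_sample(latent, prompt)   # fast, differentiable distilled surrogate
        loss = control_loss(audio)                # e.g., melody / intensity / CLAP objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Final high-quality generation from the optimized noise latents
    return multi_step_sample(latent.detach(), prompt)
```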
Large Scale Knowledge Washing
Wang, Yu, Wu, Ruihan, He, Zexue, Chen, Xiusi, McAuley, Julian
Large language models show impressive abilities in memorizing world knowledge, which leads to concerns regarding memorization of private information, toxic or sensitive knowledge, and copyrighted content. We introduce the problem of Large Scale Knowledge Washing, focusing on unlearning an extensive amount of factual knowledge. Previous unlearning methods usually define the reverse loss and update the model via backpropagation, which may affect the model's fluency and reasoning ability or even destroy the model due to extensive training with the reverse loss. Existing works introduce additional data from downstream tasks to prevent the model from losing capabilities, which requires downstream task awareness. Controlling the tradeoff of unlearning and maintaining existing capabilities is also challenging. To this end, we propose LAW (Large Scale Washing) to update the MLP layers in decoder-only large language models to perform knowledge washing, as inspired by model editing methods and based on the hypothesis that knowledge and reasoning are disentanglable. We derive a new objective with the knowledge to be unlearned to update the weights of certain MLP layers. Experimental results demonstrate the effectiveness of LAW in forgetting target knowledge while maintaining reasoning ability. The code will be open-sourced at https://github.com/wangyu-ustc/LargeScaleWashing.
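Since the abstract above points to model-editing methods as inspiration, the sketch below shows a generic editing-style closed-form update of an MLP projection: a ridge-regularized least-squares step that remaps "key" activations for the target facts to new values. This is only an illustration of that family of updates, not LAW's derived objective; `W`, `K`, and `V` are hypothetical inputs.

```python
import torch

def edit_mlp_projection(W, K, V, lam=1e-2):
    """Hypothetical model-editing-style update (not the paper's exact objective).

    W: (d_out, d_in) MLP projection weight
    K: (d_in, n)  key activations for the facts to be washed
    V: (d_out, n) target values those keys should now map to
    Solves min_dW ||(W + dW) K - V||^2 + lam ||dW||^2 in closed form.
    """
    residual = V - W @ K
    gram = K @ K.T + lam * torch.eye(K.size(0), device=K.device, dtype=K.dtype)
    delta = residual @ K.T @ torch.linalg.inv(gram)
    return W + delta
```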