Murata, Naoki
Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
Tao, Zerui, Takida, Yuhta, Murata, Naoki, Zhao, Qibin, Mitsufuji, Yuki
Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and the desired fine-tuning weights prevents the simultaneous achievement of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank, dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible with the desired weight, thereby reducing the rank of the residual weight. The residual part can then be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform-plus-residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method achieves better performance and parameter efficiency than LoRA and several other baselines.
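For intuition, here is a minimal PyTorch sketch of the transform-plus-residual scheme described above: a learnable dense transform is applied to the frozen pre-trained weight, and a LoRA-style low-rank factorization models the residual. The paper parameterizes both parts with tensor decompositions for ultra-parameter-efficiency; the plain matrix, the rank-r factors, and all names below are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class TransformResidualLinear(nn.Module):
    """Hypothetical transform + residual adapter around a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.weight0 = base.weight.detach()               # frozen pre-trained weight
        self.transform = nn.Parameter(torch.eye(out_f))   # dense transform, init = identity
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))   # residual starts at zero

    def forward(self, x):
        # Align W0 with a full-rank transform, then add a low-rank residual.
        w = self.transform @ self.weight0 + self.B @ self.A
        return x @ w.T
```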
Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion
Dontas, Michail, He, Yutong, Murata, Naoki, Mitsufuji, Yuki, Kolter, J. Zico, Salakhutdinov, Ruslan
Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.
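As a rough illustration of what joint posterior sampling over an unknown image and operator can look like, the following sketch alternates a diffusion prior step on the image with gradient refinement of the operator parameters against the measurement. All callables (denoise_step, operator) and the update scheme are assumptions for exposition, not LADiBI's actual procedure.

```python
import torch

def blind_posterior_sampling(y, denoise_step, operator, theta, steps=50, lr=1e-2):
    """Toy alternating scheme: prior step on x, gradient steps on theta and x."""
    x = torch.randn_like(y)                          # start from noise
    theta = theta.clone().requires_grad_(True)
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                       # prior step from the diffusion model
        # refine the operator estimate under the current image
        loss = ((operator(x.detach(), theta) - y) ** 2).mean()
        grad, = torch.autograd.grad(loss, theta)
        theta = (theta - lr * grad).detach().requires_grad_(True)
        # data-consistency step on the image under the current operator estimate
        x = x.detach().requires_grad_(True)
        loss_x = ((operator(x, theta.detach()) - y) ** 2).mean()
        gx, = torch.autograd.grad(loss_x, x)
        x = (x - lr * gx).detach()
    return x, theta.detach()
```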
Mitigating Embedding Collapse in Diffusion Models for Categorical Data
Nguyen, Bac, Lai, Chieh-Hsin, Takida, Yuhta, Murata, Naoki, Uesaka, Toshimitsu, Ermon, Stefano, Mitsufuji, Yuki
Latent diffusion models have enabled continuous-state diffusion models to handle a variety of datasets, including categorical data. However, most methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training. We propose a novel objective combining the joint embedding-diffusion variational lower bound with a Consistency-Matching (CM) regularizer, alongside a shifted cosine noise schedule and a random dropping strategy. The CM regularizer ensures the recovery of the true data distribution. Experiments on benchmarks show that CATDM mitigates embedding collapse, yielding superior results on FFHQ, LSUN Churches, and LSUN Bedrooms. In particular, CATDM achieves an FID of 6.81 on ImageNet 256×256 with 50 steps. It outperforms non-autoregressive models in machine translation and is on a par with previous methods in text generation.
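A hedged sketch of the joint embedding-diffusion training step described above: a score-matching term on noised embeddings plus a reconstruction term on the categorical data. A simple embedding-spreading penalty stands in for the paper's Consistency-Matching regularizer, which is defined over the diffusion trajectory and is not reproduced here; all shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_step(tokens, embedding, score_net, decoder, t, lam=0.1):
    """tokens: (B, L) long; embedding: nn.Embedding; t: scalar tensor in (0, 1)."""
    z0 = embedding(tokens)                            # embed categorical data, (B, L, D)
    noise = torch.randn_like(z0)
    alpha = torch.cos(t * torch.pi / 2)               # cosine-style noise schedule
    sigma = torch.sin(t * torch.pi / 2)
    zt = alpha * z0 + sigma * noise
    score_loss = F.mse_loss(score_net(zt, t), noise)  # diffusion (score-matching) term
    recon_loss = F.cross_entropy(decoder(z0).transpose(1, 2), tokens)  # reconstruction term
    # illustrative anti-collapse penalty: keep embedding rows spread apart
    w = F.normalize(embedding.weight, dim=-1)
    reg = (w @ w.T - torch.eye(w.size(0), device=w.device)).pow(2).mean()
    return score_loss + recon_loss + lam * reg
```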
G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving
Murata, Naoki, Lai, Chieh-Hsin, Takida, Yuhta, Uesaka, Toshimitsu, Nguyen, Bac, Ermon, Stefano, Mitsufuji, Yuki
Recent literature has effectively leveraged diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited for discrete compressed representations, such as image and motion generation. However, their discrete and non-differentiable nature has limited their application to inverse problems formulated in continuous spaces. This paper presents a novel method for addressing linear inverse problems by leveraging image-generation models based on discrete diffusion as priors. We overcome these limitations by approximating the true posterior distribution with a variational distribution constructed from categorical distributions and continuous relaxation techniques. Furthermore, we employ a star-shaped noise process to mitigate the drawbacks of traditional discrete diffusion models with absorbing states, demonstrating that our method performs comparably to continuous diffusion techniques. To the best of our knowledge, this is the first approach to use discrete diffusion model-based priors for solving image inverse problems.
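The core relaxation idea, that a variational categorical distribution over discrete codes can be optimized through a continuous measurement loss, can be sketched with a Gumbel-Softmax as below. The function names, shapes, and optimizer choice are illustrative assumptions, not G2D2's interface.

```python
import torch
import torch.nn.functional as F

def guide_discrete_codes(logits, codebook, decoder, A, y, steps=100, lr=0.1, tau=1.0):
    """logits: (L, K) variational parameters; codebook: (K, D); A: linear operator."""
    logits = logits.clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = F.gumbel_softmax(logits, tau=tau, hard=False)  # relaxed one-hot samples
        z = probs @ codebook                          # soft codebook lookup, (L, D)
        x = decoder(z)                                # decode to image space
        loss = ((A(x) - y) ** 2).mean()               # measurement consistency
        opt.zero_grad()
        loss.backward()                               # gradients flow through the relaxation
        opt.step()
    return logits.detach()
```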
Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
Hiranaka, Ayano, Chen, Shang-Fu, Lai, Chieh-Hsin, Kim, Dongjun, Murata, Naoki, Shibuya, Takashi, Liao, Wei-Hsiang, Sun, Shao-Hua, Mitsufuji, Yuki
Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD's refined initialization samples, enabling faster convergence towards the evaluator's intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.
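As a toy illustration of the feedback-aligned reward idea, one can fit a small reward head on image features using binary human labels collected online, then use its score as a fine-tuning signal. The sketch below shows only that fitting step, with all shapes and names assumed; it is not HERO's actual mechanism.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Lightweight reward model over precomputed image features."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats):
        return self.net(feats).squeeze(-1)

def fit_reward(head, feats, labels, epochs=20, lr=1e-3):
    """feats: (N, dim) features; labels: (N,) binary human feedback (1 = good)."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = bce(head(feats), labels.float())   # align scores with human feedback
        loss.backward()
        opt.step()
    return head
```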
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
Zhang, Yixiao, Ikemiya, Yukara, Choi, Woosung, Murata, Naoki, Martínez-Ramírez, Marco A., Lin, Liwei, Xia, Gus, Liao, Wei-Hsiang, Mitsufuji, Yuki, Dixon, Simon
Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths of both approaches and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach modifies the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen introduces only 8% new parameters to the original MusicGen model and trains for only 5K steps, yet it achieves superior performance across all tasks compared to existing baselines and performs comparably to models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.
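The audio fusion module is described only at a high level; as an illustration of the general adapter pattern it suggests, the sketch below injects conditioning audio tokens into a frozen decoder's hidden states through a small, zero-initialized gated cross-attention block. This is an assumption-laden illustration, not Instruct-MusicGen's implementation.

```python
import torch
import torch.nn as nn

class AudioFusion(nn.Module):
    """Gated cross-attention adapter that fuses audio tokens into hidden states."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero init: starts as a no-op

    def forward(self, hidden, audio_tokens):
        # hidden: (B, T, dim) decoder states; audio_tokens: (B, S, dim) conditioning
        fused, _ = self.attn(hidden, audio_tokens, audio_tokens)
        return hidden + torch.tanh(self.gate) * fused  # gated residual injection
```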
PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Kim, Dongjun, Lai, Chieh-Hsin, Liao, Wei-Hsiang, Takida, Yuhta, Murata, Naoki, Uesaka, Toshimitsu, Mitsufuji, Yuki, Ermon, Stefano
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique that progressively grows the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data into a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. Because the decoder grows progressively, PaGoDA avoids re-training the teacher or student models whenever the student is upsampled, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution and generate 512x512 samples, achieving 2x faster inference than single-step distilled Stable Diffusion models such as LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
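The deterministic data-to-noise encoding can be pictured as integrating the probability-flow ODE forward in time from a down-sampled image. The sketch below uses a simple Euler scheme and folds the diffusion coefficient into a placeholder score_fn; the actual drift depends on the teacher DM's SDE parameterization, so this is a conceptual assumption rather than PaGoDA's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_pf_ode(x_hi, score_fn, steps=100, low_res=64):
    """Down-sample to the teacher's resolution, then integrate the PF-ODE forward."""
    x = F.interpolate(x_hi, size=(low_res, low_res), mode="bilinear", align_corners=False)
    ts = torch.linspace(1e-3, 1.0, steps)
    for i in range(steps - 1):
        t, dt = ts[i], ts[i + 1] - ts[i]
        # simplified probability-flow drift: dx/dt = -0.5 * score(x, t),
        # with the squared diffusion coefficient absorbed into score_fn
        x = x - 0.5 * dt * score_fn(x, t)            # Euler step, data -> noise
    return x                                         # structured latent for the decoder
```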
Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information
Uesaka, Toshimitsu, Suzuki, Taiji, Takida, Yuhta, Lai, Chieh-Hsin, Murata, Naoki, Mitsufuji, Yuki
Multimodal representation learning to integrate different modalities, such as text, vision, and audio, is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of pointwise mutual information and show that encoders achieving the optimal similarity in pretraining provide a good representation for downstream classification tasks under mild assumptions. Based on our theoretical results, we also propose a new similarity metric for multimodal contrastive learning that utilizes a nonlinear kernel to enrich the model's capability. To verify the effectiveness of the proposed method, we pretrain multimodal representation models on the Conceptual Captions dataset and evaluate zero-shot classification and linear classification on common benchmark datasets.
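For reference, the symmetric InfoNCE loss analyzed in the paper is commonly implemented as the average of image-to-text and text-to-image cross-entropy terms over a batch of paired embeddings, as below. The paper's proposed nonlinear-kernel similarity would replace the plain scaled dot product in this standard form.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP-style symmetric InfoNCE over a batch of paired embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature            # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)   # match each image to its caption
    loss_t2i = F.cross_entropy(logits.T, targets) # and each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)
```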
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
He, Yutong, Robey, Alexander, Murata, Naoki, Jiang, Yiding, Williams, Joshua, Pappas, George J., Hassani, Hamed, Mitsufuji, Yuki, Salakhutdinov, Ruslan, Kolter, J. Zico
Prompt engineering is effective for controlling the output of text-to-image (T2I) generative models, but it is also laborious due to the need for manually crafted prompts. This challenge has spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, and produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically identifies human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the distribution of candidate prompts for given reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
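The iterative black-box refinement loop can be summarized as: propose prompts with an LLM conditioned on past (prompt, score) examples, render them with the T2I model, score against the reference, and keep the best candidates in context. The skeleton below assumes placeholder callables (llm_propose, t2i_generate, similarity) and is a sketch of the general pattern, not PRISM's exact algorithm.

```python
def prompt_search(reference_image, llm_propose, t2i_generate, similarity,
                  iterations=10, n_candidates=4):
    """Black-box prompt search: propose, render, score, and refine in context."""
    history = []                                   # (prompt, score) in-context examples
    best_prompt, best_score = None, float("-inf")
    for _ in range(iterations):
        for prompt in llm_propose(reference_image, history, n=n_candidates):
            score = similarity(t2i_generate(prompt), reference_image)
            history.append((prompt, score))
            if score > best_score:
                best_prompt, best_score = prompt, score
        history = sorted(history, key=lambda p: -p[1])[:8]  # keep top examples
    return best_prompt
```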
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
Zhang, Yixiao, Ikemiya, Yukara, Xia, Gus, Murata, Naoki, Martínez-Ramírez, Marco A., Liao, Wei-Hsiang, Mitsufuji, Yuki, Dixon, Simon
Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to editing music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while keeping other aspects unchanged. Our method recasts text-based editing as latent space manipulation, with an extra constraint added to enforce consistency. It integrates seamlessly with existing pretrained text-to-music diffusion models without requiring additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. Additionally, we showcase the practical applicability of our approach in real-world music editing scenarios.
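One simple way to realize this kind of zero-shot attribute edit is to steer the text-conditioning embedding along a direction between the source and target attribute words, as sketched below assuming a Hugging Face-style tokenizer/encoder pair. The pooling and scaling choices are illustrative assumptions; the paper additionally enforces consistency constraints during sampling that are not shown here.

```python
import torch

@torch.no_grad()
def edited_prompt_embedding(text_encoder, tokenizer, prompt, src_word, tgt_word, alpha=1.0):
    """Steer the conditioning embedding from a source to a target attribute word."""
    def embed(s):
        tokens = tokenizer(s, return_tensors="pt")
        return text_encoder(**tokens).last_hidden_state.mean(dim=1)  # pooled embedding
    direction = embed(tgt_word) - embed(src_word)   # e.g. "piano" -> "violin"
    return embed(prompt) + alpha * direction        # shifted conditioning embedding
```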