Goto

Collaborating Authors

 Takida, Yuhta


Training Consistency Models with Variational Noise Coupling

arXiv.org Artificial Intelligence

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining FID on par with SoTA on ImageNet at $64 \times 64$ resolution in 2-step generation. Our code is available at https://github.com/sony/vct .
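
A minimal PyTorch-style sketch of the idea described above, assuming a simple MLP encoder and a linear flow-matching interpolant: a data-dependent encoder emits the noise endpoint, and a consistency loss compares the model at two adjacent times along that trajectory. All names (`NoiseEncoder`, `ct_step`, the time discretization, the KL weight) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseEncoder(nn.Module):
    """Data-dependent noise emission model q(z | x), VAE-style (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.SiLU(), nn.Linear(256, 2 * dim))

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl

def ct_step(f, encoder, x, dt=0.01, lam_kl=1e-3):
    """One illustrative training step: consistency loss along a flow-matching
    interpolant whose noise endpoint is coupled to the data through the encoder."""
    z, kl = encoder(x)                              # learned, data-dependent noise
    t = torch.rand(x.shape[0], 1, device=x.device)
    s = (t - dt).clamp(min=0.0)                     # adjacent earlier time
    x_t = (1 - t) * x + t * z                       # linear (flow-matching) path
    x_s = (1 - s) * x + s * z
    with torch.no_grad():
        target = f(x_s, s)                          # stop-gradient target
    return F.mse_loss(f(x_t, t), target) + lam_kl * kl
```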


Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

arXiv.org Artificial Intelligence

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and the desired fine-tuning weights prevents achieving ultra-parameter-efficiency and better performance simultaneously. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank, dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible with the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform-plus-residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method achieves better performance and parameter efficiency compared to LoRA and several baselines.
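
A hedged sketch of the transform-plus-residual scheme for a single frozen linear layer: a dense, learnable transform (initialized to the identity) is applied to the pre-trained weight, and a low-rank residual (initialized to zero) is added on top. The tensor decompositions that the paper uses to compress both parts are omitted here; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformPlusResidualLinear(nn.Module):
    """Transform + residual adaptation of a frozen linear layer (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.register_buffer("w0", base.weight.detach())       # frozen pre-trained weight
        self.bias = base.bias
        self.T = nn.Parameter(torch.eye(in_f))                  # full-rank, dense transform
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # low-rank residual factors
        self.B = nn.Parameter(torch.zeros(out_f, rank))         # zero-init: starts at T(W0)

    def forward(self, x):
        w = self.w0 @ self.T + self.B @ self.A                  # aligned weight + residual
        return F.linear(x, w, self.bias)
```

In the paper, both the transform and the residual are further compressed with tensor decompositions to reach ultra-parameter-efficiency; the sketch keeps them dense for readability.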


TraSCE: Trajectory Steering for Concept Erasure

arXiv.org Artificial Intelligence

Recent advancements in text-to-image diffusion models have brought them into the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, conventional negative prompting is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose a modification of conventional negative prompting. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks for removing harmful content, including ones proposed by red teams, as well as for erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (either images or prompts), making it easier for model owners to erase new concepts.
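
For reference, a minimal sketch of the two ingredients the abstract mentions: conventional negative prompting inside classifier-free guidance, and a generic loss-based steering step on the latent. `concept_loss` is an assumed callable, and neither function is the exact TraSCE formulation.

```python
import torch

def negative_prompt_guidance(eps_cond, eps_neg, scale=7.5):
    """Conventional negative prompting: the negative-prompt prediction takes the
    place of the unconditional one in classifier-free guidance. (Background formula
    only; TraSCE modifies this and adds localized loss-based guidance on top.)"""
    return eps_neg + scale * (eps_cond - eps_neg)

def loss_steered_latent(x_t, concept_loss, step_size=0.1):
    """Illustrative loss-based trajectory steering: nudge the latent in the direction
    that moves it away from the erased concept. `concept_loss` is an assumed callable
    measuring proximity of the current latent to the unwanted concept."""
    x_t = x_t.detach().requires_grad_(True)
    loss = concept_loss(x_t)
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - step_size * grad).detach()
```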


Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

arXiv.org Artificial Intelligence

Diffusion models are prone to exactly reproducing images from their training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin: classifier-free guidance is withheld until an ideal transition point, after which it is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, \emph{opposite guidance}, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
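
A minimal sketch of the mitigation, assuming a standard epsilon-prediction denoising loop: classifier-free guidance is applied only after a transition step, while earlier steps use opposite guidance (a negative guidance scale) to escape the attraction basin sooner. The transition step and scales are illustrative hyperparameters, not values from the paper.

```python
def cfg_with_transition(eps_cond, eps_uncond, step, transition_step,
                        scale=7.5, opposite_scale=-1.0):
    """Combine conditional and unconditional noise predictions for one denoising step.
    Before the transition point, a negative scale ('opposite guidance') pushes the
    trajectory away from the attraction basin; afterwards, standard CFG is applied."""
    s = opposite_scale if step < transition_step else scale
    return eps_uncond + s * (eps_cond - eps_uncond)
```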


Music Foundation Model as Generic Booster for Music Downstream Tasks

arXiv.org Artificial Intelligence

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

Figure 1 (caption): SoniDo extracts hierarchical features of target music samples, which are useful for solving music downstream tasks including understanding and generative tasks.
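
A hedged sketch of the general recipe of reusing intermediate features from a frozen foundation model for downstream heads, implemented here with forward hooks; the layer names and model interface are assumptions, not the actual SoniDo API.

```python
import torch
import torch.nn as nn

def extract_hierarchical_features(model: nn.Module, audio: torch.Tensor,
                                  layer_names: list[str]) -> dict[str, torch.Tensor]:
    """Collect intermediate activations (hierarchical features) from named sub-modules
    of a frozen foundation model; a lightweight downstream head can then be trained on
    these features for tagging, transcription, separation, etc."""
    feats, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(
            lambda _m, _inp, out, key=name: feats.__setitem__(key, out.detach())))
    with torch.no_grad():
        model(audio)                 # one forward pass fills the feature dict
    for h in handles:
        h.remove()
    return feats
```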


Mitigating Embedding Collapse in Diffusion Models for Categorical Data

arXiv.org Artificial Intelligence

Latent diffusion models have enabled continuous-state diffusion models to handle a variety of datasets, including categorical data. However, most methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via a reconstruction loss) and the latent diffusion model (via a score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training. We propose a novel objective combining the joint embedding-diffusion variational lower bound with a Consistency-Matching (CM) regularizer, alongside a shifted cosine noise schedule and a random dropping strategy. The CM regularizer ensures the recovery of the true data distribution. Experiments on benchmarks show that CATDM mitigates embedding collapse, yielding superior results on FFHQ, LSUN Churches, and LSUN Bedrooms. In particular, CATDM achieves an FID of 6.81 on ImageNet $256 \times 256$ with 50 steps. It outperforms non-autoregressive models in machine translation and is on a par with previous methods in text generation. Diffusion models are probabilistic models that learn the inverse of a Markov chain that gradually converts data into pure Gaussian noise, using noise-conditioned score functions (i.e., gradients of the log density), which are defined only for continuous data. The core concept is to progressively recover the original data distribution using a learned transition kernel. They offer stable and relatively efficient training procedures that contribute to their success. Recent advances, such as consistency models (Song et al., 2023; Kim et al., 2023; Luo et al., 2023), have further enhanced diffusion models by reducing the number of sampling steps, making them more practical for real-world applications.
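
A hedged sketch of what joint embedding-diffusion training in an embedding space can look like: a denoising loss on noised token embeddings, a token-reconstruction cross-entropy term that discourages embedding collapse, and a consistency-matching-style term that ties the denoiser's predictions at two noise levels. Module names, the noise interpolation, and the loss weighting are illustrative assumptions, not the CATDM objective as published.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_embedding_diffusion_loss(tokens, embed: nn.Embedding, denoiser,
                                   readout: nn.Linear, lam_cm=1.0):
    """Illustrative joint objective for categorical data diffused in embedding space."""
    x0 = embed(tokens)                                   # (B, L, D) clean embeddings
    t1 = torch.rand(x0.shape[0], 1, 1, device=x0.device)
    t2 = torch.rand_like(t1) * t1                        # a second, smaller noise level
    noise = torch.randn_like(x0)
    xt1 = (1 - t1) * x0 + t1 * noise
    xt2 = (1 - t2) * x0 + t2 * noise
    pred1 = denoiser(xt1, t1)                            # predicts the clean embedding
    pred2 = denoiser(xt2, t2)
    diff_loss = F.mse_loss(pred1, x0)                                    # denoising term
    rec_loss = F.cross_entropy(readout(pred1).transpose(1, 2), tokens)   # anti-collapse
    cm_loss = F.mse_loss(pred1, pred2.detach())          # consistency across noise levels
    return diff_loss + rec_loss + lam_cm * cm_loss
```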


VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression

arXiv.org Artificial Intelligence

Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame. Furthermore, we propose a gradient estimation method for the non-differentiable masking operation that maps the importance map to a binary importance mask, improving model training via a straight-through estimator. We demonstrate that the proposed training framework achieves superior results compared to the baseline method and shows further improvement when applied to the current state-of-the-art codec.
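
A minimal sketch of the masking step with a straight-through estimator, assuming a per-frame importance map in [0, 1] that is expanded into a binary keep/drop mask over codebook levels; the exact thresholding rule and tensor shapes are illustrative, not the paper's formulation.

```python
import torch

def binary_mask_ste(importance: torch.Tensor, n_codebooks: int) -> torch.Tensor:
    """importance: (B, T) per-frame scores in [0, 1]. Returns a (B, T, K) binary mask
    selecting how many codebook levels to keep per frame; the hard threshold is
    non-differentiable, so gradients are routed through a soft surrogate
    (straight-through estimator)."""
    levels = torch.arange(1, n_codebooks + 1, device=importance.device).view(1, 1, -1)
    soft = torch.sigmoid(importance.unsqueeze(-1) * n_codebooks - levels + 0.5)  # (B, T, K)
    hard = (soft > 0.5).float()                      # binary keep/drop decision
    return hard + soft - soft.detach()               # forward: hard; backward: soft grads
```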


Distillation of Discrete Diffusion through Dimensional Correlations

arXiv.org Machine Learning

Diffusion models have demonstrated exceptional performance in various fields of generative modeling. While they often outperform competitors, including VAEs and GANs, in sample quality and diversity, they suffer from slow sampling speed due to their iterative nature. Recently, distillation techniques and consistency models have mitigated this issue in continuous domains, but discrete diffusion models face specific challenges on the path towards faster generation. Most notably, in the current literature, correlations between different dimensions (pixels, locations) are ignored, both in the modeling and in the loss functions, due to computational limitations. In this paper, we propose "mixture" models in discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: first, that dimensionally independent models can well approximate the data distribution if they are allowed to conduct many sampling steps, and second, that our loss functions enable mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. We empirically demonstrate that our proposed method for discrete diffusion works in practice, by distilling a continuous-time discrete diffusion model pretrained on the CIFAR-10 dataset.
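
A small illustration of why a mixture helps: each mixture component factorizes over dimensions, yet marginalizing over the component index induces correlations across dimensions that a single factorized model cannot express. Shapes and names are illustrative; this is not the paper's model or its distillation loss.

```python
import torch

def sample_mixture(logits_w: torch.Tensor, logits_x: torch.Tensor, n: int) -> torch.Tensor:
    """Sample from a mixture of dimensionally independent categorical distributions.
    logits_w: (K,) mixture-component logits.
    logits_x: (K, D, V) per-component, per-dimension categorical logits.
    Returns n samples of shape (n, D); correlations between dimensions arise from
    the shared component index even though each component factorizes over D."""
    k = torch.distributions.Categorical(logits=logits_w).sample((n,))   # (n,)
    per_dim = torch.distributions.Categorical(logits=logits_x[k])       # batch (n, D)
    return per_dim.sample()                                             # (n, D)
```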


$\textit{Jump Your Steps}$: Optimizing Sampling Schedule of Discrete Diffusion Models

arXiv.org Artificial Intelligence

Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. While parallel sampling methods like τ-leaping accelerate the otherwise slow iterative sampling process, they introduce Compounding Decoding Error (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. To address this, we propose Jump Your Steps (JYS), which optimizes the sampling schedule to reduce CDE. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling. Diffusion models (Sohl-Dickstein et al., 2015; Song et al., 2021b; Ho et al., 2020; Song et al., 2021a; Karras et al., 2022) have achieved remarkable success in generation tasks within the continuous domain. However, certain modalities, such as text and music, inherently possess discrete features. Nevertheless, like their continuous counterparts, DDMs encounter a significant bottleneck in sampling speed due to their progressive refinement process. In contrast to continuous-domain diffusion models, where sampling dynamics are driven by sample-wise differential equations (Song et al., 2021b), allowing for the direct application of well-established numerical methods to accelerate generation, enhancing speed in DDMs poses a significant challenge. To address this, researchers have proposed fast and efficient samplers, including notable methods such as the τ-leaping (Campbell et al., 2022; Lezama et al., 2022; Sun et al., 2023) and k-Gillespie algorithms (Zhao et al., 2024), which facilitate parallel sampling of multiple tokens in a single step. However, this parallel but independent sampling introduces Compounding Decoding Error (CDE) (Lezama et al., 2022), which arises from a mismatch between the training and inference distributions of intermediate latents during parallel sampling. Specifically, while each token is generated according to its marginal distribution, the joint distribution deviates from the learned distribution. To mitigate this issue, the predictor-corrector (PC) sampler (Campbell et al., 2022) has been proposed.
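
A hedged sketch of the schedule-search step only: brute-force enumeration stands in for the paper's efficient search algorithm, and the assumed callable `bound` stands in for its practical upper bound on CDE.

```python
import itertools

def search_schedule(candidate_times, n_inner, bound):
    """Among candidate intermediate times in (0, 1), pick the n_inner-point subset whose
    full schedule minimizes a surrogate CDE upper bound. `bound(schedule)` is an assumed
    callable; exhaustive search is shown only for clarity."""
    best, best_val = None, float("inf")
    for combo in itertools.combinations(sorted(candidate_times), n_inner):
        schedule = (0.0,) + combo + (1.0,)        # endpoints plus chosen interior times
        val = bound(schedule)
        if val < best_val:
            best, best_val = schedule, val
    return best
```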


G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving

arXiv.org Artificial Intelligence

Recent literature has effectively leveraged diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited to discrete compressed representations, such as image and motion generation. However, their discrete and non-differentiable nature has limited their application to inverse problems formulated in continuous spaces. This paper presents a novel method for addressing linear inverse problems by leveraging image-generation models based on discrete diffusion as priors. We overcome these limitations by approximating the true posterior distribution with a variational distribution constructed from categorical distributions and continuous relaxation techniques. Furthermore, we employ a star-shaped noise process to mitigate the drawbacks of traditional discrete diffusion models with absorbing states, demonstrating that our method performs comparably to continuous diffusion techniques. To the best of our knowledge, this is the first approach to use discrete diffusion model-based priors for solving image inverse problems. Diffusion models operate by iteratively corrupting data and then learning to reverse this corruption process, ultimately generating high-quality samples from noise. In parallel with continuous diffusion models, discrete diffusion models have emerged as a compelling alternative. Building on these advancements, researchers have made significant progress in expanding the application of diffusion models.
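
A hedged sketch of the continuous-relaxation idea only: the logits of a variational categorical distribution over discrete latent codes are optimized by gradient descent through a Gumbel-Softmax relaxation against the measurement. The decoder, codebook, and shapes are assumptions for illustration; the actual method additionally uses a star-shaped noise process and a discrete-diffusion prior, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def optimize_variational_codes(y, forward_op, decoder, codebook,
                               seq_len, n_iters=200, tau=1.0, lr=0.1):
    """y: observed measurement; forward_op: known linear operator (image -> measurement);
    decoder: maps code embeddings to an image; codebook: (V, D) embedding matrix.
    Optimizes relaxed categorical logits so the decoded image explains the measurement."""
    V = codebook.shape[0]
    logits = torch.zeros(seq_len, V, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(n_iters):
        probs = F.gumbel_softmax(logits, tau=tau, hard=False)   # (L, V) relaxed one-hots
        z = probs @ codebook                                    # soft code embeddings (L, D)
        x = decoder(z)                                          # decoded image
        loss = F.mse_loss(forward_op(x), y)                     # data-fidelity term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return logits.argmax(-1)                                    # hard codes after optimization
```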