Ishii, Masato
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Cheng, Ho Kei, Ishii, Masato, Hayakawa, Akio, Shibuya, Takashi, Schwing, Alexander, Mitsufuji, Yuki
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned, high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
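To make the flow matching objective mentioned above concrete, the following is a minimal, generic sketch of a conditional flow-matching training step in PyTorch. The `velocity_model` signature, latent shapes, and conditioning tensors are illustrative assumptions, not the actual MMAudio implementation.

```python
# Minimal sketch of a conditional flow-matching training step.
# Illustrative only: model, latent shapes, and conditions are placeholders.
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, audio_latent, video_cond, text_cond):
    """Regress the predicted velocity onto the straight path x1 - x0."""
    x1 = audio_latent                                # clean audio latents (B, T, D)
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)    # per-sample time in [0, 1]
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                     # linear interpolation path
    target = x1 - x0                                 # constant target velocity
    pred = velocity_model(xt, t, video_cond, text_cond)
    return F.mse_loss(pred, target)
```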
A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
Ishii, Masato, Hayakawa, Akio, Shibuya, Takashi, Mitsufuji, Yuki
In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first is timestep adjustment, which provides different timestep information to each base model; it is designed to align how samples evolve across timesteps between the two modalities. The second is a new design for the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represented temporal position information, and the embeddings are fed into the model like a positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and also demonstrate that our method outperforms existing methods.
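As an illustration of the CMC-PE idea described above, the sketch below projects per-frame features from one modality and adds them to the other modality's token sequence like an additive positional encoding, after resampling along the time axis. The module name, shapes, projection layer, and interpolation choice are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of CMC-PE-style conditioning: inject cross-modal
# features additively along the time axis, like a positional encoding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMCPositionalEncoding(nn.Module):
    def __init__(self, cond_dim, model_dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, model_dim)

    def forward(self, tokens, cond):
        # tokens: (B, T_tok, model_dim), e.g. audio tokens inside the audio model
        # cond:   (B, T_cond, cond_dim), e.g. per-frame video features
        cond = F.interpolate(
            cond.transpose(1, 2), size=tokens.shape[1],
            mode="linear", align_corners=False,
        ).transpose(1, 2)                    # resample to the token time axis
        return tokens + self.proj(cond)      # added like a positional encoding

# usage with dummy shapes
pe = CMCPositionalEncoding(cond_dim=512, model_dim=768)
audio_tokens = torch.randn(2, 100, 768)
video_feats = torch.randn(2, 25, 512)
out = pe(audio_tokens, video_feats)          # (2, 100, 768)
```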
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
Jha, Saurav, Yang, Shiqi, Ishii, Masato, Zhao, Mengjie, Simon, Christian, Mirza, Muhammad Jehanzeb, Gong, Dong, Yao, Lina, Takahashi, Shusuke, Mitsufuji, Yuki
Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts, but only one at a time, with no access to data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones -- a challenge that continual personalization (CP) aims to solve. Inspired by successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Namely, we propose using DC scores to regularize the parameter space and function space of text-to-image diffusion models to achieve continual personalization. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art.
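A hedged sketch of how diffusion-classifier (DC) scores could serve as a function-space regularizer is given below: the DC score for a prompt is approximated by the negative denoising error under that prompt, and drift of these scores on previously learned concepts, relative to a frozen copy of the model, is penalized. The UNet call signature, scheduler variables, and prompt embeddings are placeholders, not the paper's code.

```python
# Hedged sketch: DC scores as a regularizer for continual personalization.
import torch
import torch.nn.functional as F

def dc_score(unet, x0, prompt_emb, alphas_cumprod, n_timesteps=8):
    """Approximate the class-conditional fit of x0 to a prompt via the
    average denoising error; alphas_cumprod is a (T,) tensor."""
    errs = []
    for _ in range(n_timesteps):
        t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
        noise = torch.randn_like(x0)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * noise
        pred = unet(xt, t, prompt_emb)
        errs.append(F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3)))
    return -torch.stack(errs).mean(dim=0)   # higher = better fit to the prompt

def function_space_reg(unet_new, unet_old, x0, old_prompt_embs, alphas_cumprod):
    """Penalize drift of DC scores on previously learned concepts."""
    reg = 0.0
    for emb in old_prompt_embs:
        s_new = dc_score(unet_new, x0, emb, alphas_cumprod)
        with torch.no_grad():
            s_old = dc_score(unet_old, x0, emb, alphas_cumprod)
        reg = reg + F.mse_loss(s_new, s_old)
    return reg
```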
Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
Hayakawa, Akio, Ishii, Masato, Shibuya, Takashi, Mitsufuji, Yuki
In this study, we aim to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides each single-modal model to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module that adjusts the scores separately estimated by the base models to match the score of the joint distribution over audio and video. We theoretically show that this guidance can be computed through the gradient of the optimal discriminator that distinguishes real audio-video pairs from fake ones generated independently by the base models. On the basis of this analysis, we construct the joint guidance module by training this discriminator. Additionally, we adopt a loss function that makes the gradient of the discriminator work as a noise estimator, as in standard diffusion models, which stabilizes the guidance. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multi-modal alignment with a relatively small number of parameters.
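The connection between the discriminator and the joint score can be sketched with the standard density-ratio identity; the notation below is ours for illustration and may differ from the paper's.

```latex
% Density-ratio view of the joint guidance (illustrative notation).
% The optimal discriminator D^* between real pairs (x_a, x_v) and pairs
% generated independently by the base models satisfies
\frac{D^*(x_a, x_v)}{1 - D^*(x_a, x_v)} = \frac{p(x_a, x_v)}{p(x_a)\,p(x_v)},
% so the joint score with respect to the audio variable decomposes as
\nabla_{x_a} \log p(x_a, x_v)
  = \nabla_{x_a} \log p(x_a)
  + \nabla_{x_a} \log \frac{D^*(x_a, x_v)}{1 - D^*(x_a, x_v)},
% with the symmetric expression for x_v: the first term is the base model's
% score, the second the learned joint-guidance correction.
```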
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Yang, Shiqi, Zhong, Zhi, Zhao, Mengjie, Takahashi, Shusuke, Ishii, Masato, Shibuya, Takashi, Mitsufuji, Yuki
In recent years, with their realistic generation results and a wide range of personalized applications, diffusion-based generative models have gained significant attention in both visual and audio generation. Compared to the considerable advancements in text2image and text2audio generation, research on audio2visual and visual2audio generation has progressed relatively slowly. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in a mask-denoising manner. After training, classifier-free guidance can be deployed off the shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality-symmetric, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at this link.
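To make the mask-denoising training and the off-the-shelf classifier-free guidance concrete, below is a minimal illustrative sketch on discrete VQ token sequences. The transformer interface, vocabulary size, mask-token id, and guidance scale are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch: mask-denoising training on VQ tokens and
# classifier-free guidance (CFG) applied at sampling time.
import torch
import torch.nn.functional as F

MASK_ID = 1024  # assumed id reserved for the [MASK] token

def mask_denoise_loss(transformer, audio_tokens, image_tokens):
    # audio_tokens: (B, T) discrete VQ ids; image_tokens: (B, N) discrete VQ ids
    ratio = torch.rand(1).item()                               # random mask ratio
    mask = torch.rand_like(audio_tokens, dtype=torch.float) < ratio
    corrupted = audio_tokens.masked_fill(mask, MASK_ID)
    logits = transformer(corrupted, image_tokens)              # (B, T, vocab)
    return F.cross_entropy(logits[mask], audio_tokens[mask])   # predict masked ids

def cfg_logits(transformer, corrupted, image_tokens, null_tokens, scale=3.0):
    # CFG: push conditional logits away from the unconditional ones
    cond = transformer(corrupted, image_tokens)
    uncond = transformer(corrupted, null_tokens)
    return uncond + scale * (cond - uncond)
```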
Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models
Matsunaga, Naoki, Ishii, Masato, Hayakawa, Akio, Suzuki, Kenji, Narihira, Takuya
Our goal is to develop fine-grained real-image editing methods suitable for real-world applications. In this paper, we first summarize four requirements for such methods and propose a novel diffusion-based image editing framework with pixel-wise guidance that satisfies them. Specifically, we train pixel classifiers with a small amount of annotated data and then infer the segmentation map of a target image. Users then manipulate the map to specify how the image should be edited. We utilize a pre-trained diffusion model to generate edited images aligned with the user's intention via pixel-wise guidance. The effective combination of the proposed guidance and other techniques enables highly controllable editing while preserving the region outside the edited area, thereby meeting our requirements. Experimental results demonstrate that our proposal outperforms the GAN-based method in both editing quality and speed.
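A minimal sketch of the general pixel-wise guidance pattern follows: at each sampling step a pixel classifier predicts a segmentation map for the current sample, and the sample is nudged so that the prediction matches the user-edited target map. The classifier input, guidance scale, and where in the sampling loop this is applied are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of a pixel-wise guidance step during diffusion sampling.
import torch
import torch.nn.functional as F

def pixelwise_guidance(x_t, target_map, pixel_classifier, scale=1.0):
    # x_t: (B, C, H, W) current sample; target_map: (B, H, W) edited class ids
    x_t = x_t.detach().requires_grad_(True)
    logits = pixel_classifier(x_t)                 # (B, num_classes, H, W)
    loss = F.cross_entropy(logits, target_map)     # mismatch with the edited map
    grad = torch.autograd.grad(loss, x_t)[0]
    return x_t.detach() - scale * grad             # gradient step toward the target
```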
Zero-shot Domain Adaptation Based on Attribute Information
Ishii, Masato, Takenouchi, Takashi, Sugiyama, Masashi
In many algorithms for supervised learning, it is assumed that the training data are obtained from the same distribution as the test data [1]. Unfortunately, this assumption is often violated in practical applications. For example, Figure 1 shows images from two different surveillance videos obtained from the Video Surveillance Online Repository [2]. Suppose we want to recognize vehicles in these videos. Since the position and pose of the camera differ, the appearance of the vehicles is somewhat different between the two videos. Due to this difference, even if we train a highly accurate classifier on video A, it may perform poorly on video B. Such a discrepancy has recently become a major problem in pattern recognition, because it is often difficult to obtain training data that are sufficiently similar to the test data. To deal with this problem, domain adaptation techniques have been proposed.