Goto

Collaborating Authors: Takahashi, Shusuke


OpenMU: Your Swiss Army Knife for Music Understanding

arXiv.org Artificial Intelligence

We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.


Music Foundation Model as Generic Booster for Music Downstream Tasks

arXiv.org Artificial Intelligence

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

Figure 1: SoniDo extracts hierarchical features of target music samples, which are useful for solving music downstream tasks including understanding and generative tasks.
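For illustration, here is a minimal, self-contained sketch of the general recipe described above: freeze an encoder, pool intermediate features from each of its stages, and train a small downstream head on the concatenated hierarchical features. ToyHierarchicalEncoder and TaggingProbe are hypothetical stand-ins, not SoniDo's architecture.

```python
# Sketch: boosting a downstream task model with hierarchical features from a
# frozen foundation model. The encoder is a toy stand-in for a pretrained MFM.
import torch
import torch.nn as nn

class ToyHierarchicalEncoder(nn.Module):
    """Stand-in for a frozen music foundation model with several stages."""
    def __init__(self, dim=64):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Sequential(nn.Conv1d(1 if i == 0 else dim, dim, 9, stride=4, padding=4),
                           nn.GELU()) for i in range(3)]
        )

    def forward(self, wav):                     # wav: (batch, samples)
        x = wav.unsqueeze(1)
        feats = []
        for stage in self.stages:               # one feature map per hierarchy level
            x = stage(x)
            feats.append(x.mean(dim=-1))        # time-pool each level: (batch, dim)
        return feats

class TaggingProbe(nn.Module):
    """Downstream tagging head fed with concatenated hierarchical features."""
    def __init__(self, dim=64, levels=3, n_tags=10):
        super().__init__()
        self.head = nn.Linear(dim * levels, n_tags)

    def forward(self, feats):
        return self.head(torch.cat(feats, dim=-1))

encoder = ToyHierarchicalEncoder().eval()       # foundation model stays frozen
for p in encoder.parameters():
    p.requires_grad_(False)

probe = TaggingProbe()
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
wav = torch.randn(8, 16000)                     # dummy 1-second clips
tags = torch.randint(0, 2, (8, 10)).float()     # dummy multi-label targets

with torch.no_grad():
    feats = encoder(wav)
loss = nn.functional.binary_cross_entropy_with_logits(probe(feats), tags)
loss.backward()
opt.step()
print(f"tagging probe loss: {loss.item():.3f}")
```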


Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

arXiv.org Artificial Intelligence

Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts, but only one at a time, with no access to data from previous concepts due to storage or privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones -- a challenge that continual personalization (CP) aims to solve. Inspired by successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Specifically, we propose using DC scores to regularize the parameter space and function space of text-to-image diffusion models to achieve continual personalization. Across several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and zero parameter overhead, respectively, over the state of the art.
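The sketch below is a hypothetical toy example, not the paper's implementation: it shows the general shape of continual fine-tuning with a parameter-space penalty toward the previous weights and a function-space penalty that matches a frozen snapshot on previous-concept prompts. The actual diffusion-classifier-score formulation is more involved; TinyDenoiser, the loss weights, and the random tensors are all placeholders.

```python
# Sketch: continual personalization with parameter-space and function-space
# regularization (generic illustration, not the paper's DC-score method).
import copy
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a conditional noise-prediction network."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t, cond):
        return self.net(torch.cat([x_t, cond], dim=-1))

dim = 32
model = TinyDenoiser(dim)
old_model = copy.deepcopy(model).eval()          # frozen snapshot after previous concept
old_params = [p.detach().clone() for p in model.parameters()]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

lam_param, lam_func = 1e-2, 1.0
x_new, cond_new = torch.randn(16, dim), torch.randn(16, dim)   # current concept batch
cond_prev = torch.randn(16, dim)                               # previous-concept prompts

for step in range(100):
    opt.zero_grad()
    noise = torch.randn_like(x_new)
    task_loss = nn.functional.mse_loss(model(x_new + noise, cond_new), noise)

    # parameter-space regularization: stay close to the previous weights
    param_reg = sum(((p - q) ** 2).sum() for p, q in zip(model.parameters(), old_params))

    # function-space regularization: match the frozen model's predictions on
    # previous-concept conditioning (replay-free: prompts only, no stored images)
    x_probe = torch.randn(16, dim)
    with torch.no_grad():
        target = old_model(x_probe, cond_prev)
    func_reg = nn.functional.mse_loss(model(x_probe, cond_prev), target)

    (task_loss + lam_param * param_reg + lam_func * func_reg).backward()
    opt.step()
```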


Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

arXiv.org Artificial Intelligence

In recent years, with their realistic generation results and a wide range of personalized applications, diffusion-based generative models have attracted considerable attention in both the visual and audio generation areas. Compared to the substantial progress in text2image and text2audio generation, research on audio2visual and visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing yet another giant model for audio-visual generation, in this paper we take a step back and show that a simple, lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in a mask-denoising manner. After training, classifier-free guidance can be deployed off the shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality-symmetric, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at this link.
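As a rough illustration of mask-denoising training over discrete tokens and off-the-shelf classifier-free guidance, here is a toy sketch; TinyMaskTransformer, the vocabulary size, and the guidance scale are assumptions, and random integers stand in for VQ-GAN codes.

```python
# Sketch: masked-token denoising over discrete codes plus classifier-free
# guidance at inference (illustrative only, not the paper's model).
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ = 512, 512, 64               # MASK_ID is an extra token id

class TinyMaskTransformer(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, d)     # +1 for the mask token
        self.cond_emb = nn.Embedding(VOCAB, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d, VOCAB)

    def forward(self, tokens, cond_tokens=None):
        x = self.emb(tokens)
        if cond_tokens is not None:               # condition on the other modality
            x = x + self.cond_emb(cond_tokens).mean(dim=1, keepdim=True)
        return self.out(self.enc(x))              # (batch, seq, vocab)

model = TinyMaskTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

audio = torch.randint(0, VOCAB, (8, SEQ))         # target-modality tokens
image = torch.randint(0, VOCAB, (8, SEQ))         # conditioning-modality tokens

# mask-denoising training step: hide a random subset and predict the originals
mask = torch.rand(8, SEQ) < 0.5
inp = audio.masked_fill(mask, MASK_ID)
logits = model(inp, image)
loss = nn.functional.cross_entropy(logits[mask], audio[mask])
loss.backward()
opt.step()

# classifier-free guidance at inference: blend conditional and unconditional
# logits without any extra training or model modification
with torch.no_grad():
    cond_logits = model(inp, image)
    uncond_logits = model(inp, None)
    w = 2.0                                        # guidance scale (assumed value)
    guided = uncond_logits + w * (cond_logits - uncond_logits)
    pred = guided.argmax(dim=-1)
```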


Manifold-Aware Deep Clustering: Maximizing Angles between Embedding Vectors Based on Regular Simplex

arXiv.org Artificial Intelligence

This paper presents a new deep clustering (DC) method called manifold-aware DC (M-DC) that can enhance hyperspace utilization more effectively than the original DC. The original DC is limited in that any pair of speakers must be embedded with an orthogonal relationship because of its one-hot vector-based loss function, whereas our method derives a unique loss function aimed at maximizing the target angle in the hyperspace based on the nature of a regular simplex. Our proposed loss imposes a higher penalty than the original DC when a speaker is assigned incorrectly. The change from DC to M-DC can be achieved simply by rewriting one term in the loss function of DC, without any other modifications to the network architecture or model parameters. As such, our method is highly practical because it does not affect the original inference part. Experimental results show that the proposed method improves the performance of the original DC and its extension.
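The contrast between one-hot and regular-simplex targets can be made concrete with a small sketch. The affinity-style loss and the simplex construction below are standard illustrative choices, not the paper's exact formulation: with one-hot targets, embeddings of different speakers are pushed toward orthogonality (cosine 0), whereas regular-simplex targets push them toward the wider pairwise cosine of -1/(k-1).

```python
# Sketch: swapping one-hot deep-clustering targets for regular-simplex targets.
import torch

def regular_simplex(k):
    """Return k unit vectors whose pairwise cosine is -1/(k-1)."""
    eye = torch.eye(k)
    centered = eye - eye.mean(dim=0, keepdim=True)    # center the one-hot vertices
    return centered / centered.norm(dim=1, keepdim=True)

def affinity_loss(v, y):
    """Deep-clustering-style loss: match embedding affinities to target affinities."""
    return ((v @ v.t() - y @ y.t()) ** 2).mean()

k, t, d = 3, 100, 20                                   # speakers, T-F bins, embed dim
assign = torch.randint(0, k, (t,))                     # which speaker owns each bin
v = torch.nn.functional.normalize(torch.randn(t, d), dim=1)

y_onehot = torch.eye(k)[assign]                        # original DC targets
y_simplex = regular_simplex(k)[assign]                 # M-DC-style targets

print("cosine between simplex targets:",
      (regular_simplex(k) @ regular_simplex(k).t()).round(decimals=2))
print("DC loss:", affinity_loss(v, y_onehot).item())
print("M-DC loss:", affinity_loss(v, y_simplex).item())
```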


Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

arXiv.org Artificial Intelligence

Although deep neural network (DNN)-based speech enhancement (SE) methods outperform previous non-DNN-based ones, they often degrade the perceptual quality of the generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, that aims to improve the perceptual quality of speech already processed by an SE method. We train a diffusion-based generative model using a dataset consisting of clean speech only. Our refiner then effectively mixes clean parts, newly generated via denoising diffusion restoration, into the parts degraded and distorted by the preceding SE method, resulting in refined speech. Once the refiner is trained on a set of clean speech, it can be applied to various SE methods without additional training specialized for each SE module. Therefore, our refiner can serve as a versatile post-processing module with respect to SE methods and has high potential in terms of modularity. Experimental results show that our method improves perceptual speech quality regardless of the preceding SE method used.
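As a loose illustration of refining an SE output with a clean-speech diffusion prior, the toy sketch below partially noises the SE output and denoises it back with a DDIM-style loop (an SDEdit-like variant, not the paper's denoising diffusion restoration procedure); the denoiser, noise schedule, and spectral-frame shapes are placeholders.

```python
# Sketch: post-hoc generative refinement of a pre-enhanced signal with a
# (toy) diffusion prior trained on clean speech only.
import torch
import torch.nn as nn

T = 50
betas = torch.linspace(1e-4, 0.05, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(257 + 1, 256), nn.SiLU(), nn.Linear(256, 257))

def predict_noise(x_t, t):
    """Toy epsilon-predictor over spectral frames; would be trained on clean speech."""
    t_feat = torch.full((x_t.shape[0], 1), float(t) / T)
    return denoiser(torch.cat([x_t, t_feat], dim=-1))

def refine(se_output, t_start=20):
    """Refine an SE system's output: partially noise it, then denoise it back."""
    x = alpha_bar[t_start].sqrt() * se_output + \
        (1 - alpha_bar[t_start]).sqrt() * torch.randn_like(se_output)
    for t in range(t_start, -1, -1):               # deterministic DDIM-style steps
        eps = predict_noise(x, t)
        x0_hat = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        if t > 0:
            x = alpha_bar[t - 1].sqrt() * x0_hat + (1 - alpha_bar[t - 1]).sqrt() * eps
        else:
            x = x0_hat
    return x

se_frames = torch.randn(4, 257)       # stand-in for pre-enhanced spectral frames
print(refine(se_frames).shape)
```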


Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

arXiv.org Artificial Intelligence

Diffusion-based speech enhancement (SE) has been investigated recently, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the complementarity between predictive and diffusion SE. In this paper, we propose a unified system that integrates these two SE modules. The system encodes both generative and predictive information and then applies both a generative and a predictive decoder, whose outputs are fused. Specifically, the two SE modules are fused at the first and final diffusion steps: the first-step fusion initializes the diffusion process with the predictive SE output to improve convergence, and the final-step fusion combines the two complementary SE outputs to improve SE performance. Experiments on the Voice-Bank dataset show that diffusion score estimation can benefit from the predictive information and speed up the decoding.
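A minimal sketch of the two fusion points, under heavy simplification: the reverse process is initialized from the predictive estimate rather than pure noise, and the final output blends the generative and predictive results. The predictive enhancer, reverse step, and fusion weight below are placeholders, not the paper's networks.

```python
# Sketch: first-step and final-step fusion of predictive and generative SE.
import torch

T = 30
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.05, T), dim=0)

def predictive_se(noisy):
    return 0.8 * noisy                      # placeholder predictive enhancer

def reverse_step(x, t, noisy):
    # placeholder generative reverse step conditioned on the noisy mixture
    return x + 0.1 * (predictive_se(noisy) - x)

def enhance(noisy, w_final=0.5):
    pred = predictive_se(noisy)
    # first-step fusion: start the diffusion from the predictive estimate
    x = alpha_bar[-1].sqrt() * pred + (1 - alpha_bar[-1]).sqrt() * torch.randn_like(pred)
    for t in reversed(range(T)):
        x = reverse_step(x, t, noisy)
    # final-step fusion: combine the complementary generative and predictive outputs
    return w_final * x + (1 - w_final) * pred

noisy = torch.randn(4, 16000)
print(enhance(noisy).shape)
```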


Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

arXiv.org Artificial Intelligence

In this paper, we propose a deep neural network (DNN)-based speech enhancement (SE) method that aims to maximize the performance of an automatic speech recognition (ASR) system. To optimize the DNN-based SE model with respect to the character error rate (CER), which is one of the metrics used to evaluate ASR systems and is generally non-differentiable, our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM). The two DNNs are then optimized alternately during the training phase. Even if the AM is a black box, e.g., one provided by a third party, the proposed method allows the DNN-based SE model to be optimized with respect to the CER, since the DNN mimicking the AM is differentiable. Consequently, it becomes feasible to build a CER-centric SE model that has no negative effect on the inference phase, e.g., additional computational cost or changes to the network architecture, since our method is merely a training scheme for existing DNN-based methods. Experimental results show that our method improved the CER by 7.3% relative when evaluated through a black-box AM, although a certain level of noise remains in the enhanced speech.
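The alternating scheme can be sketched as follows, with a synthetic stand-in for the black-box CER and placeholder architectures: the mimic network learns to predict the recognizer's error from enhanced features, and the SE network is then trained through the differentiable mimic.

```python
# Sketch: alternating optimization of an SE network and a differentiable
# CER-mimic network (illustrative only; the black-box CER is synthetic).
import torch
import torch.nn as nn

se_net = nn.Sequential(nn.Linear(257, 256), nn.ReLU(), nn.Linear(256, 257))
mimic = nn.Sequential(nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 1))
opt_se = torch.optim.Adam(se_net.parameters(), lr=1e-3)
opt_mimic = torch.optim.Adam(mimic.parameters(), lr=1e-3)

def black_box_cer(feats):
    """Stand-in for the non-differentiable CER from a third-party acoustic model."""
    with torch.no_grad():
        return feats.abs().mean(dim=-1, keepdim=True)   # synthetic "error rate"

noisy = torch.randn(32, 257)

for step in range(200):
    # phase 1: train the mimic to match the black-box CER on current SE outputs
    enhanced = se_net(noisy).detach()
    cer_target = black_box_cer(enhanced)
    mimic_loss = nn.functional.mse_loss(mimic(enhanced), cer_target)
    opt_mimic.zero_grad()
    mimic_loss.backward()
    opt_mimic.step()

    # phase 2: train the SE network to minimize the mimic's predicted CER
    pred_cer = mimic(se_net(noisy))
    se_loss = pred_cer.mean()
    opt_se.zero_grad()
    se_loss.backward()
    opt_se.step()
```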