Collaborating Authors

Mun, Seongkyu


AdaptVC: High Quality Voice Conversion with Adaptive Learning

arXiv.org Artificial Intelligence

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised representations, and the decoder fuses them to produce speech that closely resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further improve synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
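As a rough illustration of the two ideas in this abstract, the following PyTorch sketch pairs a bottleneck adapter over frozen self-supervised (SSL) features with a cross-attention block that conditions content frames on reference-speaker frames. The layer-weighting scheme, module names, and sizes are illustrative assumptions, not AdaptVC's actual architecture.

```python
# Minimal sketch, assuming: adapters re-weight and refine frozen SSL features,
# and the decoder conditions on the reference speaker via cross-attention.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter applied to frozen SSL features (assumed design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection

class LayerMix(nn.Module):
    """Learned softmax weights over the SSL model's hidden layers, so each
    branch can emphasize content- or speaker-heavy layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layers: torch.Tensor) -> torch.Tensor:
        # layers: (num_layers, batch, time, dim)
        w = torch.softmax(self.w, dim=0).view(-1, 1, 1, 1)
        return (w * layers).sum(dim=0)

class CrossAttnSpeakerCond(nn.Module):
    """Decoder step: content frames (queries) attend to reference frames."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, content: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query=content, key=speaker, value=speaker)
        return self.norm(content + out)

# Toy shapes: 12 SSL layers, 2 utterances, 100 frames, 768-dim features.
ssl_layers = torch.randn(12, 2, 100, 768)
content = Adapter(768)(LayerMix(12)(ssl_layers))     # content branch (source)
speaker = Adapter(768)(LayerMix(12)(ssl_layers))     # speaker branch (reference)
fused = CrossAttnSpeakerCond(768)(content, speaker)  # (2, 100, 768)
print(fused.shape)
```

In a full system the fused representation would drive the conditional flow matching decoder; that part is omitted here.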


When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining

arXiv.org Artificial Intelligence

Recent studies show that pretrained vision models can boost performance in audio downstream tasks. To enhance performance further, an additional pretraining stage with large-scale audio data is typically required to infuse audio-specific knowledge into the vision model. However, such approaches require extensive audio data and a carefully designed objective function. In this work, we propose bypassing the pretraining stage by directly fine-tuning the vision model with our Look-Aside Adapter (LoAA), designed for efficient audio understanding. Audio spectrum data spans two heterogeneous dimensions, time and frequency, and we refine the adapters to facilitate interactions between tokens across these dimensions. Our experiments demonstrate that our adapters allow vision models to reach or surpass the performance of pretrained audio models in various audio and speech tasks, offering a resource-efficient and effective solution for leveraging vision models in audio tasks.

[Figure 1: An illustration of our simplified approach for audio classification. Our newly proposed Parameter Efficient Fine-Tuning (PEFT) paradigm for audio classification is a direct adaptation to downstream tasks in a singular stage.]
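The following PyTorch sketch illustrates the general idea of a look-aside-style adapter on a spectrogram token grid: a bottleneck whose middle operation is a depthwise convolution along either the time or the frequency axis, so tokens interact across the two heterogeneous dimensions. Kernel size, bottleneck width, and placement are assumptions; the paper's exact LoAA design may differ.

```python
# Hedged sketch of a look-aside adapter over a frozen vision transformer's
# spectrogram patch tokens; the axis-wise depthwise conv is an assumption.
import torch
import torch.nn as nn

class LookAsideAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 48, kernel: int = 3,
                 axis: str = "time"):
        super().__init__()
        assert axis in ("time", "freq")
        self.axis = axis
        self.down = nn.Linear(dim, bottleneck)
        # Depthwise conv mixes neighboring tokens along one axis only.
        self.mix = nn.Conv1d(bottleneck, bottleneck, kernel,
                             padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_tokens, freq_tokens, dim) patch grid from the ViT.
        b, t, f, d = x.shape
        h = self.down(x)                                      # bottleneck
        if self.axis == "time":
            h = h.permute(0, 2, 3, 1).reshape(b * f, -1, t)   # conv over time
            h = self.mix(h).reshape(b, f, -1, t).permute(0, 3, 1, 2)
        else:
            h = h.permute(0, 1, 3, 2).reshape(b * t, -1, f)   # conv over freq
            h = self.mix(h).reshape(b, t, -1, f).permute(0, 1, 3, 2)
        return x + self.up(self.act(h))  # residual look-aside path

# Toy grid: 2 clips, 64 time tokens x 8 frequency tokens, 768-dim embeddings.
tokens = torch.randn(2, 64, 8, 768)
out = LookAsideAdapter(768, axis="freq")(LookAsideAdapter(768, axis="time")(tokens))
print(out.shape)  # torch.Size([2, 64, 8, 768])
```

Only the adapter parameters are trained; the backbone stays frozen, which is what makes this a parameter-efficient, single-stage adaptation.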


Into-TTS: Intonation Template Based Prosody Control System

arXiv.org Artificial Intelligence

Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervised manner. Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder. The intonation predictor recommends a suitable intonation template for the given text. The intonation encoder, attached to the text encoder output, synthesizes speech that abides by the requested intonation template. The main contributions of our paper are: (a) an easy-to-use intonation control system covering a wide range of users; (b) better performance in rendering speech with the requested intonation, as shown by improved objective and subjective evaluations; and (c) incorporating a pre-trained language model for intonation modelling. Audio samples are available at https://srtts.github.io/IntoTTS.
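To make the pipeline concrete, here is a hedged sketch of how intonation templates might be formed and consumed: k-means over length-normalized F0 (pitch) contours stands in for the unsupervised grouping, and an embedding added to the text-encoder output stands in for the intonation encoder. The features, clustering choices, and the text-based intonation predictor are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch, assuming templates come from clustering resampled,
# mean-normalized F0 contours; the real system's features may differ.
import numpy as np
from sklearn.cluster import KMeans
import torch
import torch.nn as nn

def make_templates(f0_contours, k=6):
    """Unsupervised grouping of pitch contours into k intonation templates."""
    fixed = np.stack([
        np.interp(np.linspace(0, 1, 32), np.linspace(0, 1, len(c)), c)
        for c in f0_contours
    ])                                                 # resample to 32 points
    fixed = fixed - fixed.mean(axis=1, keepdims=True)  # remove mean pitch level
    return KMeans(n_clusters=k, n_init=10).fit(fixed)

class IntonationEncoder(nn.Module):
    """Adds a learned template embedding to every text-encoder frame."""
    def __init__(self, num_templates, dim):
        super().__init__()
        self.emb = nn.Embedding(num_templates, dim)

    def forward(self, text_enc, template_id):
        # text_enc: (batch, length, dim); template_id: (batch,)
        return text_enc + self.emb(template_id).unsqueeze(1)

# Toy usage: cluster 50 random F0 contours, then condition a text encoding
# on chosen template ids (a real system would get the ids from a predictor).
contours = [120.0 + 40.0 * np.random.rand(np.random.randint(40, 80))
            for _ in range(50)]
km = make_templates(contours, k=6)
enc = IntonationEncoder(num_templates=6, dim=256)
out = enc(torch.randn(2, 50, 256), torch.tensor([0, 3]))
print(km.cluster_centers_.shape, out.shape)  # (6, 32) torch.Size([2, 50, 256])
```

At inference, the intonation predictor (omitted here) would map the input text to a template id, giving users a small, interpretable set of intonation controls.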