AITopics | Jung, Jee-weon

Collaborating Authors

Jung, Jee-weon

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AugSumm: towards generalizable speech summarization using synthetic labels from large language model

Jung, Jee-weon, Sharma, Roshan, Chen, William, Raj, Bhiksha, Watanabe, Shinji

arXiv.org Artificial IntelligenceJan-10-2024

Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary for every recording. Generating multiple human references would be ideal to better represent the distribution statistically, but is impractical because annotation is expensive. We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation. First, we explore prompting strategies to generate synthetic summaries from ChatGPT. We validate the quality of synthetic summaries using multiple metrics including human evaluation, where we find that summaries generated using AugSumm are perceived as more valid to humans. Second, we develop methods to utilize synthetic summaries in training and evaluation. Experiments on How2 demonstrate that pre-training on synthetic summaries and fine-tuning on GT summaries improves ROUGE-L by 1 point on both GT and AugSumm-based test sets. AugSumm summaries are available at https://github.com/Jungjee/AugSumm.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2401.06806

Country:

North America > United States (0.14)
Asia (0.14)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area (0.52)
Health & Medicine > Consumer Health (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Understanding Probe Behaviors through Variational Bounds of Mutual Information

Choi, Kwanghee, Jung, Jee-weon, Watanabe, Shinji

arXiv.org Artificial IntelligenceDec-15-2023

With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that the intermediate representations can have the biggest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation." We also compare accuracy with MI as the measuring criteria. Finally, we empirically validate our claims by observing the self-supervised speech models on retaining word and phoneme information.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2312.10019

Country:

Europe (0.93)
North America > United States (0.46)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Peng, Yifan, Tian, Jinchuan, Yan, Brian, Berrebbi, Dan, Chang, Xuankai, Li, Xinjian, Shi, Jiatong, Arora, Siddhant, Chen, William, Sharma, Roshan, Zhang, Wangyou, Sudo, Yui, Shakeel, Muhammad, Jung, Jee-weon, Maiti, Soumi, Watanabe, Shinji

arXiv.org Artificial IntelligenceOct-24-2023

Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.

large language model, machine learning, reproducing whisper-style training, (4 more...)

arXiv.org Artificial Intelligence

2309.13876

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
Information Technology > Artificial Intelligence > Machine Learning (0.53)

Add feedback

UniverSLU: Universal Spoken Language Understanding for Diverse Classification and Sequence Generation Tasks with a Single Network

Arora, Siddhant, Futami, Hayato, Jung, Jee-weon, Peng, Yifan, Sharma, Roshan, Kashiwagi, Yosuke, Tsunoo, Emiru, Watanabe, Shinji

arXiv.org Artificial IntelligenceOct-4-2023

Recent studies have demonstrated promising outcomes by employing large language models with multi-tasking capabilities. They utilize prompts to guide the model's behavior and surpass performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly perform various spoken language understanding (SLU) tasks? To address this, we utilize pre-trained automatic speech recognition (ASR) models and employ various task and dataset specifiers as discrete prompts. We demonstrate efficacy of our single multi-task learning (MTL) model "UniverSLU" for 12 different speech classification and sequence generation tasks across 17 datasets and 9 languages. Results show that UniverSLU achieves competitive performance and even surpasses task-specific models. We also conduct preliminary investigations into enabling human-interpretable natural phrases instead of task specifiers as discrete prompts and test the model's generalization capabilities to new paraphrases.

artificial intelligence, natural language, speech recognition, (4 more...)

arXiv.org Artificial Intelligence

2310.02973

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Cornell, Samuele, Jung, Jee-weon, Watanabe, Shinji, Squartini, Stefano

arXiv.org Artificial IntelligenceOct-2-2023

This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and ``Whisper-style" prompting. The local outputs are then combined to get the final SD+ASR result by clustering the speaker embeddings to get global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.

artificial intelligence, recognition, speech recognition, (17 more...)

arXiv.org Artificial Intelligence

2310.01688

Country: Europe > Italy (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Encoder-decoder multimodal speaker change detection

Jung, Jee-weon, Seo, Soonshin, Heo, Hee-Soo, Kim, Geonmin, Kim, You Jin, Kwon, Young-ki, Lee, Minjae, Lee, Bong-Jin

arXiv.org Artificial IntelligenceJun-1-2023

The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model are built upon two main proposals, a novel mechanism for modality fusion and the adoption of a encoder-decoder architecture. Different to previous MMSCD works that extract speaker embeddings from extremely short audio segments, aligned to a single word, we use a speaker embedding extracted from 1.5s. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2306.0068

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.92)

Add feedback

Towards single integrated spoofing-aware speaker verification embeddings

Mun, Sung Hwan, Shim, Hye-jin, Tak, Hemlata, Wang, Xin, Liu, Xuechen, Sahidullah, Md, Jeong, Myeonghun, Han, Min Hyun, Todisco, Massimiliano, Lee, Kong Aik, Yamagishi, Junichi, Evans, Nicholas, Kinnunen, Tomi, Kim, Nam Soo, Jung, Jee-weon

arXiv.org Artificial IntelligenceJun-1-2023

This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data and distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

artificial intelligence, machine learning, speaker verification, (19 more...)

arXiv.org Artificial Intelligence

2305.19051

Country:

Asia > South Korea (0.14)
North America > United States (0.14)

Genre: Research Report (0.70)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.85)

Add feedback

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

Huh, Jaesung, Brown, Andrew, Jung, Jee-weon, Chung, Joon Son, Nagrani, Arsha, Garcia-Romero, Daniel, Zisserman, Andrew

arXiv.org Artificial IntelligenceMar-6-2023

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.

arxiv preprint arxiv, machine learning, pattern recognition, (17 more...)

arXiv.org Artificial Intelligence

2302.10248

Country:

Asia > South Korea (0.14)
Asia > China (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)

Add feedback

In search of strong embedding extractors for speaker diarisation

Jung, Jee-weon, Heo, Hee-Soo, Lee, Bong-Jin, Huh, Jaesung, Brown, Andrew, Kwon, Youngki, Watanabe, Shinji, Chung, Joon Son

arXiv.org Artificial IntelligenceOct-26-2022

Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation scenario better. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input. One technique generates overlapped speech segments, and the other generates segments where two speakers utter sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective.

artificial intelligence, diarisation, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2210.14682

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Multi-scale speaker embedding-based graph attention networks for speaker diarisation

Kwon, Youngki, Heo, Hee-Soo, Jung, Jee-weon, Kim, You Jin, Lee, Bong-Jin, Chung, Joon Son

arXiv.org Artificial IntelligenceOct-7-2021

The objective of this work is effective speaker diarisation using multi-scale speaker embeddings. Typically, there is a trade-off between the ability to recognise short speaker segments and the discriminative power of the embedding, according to the segment length used for embedding extraction. To this end, recent works have proposed the use of multi-scale embeddings where segments with varying lengths are used. However, the scores are combined using a weighted summation scheme where the weights are fixed after the training phase, whereas the importance of segment lengths can differ with in a single session. To address this issue, we present three key contributions in this paper: (1) we propose graph attention networks for multi-scale speaker diarisation; (2) we design scale indicators to utilise scale information of each embedding; (3) we adapt the attention-based aggregation to utilise a pre-computed affinity matrix from multi-scale embeddings. We demonstrate the effectiveness of our method in various datasets where the speaker confusion which constitutes the primary metric drops over 10% in average relative compared to the baseline.

artificial intelligence, baseline, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2110.03361

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback