AITopics | second speaker

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Arora, Siddhant, Lu, Zhiyun, Chiu, Chung-Cheng, Pang, Ruoming, Watanabe, Shinji

arXiv.org Artificial Intelligence, Mar-2-2025

The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this, we propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events and reveal many interesting insights, for example that these systems sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events, and we identify significant room for improvement. We will open source our evaluation platform to promote the development of advanced conversational AI systems.

backchannel, interruption, turn-taking event

arXiv: 2503.01174

Country:

North America > United States > Rhode Island (0.04)
Europe > Greece (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models

Wang, Bin, Zou, Xunlong, Sun, Shuo, Zhang, Wenyu, He, Yingxu, Liu, Zhuohan, Wei, Chengwei, Chen, Nancy F., Aw, AiTi

arXiv.org Artificial Intelligence, Jan-10-2025

Speech technologies have evolved over decades, progressing from modularized solutions for speech recognition (Povey et al., 2011; Radford et al., 2023), speaker identification (Togneri and Pullella, 2011), and gender recognition (Hechmi et al., 2021) with modularized toolkits like Kaldi (Povey et al., 2011) and ESPnet (Watanabe et al., 2018) to advanced solutions integrating large language models for multimodal understanding in an all-encompassing, omni-style approach (Team et al., ...

Existing Singlish spoken corpora have primarily focused on linguistic analysis and speech recognition tasks (Deterding and Low, 2001; Chen et al., 2010; Lyu et al., 2010; Tan, 2019). Given the relatively small population of Singlish speakers, estimated at just a few million, resources for Singlish speech corpora are significantly more limited compared to major languages like English, Chinese, French, and Spanish. Singapore's government agency, IMDA, has open-sourced the largest available Singlish corpus, known as the National Speech Corpus (Koh et al., 2019).

artificial intelligence, machine learning, natural language

arXiv: 2501.01034

Country: Asia > Singapore (0.58)

Genre: Research Report (1.00)

Industry: Government (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights

Yang, Hao, Qu, Lizhen, Shareghi, Ehsan, Haffari, Gholamreza

arXiv.org Artificial Intelligence, Jun-25-2024

Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a strong capability to understand multimodal information and to interact with human users. Despite the progress made, the challenge of detecting high-risk interactions in multimodal settings, and in particular in the speech modality, remains largely unexplored. Conventional research on risk in the speech modality primarily emphasises the content (e.g., what is captured as transcription). However, in speech-based interactions, paralinguistic cues in audio can significantly alter the intended meaning behind utterances. In this work, we propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity). Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs' capability in detecting these categories of risk. We observe that even the latest models remain ineffective at detecting various paralinguistic-specific risks in speech (e.g., Gemini 1.5 Pro performs only slightly above the random baseline). Warning: this paper contains biased and offensive examples.

indication, second speaker, speech

arXiv: 2406.1743

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
