AITopics | Li, Ya

Collaborating Authors

Li, Ya

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling

Chen, Keqi, Sun, Zekai, Wen, Yuhua, Lian, Huijun, Gao, Yingming, Li, Ya

arXiv.org Artificial IntelligenceMar-5-2025

The in-context learning capabilities of large language models (LLMs) show great potential in mental health support. However, the lack of counseling datasets, particularly in Chinese corpora, restricts their application in this field. To address this, we constructed Psy-Insight, the first mental health-oriented explainable multi-task bilingual dataset. We collected face-to-face multi-turn counseling dialogues, which are annotated with multi-task labels and conversation process explanations. Our annotations include psychotherapy, emotion, strategy, and topic labels, as well as turn-level reasoning and session-level guidance. Psy-Insight is not only suitable for tasks such as label recognition but also meets the need for training LLMs to act as empathetic counselors through logical reasoning. Experiments show that training LLMs on Psy-Insight enables the models to not only mimic the conversation style but also understand the underlying strategies and reasoning of counseling.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2503.03607

Country: North America > United States (0.67)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Psy-Copilot: Visual Chain of Thought for Counseling

Chen, Keqi, Sun, Zekai, Lian, Huijun, Gao, Yingming, Li, Ya

arXiv.org Artificial IntelligenceMar-5-2025

Large language models (LLMs) are becoming increasingly popular in the field of psychological counseling. However, when human therapists work with LLMs in therapy sessions, it is hard to understand how the model gives the answers. To address this, we have constructed Psy-COT, a graph designed to visualize the thought processes of LLMs during therapy sessions. The Psy-COT graph presents semi-structured counseling conversations alongside step-by-step annotations that capture the reasoning and insights of therapists. Moreover, we have developed Psy-Copilot, which is a conversational AI assistant designed to assist human psychological therapists in their consultations. It can offer traceable psycho-information based on retrieval, including response candidates, similar dialogue sessions, related strategies, and visual traces of results. We have also built an interactive platform for AI-assisted counseling. It has an interface that displays the relevant parts of the retrieval sub-graph. The Psy-Copilot is designed not to replace psychotherapists but to foster collaboration between AI and human therapists, thereby promoting mental health development. Our code and demo are both open-sourced and available for use.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2503.03645

Country:

Asia > Thailand (0.15)
Asia > China (0.15)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.36)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024

Fu, Ruibo, Liu, Rui, Qiang, Chunyu, Gao, Yingming, Lu, Yi, Wang, Tao, Li, Ya, Wen, Zhengqi, Zhang, Chen, Bu, Hui, Liu, Yukun, Shi, Shuchen, Qi, Xin, Li, Guanjun

arXiv.org Artificial IntelligenceJul-1-2024

The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions and controlled detail content remains limited. This constraint leads to a discrepancy between the generated audio and human subjective perception in practical applications like companion robots for children and marketing bots. The core issue lies in the inconsistency between high-quality audio generation and the ultimate human subjective experience. Therefore, this challenge aims to enhance the persuasiveness and acceptability of synthesized audio, focusing on human alignment convincing and inspirational audio generation.

artificial intelligence, icagc 2024, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2407.12038

Country: Asia > China (0.47)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.30)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)

Add feedback

SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

Bai, Bingsong, Wang, Fengping, Gao, Yingming, Li, Ya

arXiv.org Artificial IntelligenceJun-9-2024

Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which can enhance the voice quality in SVC tasks without requiring additional data or increasing model parameters. We innovatively introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance. Experimental results on the public singing datasets M4Singer indicate that our proposed method significantly improves model performance in both general SVC scenarios and particularly in cross-domain SVC scenarios.

artificial intelligence, machine learning, svc scenario, (16 more...)

arXiv.org Artificial Intelligence

2406.05692

Country: Asia > China (0.14)

Genre: Research Report (0.82)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Xue, Jinlong, Deng, Yayue, Han, Yicheng, Gao, Yingming, Li, Ya

arXiv.org Artificial IntelligenceJun-5-2024

Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. Besides, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use a SoundStorm to generate acoustic tokens thereby enhancing audio quality and speaker similarity. The extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.

information, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

2406.03706

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Industry: Media (0.36)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Xue, Jinlong, Deng, Yayue, Gao, Yingming, Li, Ya

arXiv.org Artificial IntelligenceJan-2-2024

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system adapting T2I model frameworks to TTA task, by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resource. Furthermore, previous studies in T2I recognizes the significant impact of encoder choice on cross-modal alignment, like fine-grained details and object bindings, while similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audios that accurately match textual descriptions, which further demonstrated in several related tasks, such as audio style transfer, inpainting and other manipulations. Our implementation and demos are available at https://auffusion.github.io.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2401.01044

Country:

Europe (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Maryland (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Frame-level emotional state alignment method for speech emotion recognition

Li, Qifei, Gao, Yingming, Wang, Cong, Deng, Yayue, Xue, Jinlong, Han, Yichen, Li, Ya

arXiv.org Artificial IntelligenceDec-26-2023

Speech emotion recognition (SER) systems aim to recognize human emotional state during human-computer interaction. Most existing SER systems are trained based on utterance-level labels. However, not all frames in an audio have affective states consistent with utterance-level label, which makes it difficult for the model to distinguish the true emotion of the audio and perform poorly. To address this problem, we propose a frame-level emotional state alignment method for SER. First, we fine-tune HuBERT model to obtain a SER system with task-adaptive pretraining (TAPT) method, and extract embeddings from its transformer layers to form frame-level pseudo-emotion labels with clustering. Then, the pseudo labels are used to pretrain HuBERT. Hence, the each frame output of HuBERT has corresponding emotional information. Finally, we fine-tune the above pretrained HuBERT for SER by adding an attention layer on the top of it, which can focus only on those frames that are emotionally more consistent with utterance-level label. The experimental results performed on IEMOCAP indicate that our proposed method performs better than state-of-the-art (SOTA) methods.

emotion recognition, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2312.16383

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.64)

Add feedback

Hypergraph Enhanced Knowledge Tree Prompt Learning for Next-Basket Recommendation

Mai, Zi-Feng, Wang, Chang-Dong, Zeng, Zhongjie, Li, Ya, Chen, Jiaquan, Yu, Philip S.

arXiv.org Artificial IntelligenceDec-25-2023

Next-basket recommendation (NBR) aims to infer the items in the next basket given the corresponding basket sequence. Existing NBR methods are mainly based on either message passing in a plain graph or transition modelling in a basket sequence. However, these methods only consider point-to-point binary item relations while item dependencies in real world scenarios are often in higher order. Additionally, the importance of the same item to different users varies due to variation of user preferences, and the relations between items usually involve various aspects. As pretrained language models (PLMs) excel in multiple tasks in natural language processing (NLP) and computer vision (CV), many researchers have made great efforts in utilizing PLMs to boost recommendation. However, existing PLM-based recommendation methods degrade when encountering Out-Of-Vocabulary (OOV) items. OOV items are those whose IDs are out of PLM's vocabulary and thus unintelligible to PLM. To settle the above challenges, we propose a novel method HEKP4NBR, which transforms the knowledge graph (KG) into prompts, namely Knowledge Tree Prompt (KTP), to help PLM encode the OOV item IDs in the user's basket sequence. A hypergraph convolutional module is designed to build a hypergraph based on item similarities measured by an MoE model from multiple aspects and then employ convolution on the hypergraph to model correlations among multiple items. Extensive experiments are conducted on HEKP4NBR on two datasets based on real company data and validate its effectiveness against multiple state-of-the-art methods.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2312.15851

Country:

Europe (1.00)
Asia > China > Guangdong Province (0.28)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > Promising Solution (0.54)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis

Deng, Yayue, Xue, Jinlong, Jia, Yukang, Li, Qifei, Han, Yichen, Wang, Fengping, Gao, Yingming, Ke, Dengfeng, Li, Ya

arXiv.org Artificial IntelligenceDec-16-2023

Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context comprehension, context representation still lacks effective representation capabilities and context-sensitive discriminability. In this paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within this framework, we define an innovative pretext task specific to CSS that enables the model to perform self-supervised learning on unlabeled conversational datasets to boost the model's context understanding. Additionally, we introduce a sampling strategy for negative sample augmentation to enhance context vectors' discriminability. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments in comparison with prior CSS systems. Results demonstrate that the synthesized speech from our proposed method exhibits more contextually appropriate and sensitive prosody.

artificial intelligence, machine learning, prosody, (15 more...)

arXiv.org Artificial Intelligence

2312.10358

Country: Asia > China (0.16)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.75)

Add feedback

Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis

Ke, Dengfeng, Deng, Yayue, Jia, Yukang, Xue, Jinlong, Luo, Qi, Li, Ya, Sun, Jianqing, Liang, Jiaen, Lin, Binghuai

arXiv.org Artificial IntelligenceJun-5-2023

Regressive Text-to-Speech (TTS) system utilizes attention mechanism to generate alignment between text and acoustic feature sequence. Alignment determines synthesis robustness (e.g, the occurence of skipping, repeating, and collapse) and rhythm via duration control. However, current attention algorithms used in speech synthesis cannot control rhythm using external duration information to generate natural speech while ensuring robustness. In this study, we propose Rhythm-controllable Attention (RC-Attention) based on Tracotron2, which improves robustness and naturalness simultaneously. Proposed attention adopts a trainable scalar learned from four kinds of information to achieve rhythm control, which makes rhythm control more robust and natural, even when synthesized sentences are extremely longer than training corpus. We use word errors counting and AB preference test to measure robustness of proposed method and naturalness of synthesized speech, respectively. Results shows that RC-Attention has the lowest word error rate of nearly 0.6%, compared with 11.8% for baseline system. Moreover, nearly 60% subjects prefer to the speech synthesized with RC-Attention to that with Forward Attention, because the former has more natural rhythm.

artificial intelligence, information, machine learning, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ISCSLP57327.2022.10037822

2306.02593

Country: Asia > China (0.16)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback