Fang, Ying
CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR
Shao, Nian, Zhou, Rui, Wang, Pengyu, Li, Xian, Fang, Ying, Yang, Yujie, Li, Xiaofei
In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes the noisy and reverberant microphone recording as input and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can either be transformed to a speech waveform with a neural vocoder or used directly for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency-domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that the Mel-frequency representation of speech is more compact and thus easier to learn, which benefits both speech quality and ASR. Experimental results on four English datasets and one Chinese dataset demonstrate that the proposed model significantly improves both speech quality and ASR performance. Code and audio examples of our model are available online at https://audio.westlake.edu.cn/Research/CleanMel.html.
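To make the interleaved design concrete, below is a minimal PyTorch sketch (not the released CleanMel code): the narrow-band path processes each Mel band independently along time, while the cross-band path mixes all Mel bands within each frame. The specific layer types and sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One interleaved block: narrow-band (per Mel band, along time) followed by
    cross-band (per frame, across Mel bands) processing. Layer choices here are
    illustrative, not CleanMel's exact design."""
    def __init__(self, n_mels: int, hidden: int = 64):
        super().__init__()
        # Narrow-band: a recurrent layer applied to each Mel band independently.
        self.narrow = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.narrow_proj = nn.Linear(hidden, 1)
        # Cross-band: a per-frame MLP over the full Mel axis.
        self.cross = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(), nn.Linear(hidden, n_mels)
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels), e.g. a log Mel-spectrogram of the noisy input
        b, t, f = mel.shape
        # Narrow-band: fold Mel bands into the batch dimension, run along time.
        x = mel.permute(0, 2, 1).reshape(b * f, t, 1)
        x, _ = self.narrow(x)
        x = self.narrow_proj(x).reshape(b, f, t).permute(0, 2, 1)
        x = x + mel                      # residual connection
        # Cross-band: mix information across Mel bands within each frame.
        return x + self.cross(x)

# Stack a few interleaved blocks to map noisy Mel frames to enhanced ones.
model = nn.Sequential(*[InterleavedBlock(n_mels=80) for _ in range(4)])
enhanced = model(torch.randn(2, 100, 80))   # (batch, time, n_mels)
```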
VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification
Wang, Pengyu, Fang, Ying, Li, Xiaofei
Reverberant speech, i.e., the speech signal degraded by reverberation, contains crucial information about both the anechoic source speech and the room impulse response (RIR). This work proposes a variational Bayesian inference (VBI) framework with a neural speech prior (VINP) for joint speech dereverberation and blind RIR identification. In VINP, a probabilistic signal model is constructed in the time-frequency (T-F) domain based on the convolutive transfer function (CTF) approximation. For the first time, we propose using an arbitrary discriminative dereverberation deep neural network (DNN) to predict the prior distribution of anechoic speech within a probabilistic model. By integrating both the reverberant speech and the anechoic speech prior, VINP yields the maximum a posteriori (MAP) and maximum likelihood (ML) estimates of the anechoic speech spectrum and the CTF filter, respectively. After simple transformations, the waveforms of the anechoic speech and the RIR are estimated. Moreover, VINP is effective for automatic speech recognition (ASR) systems, which sets it apart from most deep learning (DL)-based single-channel dereverberation approaches. Experiments on single-channel speech dereverberation demonstrate that VINP attains advanced performance in most metrics related to human perception and unquestionably state-of-the-art (SOTA) performance in ASR-related metrics. For blind RIR identification, experiments indicate that VINP attains SOTA performance in blind estimation of the 60 dB reverberation time (RT60) and the direct-to-reverberation ratio (DRR). Code and audio samples are available online.
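For readers unfamiliar with the CTF approximation, the observation model and the two estimation problems can be summarized generically as follows; the notation and exact factorization are ours, not necessarily VINP's.

```latex
% Generic sketch of a CTF-based observation model (notation ours, not VINP's exact formulation):
% X_{t,f}: reverberant STFT, S_{t,f}: anechoic STFT, H_{l,f}: CTF filter taps, N_{t,f}: noise.
\begin{align}
  X_{t,f} &\approx \sum_{l=0}^{L-1} H_{l,f}\, S_{t-l,f} + N_{t,f}, \\
  \hat{S} &= \arg\max_{S}\; p(X \mid S, H)\, p_{\mathrm{DNN}}(S)
            \quad \text{(MAP, with the DNN-predicted speech prior)}, \\
  \hat{H} &= \arg\max_{H}\; p(X \mid \hat{S}, H)
            \quad \text{(ML estimate of the CTF filter)}.
\end{align}
```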
Mamba for Streaming ASR Combined with Unimodal Aggregation
Fang, Ying, Li, Xiaofei
This paper works on streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from linear complexity. We explore the efficiency of the Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging a controllable amount of future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and triggers token output in a streaming fashion, while aggregating feature frames to learn better token representations. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
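The abstract does not detail the lookahead mechanism; the sketch below only illustrates the generic idea of frame-level lookahead, in which each frame's input context includes a controllable number of future frames at the cost of the corresponding latency. The function name and implementation are hypothetical.

```python
import numpy as np

def lookahead_context(features: np.ndarray, lookahead: int = 2) -> np.ndarray:
    """Generic frame-level lookahead (illustrative, not the paper's exact mechanism):
    each output frame concatenates the current frame with `lookahead` future frames,
    so the streaming encoder trades `lookahead` frames of latency for future context."""
    T, D = features.shape
    padded = np.concatenate([features, np.zeros((lookahead, D))], axis=0)
    # Stack [x_t, x_{t+1}, ..., x_{t+lookahead}] for every frame t.
    return np.concatenate([padded[k:k + T] for k in range(lookahead + 1)], axis=1)

ctx = lookahead_context(np.random.randn(100, 80), lookahead=2)  # -> (100, 240)
```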
Evaluating the Design Features of an Intelligent Tutoring System for Advanced Mathematics Learning
Fang, Ying, He, Bo, Liu, Zhi, Liu, Sannyuya, Yan, Zhonghua, Sun, Jianwen
Xiaomai is an intelligent tutoring system (ITS) designed to help Chinese college students learn advanced mathematics and prepare for the graduate school math entrance exam. This study investigates two distinctive features within Xiaomai: the incorporation of free-response questions with automatic feedback and the metacognitive element of reflecting on self-made errors. An experiment was conducted to evaluate the impact of these features on mathematics learning. One hundred and twenty college students were recruited and randomly assigned to four conditions: (1) multiple-choice questions without reflection, (2) multiple-choice questions with reflection, (3) free-response questions without reflection, and (4) free-response questions with reflection. Students in the multiple-choice conditions demonstrated better practice performance and learning outcomes than their counterparts in the free-response conditions. Additionally, the incorporation of error reflection did not yield a significant impact on students' practice performance or learning outcomes. These findings indicate that the current design of free-response questions and the metacognitive feature of error reflection do not enhance the efficacy of the math ITS. This study highlights the need to redesign or enhance Xiaomai to optimize its effectiveness in facilitating advanced mathematics learning.
Unimodal Aggregation for CTC-based Speech Recognition
Fang, Ying, Li, Xiaofei
This paper works on non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) method is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. Then, the feature frames with unimodal weights are integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Compared to regular CTC, the proposed method learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity. Experiments on three Mandarin datasets show that UMA achieves performance superior or comparable to other advanced non-autoregressive methods, such as self-conditioned CTC. Moreover, by integrating self-conditioned CTC into the proposed framework, the performance can be noticeably improved further.
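The aggregation step can be pictured with the short sketch below: given encoder features and frame-wise weights, frames whose weights form one rise-then-fall (unimodal) pattern are merged into a single token-level feature by weighted averaging. The valley-based segmentation rule used here is an illustrative assumption rather than the paper's exact criterion.

```python
import numpy as np

def unimodal_aggregate(feats: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Sketch of unimodal aggregation (segmentation details are assumptions):
    frames whose weights form one rise-then-fall pattern are merged into a
    single token-level feature by weighted averaging."""
    # Segment boundaries: frames where the weight stops falling and starts rising.
    boundaries = [0]
    for t in range(1, len(weights) - 1):
        if weights[t] <= weights[t - 1] and weights[t] < weights[t + 1]:
            boundaries.append(t)
    boundaries.append(len(weights))
    segments = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        w = weights[s:e, None]
        segments.append((w * feats[s:e]).sum(axis=0) / (w.sum() + 1e-8))
    return np.stack(segments)            # (n_segments, D), n_segments <= T

T, D = 120, 256
tokens = unimodal_aggregate(np.random.randn(T, D), np.random.rand(T))
```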
Autoencoder Based Residual Deep Networks for Robust Regression Prediction and Spatiotemporal Estimation
Li, Lianfa, Fang, Ying, Wu, Jun, Wang, Jinfeng
To generalize well, a deep neural network often requires a large training sample. Moreover, as hidden layers are added to increase learning capacity, accuracy can degrade. Both issues could seriously limit the applicability of deep learning in domains that involve predicting continuous variables from a small number of samples. Inspired by residual convolutional neural networks in computer vision and recent neuroscience findings of crucial shortcut connections in the brain, we propose an autoencoder-based residual deep network for robust prediction. In a nested way, we leverage shortcut connections to implement residual mapping with a balanced structure for efficient propagation of error signals. The method is demonstrated on multiple datasets, including imputation of non-randomly missing aerosol optical depth values at high spatiotemporal resolution and spatiotemporal estimation of fine particulate matter smaller than 2.5 μm (PM2.5), achieving cutting-edge accuracy and efficiency. Our approach is also a general-purpose regression learner applicable in diverse domains.
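The following is a minimal PyTorch sketch of an autoencoder-style regressor with nested shortcut connections, in the spirit of the description above; the layer sizes, activations, and shortcut placement are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualAutoencoderRegressor(nn.Module):
    """Illustrative sketch (layer sizes and shortcut placement are assumptions):
    an encoder-decoder MLP where each decoder layer adds the output of its
    mirrored encoder layer, giving nested residual shortcuts for error signals."""
    def __init__(self, in_dim: int, dims=(128, 64, 32)):
        super().__init__()
        enc_dims = [in_dim, *dims]
        self.enc = nn.ModuleList(
            nn.Linear(enc_dims[i], enc_dims[i + 1]) for i in range(len(dims))
        )
        dec_dims = [*dims[::-1], in_dim]
        self.dec = nn.ModuleList(
            nn.Linear(dec_dims[i], dec_dims[i + 1]) for i in range(len(dims))
        )
        self.out = nn.Linear(in_dim, 1)   # continuous regression target

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, skips = x, [x]
        for layer in self.enc:
            h = torch.relu(layer(h))
            skips.append(h)
        # skips holds the input and every encoder output; drop the bottleneck
        # and mirror the rest onto the decoder layers.
        for layer, skip in zip(self.dec, reversed(skips[:-1])):
            h = torch.relu(layer(h)) + skip   # nested residual shortcut
        return self.out(h)

model = ResidualAutoencoderRegressor(in_dim=20)
pred = model(torch.randn(8, 20))   # -> (8, 1) predictions of a continuous variable
```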