Goto

Collaborating Authors

 speech evaluation


Automated evaluation of children's speech fluency for low-resource languages

arXiv.org Artificial Intelligence

Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low resource languages. This paper propose s a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extract ion stage, and a generative pre-trained transformer (GPT) netw ork. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT -based classifier guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's spee ch in two low-resource languages, Tamil and Malay and compare the classification performance against Random Forest and XG - Boost, as well as using ChatGPT -4o to predict fluency directl y from speech input. Results demonstrate that the proposed ap - proach achieves significantly higher accuracy than multimo dal GPT or other methods.


Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

arXiv.org Artificial Intelligence

Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs show diverse performance, with roughly half remain below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.


Semi-supervised Learning For Robust Speech Evaluation

arXiv.org Artificial Intelligence

Speech evaluation measures a learners oral proficiency using automatic models. Corpora for training such models often pose sparsity challenges given that there often is limited scored data from teachers, in addition to the score distribution across proficiency levels being often imbalanced among student cohorts. Automatic scoring is thus not robust when faced with under-represented samples or out-of-distribution samples, which inevitably exist in real-world deployment scenarios. This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization to approximate subjective evaluation criteria. In particular, normalized mutual information is used to quantify the speech characteristics from the learner and the reference. An anchor model is trained using pseudo labels to predict the correctness of pronunciation. An interpolated loss function is proposed to minimize not only the prediction error with respect to ground-truth scores but also the divergence between two probability distributions estimated by the speech evaluation model and the anchor model. Compared to other state-of-the-art methods on a public data-set, this approach not only achieves high performance while evaluating the entire test-set as a whole, but also brings the most evenly distributed prediction error across distinct proficiency levels. Furthermore, empirical results show the model accuracy on out-of-distribution data also compares favorably with competitive baselines.