standard test
BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings
Charlot, Théo, Kunze, Tarek, Poli, Maxime, Cristia, Alejandrina, Dupoux, Emmanuel, Lavechin, Marvin
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation model trained on 13,000 hours of multilingual child-centered long-form recordings spanning over 40 languages. We evaluate BabyHuBERT on speaker segmentation, identifying when target children speak versus female adults, male adults, or other children -- a fundamental preprocessing step for analyzing naturalistic language experiences. BabyHuBERT achieves F1-scores from 52.1% to 74.4% across six diverse datasets, consistently outperforming W2V2-LL4300 (trained on English long-forms) and standard HuBERT (trained on clean adult speech). Notable improvements include 13.2 absolute F1 points over HuBERT on Vanuatu and 15.9 points on Solomon Islands corpora, demonstrating effectiveness on underrepresented languages. By sharing code and models, BabyHuBERT serves as a foundation model for child speech research, enabling fine-tuning on diverse downstream tasks.
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Yuan, Tongxin, He, Zhiwei, Dong, Lingzhong, Wang, Yiming, Zhao, Ruijie, Xia, Tian, Xu, Lizhen, Zhou, Binglin, Li, Fangqi, Zhang, Zhuosheng, Wang, Rui, Liu, Gongshen
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Rather than centering on the safety of LLM-generated content, as most prior studies do, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging safety risks given agent interaction records. R-Judge comprises 162 agent interaction records, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. It incorporates human consensus on safety with annotated safety risk labels and high-quality risk descriptions. Utilizing R-Judge, we conduct a comprehensive evaluation of 8 prominent LLMs commonly employed as the backbone for agents. The best-performing model, GPT-4, achieves 72.29% in contrast to the human score of 89.38%, showing considerable room for enhancing the risk awareness of LLMs. Notably, leveraging risk descriptions as environment feedback significantly improves model performance, revealing the importance of salient safety risk feedback. Furthermore, we design an effective chain of safety analysis technique to help the judgment of safety risks and conduct an in-depth case study to facilitate future research. R-Judge is publicly available at https://github.com/Lordog/R-Judge.
Artificial intelligence 'better at diagnosing heart failure' than standard test
Dr Ken Lee, cardiology specialist registrar and clinical lecturer at Edinburgh University, said: "Heart failure can be a very challenging diagnosis to make in practice. We have shown that CoDE-HF, our decision-support tool, can substantially improve the accuracy of diagnosing heart failure compared to current blood tests." Previous research has shown that patients who are diagnosed quickly benefit the most from treatment. Acute heart failure affects nearly one million people in the UK and accounts for five per cent of all unplanned hospital admissions. The prevalence is projected to rise by approximately 50% over the next 25 years owing to the ageing population. It is a sudden, life-threatening condition caused when the heart is suddenly unable to pump enough oxygen-rich blood around the body to meet its needs. It can be brought on by coronary heart disease – where the arteries become blocked, limiting blood flow – or by other ongoing conditions such as diabetes which damage cardiac ...
IBM's latest AI predicts Alzheimer's better than standard tests
IBM has developed a new AI model which predicts the onset of Alzheimer's better than standard clinical tests. The AI is designed to be non-invasive and uses a short language sample from a verbal cognitive test given to a patient. Using this sample, the AI model is able to predict the onset of Alzheimer's with around 71 percent accuracy. For comparison, standard clinical tests are correct approximately 59 percent of the time and take much longer to diagnose. Current tests analyse the descriptive abilities of people as they age for potential warning signs.
Testing for Normality with Neural Networks
In this paper, we treat the problem of testing for normality as a binary classification problem and construct a feedforward neural network that can successfully detect normal distributions by inspecting small samples from them. The numerical experiments conducted on small samples with no more than 100 elements indicated that the neural network which we trained was more accurate and far more powerful than the most frequently used and most powerful standard tests of normality: Shapiro-Wilk, Anderson-Darling, Lilliefors and Jarque-Bera, as well as the kernel tests of goodness-of-fit. The neural network had an AUROC score of almost 1, which corresponds to the perfect binary classifier. Additionally, the network's accuracy was higher than 96% on a set of larger samples with 250-1000 elements. Since the normality of data is an assumption of numerous techniques for analysis and inference, the neural network constructed in this study has a very high potential for use in everyday practice of statistics, data analysis and machine learning in both science and industry.
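For context on the baselines named above, here is a minimal sketch of one of the standard tests, Jarque-Bera, written in plain Python. It is not the paper's network or its experimental code. The statistic is JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis, and the p-value uses the asymptotic chi-square distribution with 2 degrees of freedom, which is only a rough approximation at the small sample sizes the abstract discusses. The function name `jarque_bera` and the example data are illustrative assumptions.

```python
import math


def jarque_bera(sample):
    """Return (JB statistic, asymptotic p-value) for a list of numbers.

    Assumes the sample has at least two distinct values, so the
    variance is non-zero.
    """
    n = len(sample)
    mean = sum(sample) / n
    # Central moments (biased, i.e. divided by n), as used in the JB statistic.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical, strongly right-skewed sample: nine zeros and one large value.
skewed = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]
jb_stat, p = jarque_bera(skewed)
print(f"JB = {jb_stat:.2f}, asymptotic p = {p:.2g}")
```

A large statistic (small p-value) counts as evidence against normality. Casting the problem as binary classification, as the abstract describes, replaces this fixed formula with a decision rule learned from samples.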
Question Difficulty Prediction for READING Problems in Standard Tests
Huang, Zhenya (University of Science and Technology of China) | Liu, Qi (University of Science and Technology of China) | Chen, Enhong (University of Science and Technology of China) | Zhao, Hongke (University of Science and Technology of China) | Gao, Mingyong (iFLYTEK Co., Ltd.) | Wei, Si (iFLYTEK Co., Ltd.) | Su, Yu (Anhui University) | Hu, Guoping (iFLYTEK Co., Ltd.)
Standard tests aim to evaluate the performance of examinees using different tests with consistent difficulties. Thus, a critical demand is to predict the difficulty of each test question before the test is conducted. Existing studies are usually based on the judgments of education experts (e.g., teachers), which may be subjective and labor intensive. In this paper, we propose a novel Test-aware Attention-based Convolutional Neural Network (TACNN) framework to automatically solve this Question Difficulty Prediction (QDP) task for READING problems (a typical problem style in English tests) in standard tests. Specifically, given the abundant historical test logs and text materials of questions, we first design a CNN-based architecture to extract sentence representations for the questions. Then, we utilize an attention strategy to quantify the difficulty contribution of each sentence to questions. Considering the incomparability of question difficulties in different tests, we propose a test-dependent pairwise strategy for training TACNN and generating the difficulty prediction value. Extensive experiments on a real-world dataset not only show the effectiveness of TACNN, but also give interpretable insights to track the attention information for questions.
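The test-dependent pairwise idea can be sketched as follows. The abstract does not give the loss function, so the squared-gap form below is an assumption, as are the field names `test_id` and `difficulty` and the function names. The point illustrated is only that question pairs are formed within a single test, so difficulties observed in different tests are never compared directly.

```python
from itertools import combinations


def same_test_pairs(questions):
    """Yield (harder, easier) question pairs drawn from the same test only."""
    by_test = {}
    for q in questions:
        by_test.setdefault(q["test_id"], []).append(q)
    for group in by_test.values():
        for a, b in combinations(group, 2):
            if a["difficulty"] == b["difficulty"]:
                continue  # ties carry no ordering information
            yield (a, b) if a["difficulty"] > b["difficulty"] else (b, a)


def pairwise_loss(questions, predict):
    """Mean squared error between observed and predicted difficulty gaps.

    `predict` is any callable mapping a question to a score; in the paper
    this role is played by the TACNN model, here it is a placeholder.
    """
    total, count = 0.0, 0
    for harder, easier in same_test_pairs(questions):
        observed_gap = harder["difficulty"] - easier["difficulty"]
        predicted_gap = predict(harder) - predict(easier)
        total += (observed_gap - predicted_gap) ** 2
        count += 1
    return total / count if count else 0.0


# Hypothetical data: two questions from test "A", one from test "B".
questions = [
    {"test_id": "A", "difficulty": 0.8},
    {"test_id": "A", "difficulty": 0.3},
    {"test_id": "B", "difficulty": 0.5},
]
print(pairwise_loss(questions, lambda q: 0.0))
```

Only the two questions from test "A" form a pair here; the question from test "B" has no same-test partner and contributes nothing to the loss.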
AI scores higher than the average person on standard test
Artificial intelligence can now outperform humans on a standard intelligence test. A new computational model scores within the 75th percentile, better than the average person, on a test known as Raven's Progressive Matrices. Researchers say this demonstrates that it can take on abstract visual reasoning tasks, and is a major step toward AI that can see and understand the world the way we do. Using Raven's Progressive Matrices, a nonverbal standardized test that measures abstract reasoning, the team found that their model is not only on par with humans, but performs better than many.
Validation of nonlinear PCA
Linear principal component analysis (PCA) can be extended to a nonlinear PCA by using artificial neural networks. But the benefit of curved components requires a careful control of the model complexity. Moreover, standard techniques for model selection, including cross-validation and more generally the use of an independent test set, fail when applied to nonlinear PCA because of its inherent unsupervised characteristics. This paper presents a new approach for validating the complexity of nonlinear PCA models by using the error in missing data estimation as a criterion for model selection. It is motivated by the idea that only the model of optimal complexity is able to predict missing values with the highest accuracy. While standard test set validation usually favours over-fitted nonlinear PCA models, the proposed model validation approach correctly selects the optimal model complexity.