srcc
Towards SISO Bistatic Sensing for ISAC
Wang, Zhongqin, Zhang, J. Andrew, Wu, Kai, Xu, Min, Guo, Y. Jay
Integrated Sensing and Communication (ISAC) is a key enabler for next-generation wireless systems. However, real-world deployment is often limited to low-cost, single-antenna transceivers. In such a bistatic Single-Input Single-Output (SISO) setup, clock asynchrony introduces random phase offsets in Channel State Information (CSI), which cannot be mitigated using conventional multi-antenna methods. This work proposes WiDFS 3.0, a lightweight bistatic SISO sensing framework that enables accurate delay and Doppler estimation from distorted CSI by effectively suppressing Doppler mirroring ambiguity. It operates with only a single antenna at both the transmitter and receiver, making it suitable for low-complexity deployments. We propose a self-referencing cross-correlation (SRCC) method for SISO random phase removal and employ delay-domain beamforming to resolve Doppler ambiguity. The resulting unambiguous delay-Doppler-time features enable robust sensing with compact neural networks. Extensive experiments show that WiDFS 3.0 achieves accurate parameter estimation, with performance comparable to or even surpassing that of prior multi-antenna methods, especially in delay estimation. Validated under single- and multi-target scenarios, the extracted ambiguity-resolved features show strong sensing accuracy and generalization. For example, when deployed on the embedded-friendly MobileViT-XXS with only 1.3M parameters, WiDFS 3.0 consistently outperforms conventional features such as CSI amplitude, mirrored Doppler, and multi-receiver aggregated Doppler.
WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction
Emon, Jakaria Islam, Alam, Kazi Tamanna, Salek, Md. Abu
Mean Opinion Score (MOS) prediction for text-to-music systems requires evaluating both overall musical quality (OMQ) and text-prompt alignment (TA). This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence-level co-attention and optimal transport regularization. WhisQ employs the Whisper-Base pretrained model for temporal audio encoding and Qwen-3, a 0.6B Small Language Model (SLM), for text encoding, with both maintaining sequence structure for fine-grained cross-modal modeling. The architecture features specialized prediction pathways: OMQ is predicted from pooled audio embeddings, while TA leverages bidirectional sequence co-attention between audio and text. A Sinkhorn optimal transport loss further enforces semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over the baseline: a 7% improvement in Spearman correlation for OMQ and 14% for TA. Ablation studies reveal that optimal transport regularization provides the largest performance gain (10% SRCC improvement), demonstrating the importance of explicit cross-modal alignment for text-to-music evaluation.
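The abstract names a Sinkhorn optimal transport loss but does not say how it is built. As a generic illustration only, the sketch below runs plain Sinkhorn iterations on a small cost matrix with uniform marginals. The function names, the `epsilon` value, the iteration count and the toy cost matrix are assumptions made for this example and are not taken from WhisQ.

```python
import math


def sinkhorn_plan(cost, epsilon=0.1, iterations=200):
    """Entropic-regularized transport plan between two uniform marginals.

    `cost[i][j]` is the cost of moving mass from source item i to target
    item j (for example, a distance between an audio and a text embedding).
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform source marginal (assumption)
    b = [1.0 / m] * m  # uniform target marginal (assumption)
    # Gibbs kernel; a very small epsilon can underflow to zero here.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(cost, plan):
    """Total cost of a plan: the quantity a Sinkhorn-style loss would penalize."""
    return sum(c * p for c_row, p_row in zip(cost, plan) for c, p in zip(c_row, p_row))


# Toy cost matrix: matching items (the diagonal) are cheap to align.
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
toy_plan = sinkhorn_plan(toy_cost)
toy_loss = transport_cost(toy_cost, toy_plan)
```

With this toy matrix the plan concentrates almost all mass on the diagonal, so the transport cost is close to zero. A loss of this kind is small when items that should match are already close in the shared space.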
Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
Zezario, Ryandhimas E., Fu, Szu-Wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, Tsao, Yu
In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which [...]. More specifically, MOSA-Net is designed to estimate the speech [...] LCC by 0.021 (0.985 vs 0.964 in seen noise environments) [...]. For example, QIA-SE can improve PESQ by 0.301 ([...] 2.478 in unseen noise environments) over a CNN-based baseline SE model. Index Terms: non-intrusive speech assessment models, deep learning, multi-objective learning, speech enhancement. Speech assessment metrics are indicators that quantitatively measure the specific attributes of speech signals. [...] This testing strategy is prohibitive and may not always be feasible. Hence, several objective evaluation metrics have been developed as surrogates for human listening tests [6]-[31]. [...] To attain a higher assessment accuracy, the MBNet adopts the BiasNet architecture to compensate for the biased scores of a certain judge [49]. In addition, the multi-task learning criterion [...]. Meanwhile, different acoustic features are used as input to the assessment model to consider information from different acoustic domains [51], [52]. [...] comprises two stages. The first stage includes a series of signal processing units designed to convert speech waveforms [...]. Ryandhimas E. Zezario is with the Department of Computer Science and [...]. Fei Chen is with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology of China, Shenzhen, China. Hsin-Min Wang is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan.
Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction
Ravuri, Aditya, Cooper, Erica, Yamagishi, Junichi
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings where extensive MOS data from large-scale listening tests may be unavailable. We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning (SSL) models, such as wav2vec, correlate with MOS scores. These findings are based on data from the 2022 and 2023 VoiceMOS challenges. We explore the extent of this correlation across different models and language contexts, revealing insights into how inherent uncertainties in SSL models can serve as effective proxies for audio quality. We are particularly inspired by approaches in biology where zero-shot prediction is possible using a model's uncertainty estimates, where uncertainties act as proxies for downstream tasks [4]. Our main hypotheses are that (1) uncertainty estimates can be derived from the outputs of SSL models such as wav2vec, and that (2) these uncertainties can be used as proxies to MOS scores, as high model uncertainty around the contents of an audio sequence must correspond to low audio quality.
SFCNeXt: a simple fully convolutional network for effective brain age estimation with small sample size
Fu, Yu, Huang, Yanyan, Dong, Shunjie, Wang, Yalin, Yu, Tianbai, Niu, Meng, Zhuo, Cheng
Deep neural networks (DNN) have been designed to predict the chronological age of a healthy brain from T1-weighted magnetic resonance images (T1 MRIs), and the predicted brain age could serve as a valuable biomarker for the early detection of development-related or aging-related disorders. Recent DNN models for brain age estimations usually rely too much on large sample sizes and complex network structures for multi-stage feature refinement. However, in clinical application scenarios, researchers usually cannot obtain thousands or tens of thousands of MRIs in each data center for thorough training of these complex models. This paper proposes a simple fully convolutional network (SFCNeXt) for brain age estimation in small-sized cohorts with biased age distributions. The SFCNeXt consists of Single Pathway Encoded ConvNeXt (SPEC) and Hybrid Ranking Loss (HRL), aiming to estimate brain ages in a lightweight way with a sufficient exploration of MRI, age, and ranking features of each batch of subjects. Experimental results demonstrate the superiority and efficiency of our approach.
ElectrodeNet -- A Deep Learning Based Sound Coding Strategy for Cochlear Implants
Huang, Enoch Hsin-Ho, Chao, Rong, Tsao, Yu, Wu, Chao-Min
ElectrodeNet, a deep learning based sound coding strategy for the cochlear implant (CI), is proposed to emulate the advanced combination encoder (ACE) strategy by replacing the conventional envelope detection using various artificial neural networks. The extended ElectrodeNet-CS strategy further incorporates the channel selection (CS). Network models of deep neural network (DNN), convolutional neural network (CNN), and long short-term memory (LSTM) were trained using the Fast Fourier Transformed bins and channel envelopes obtained from the processing of clean speech by the ACE strategy. Objective speech understanding using short-time objective intelligibility (STOI) and normalized covariance metric (NCM) was estimated for ElectrodeNet using CI simulations. Sentence recognition tests for vocoded Mandarin speech were conducted with normal-hearing listeners. DNN, CNN, and LSTM based ElectrodeNets exhibited strong correlations to ACE in objective and subjective scores using mean squared error (MSE), linear correlation coefficient (LCC) and Spearman's rank correlation coefficient (SRCC). The ElectrodeNet-CS strategy was capable of producing N-of-M compatible electrode patterns using a modified DNN to embed maxima selection, and performed similarly to, or even slightly better on average than, ACE in STOI and sentence recognition. The methods and findings demonstrated the feasibility and potential of using deep learning in CI coding strategies.
AIO-P: Expanding Neural Performance Predictors Beyond Image Classification
Mills, Keith G., Niu, Di, Salameh, Mohammad, Qiu, Weichen, Han, Fred X., Liu, Puyuan, Zhang, Jialin, Lu, Wei, Jui, Shangling
Evaluating neural network performance is critical to deep neural network design but a costly procedure. Neural predictors provide an efficient solution by treating architectures as samples and learning to estimate their performance on a given task. However, existing predictors are task-dependent, predominantly estimating neural network performance on image classification benchmarks. They are also search-space dependent; each predictor is designed to make predictions for a specific architecture search space with predefined topologies and set of operations. In this paper, we propose a novel All-in-One Predictor (AIO-P), which aims to pretrain neural predictors on architecture examples from multiple, separate computer vision (CV) task domains and multiple architecture spaces, and then transfer to unseen downstream CV tasks or neural architectures. We describe our proposed techniques for general graph representation, efficient predictor pretraining and knowledge infusion techniques, as well as methods to transfer to downstream tasks/spaces. Extensive experimental results show that AIO-P can achieve Mean Absolute Error (MAE) and Spearman's Rank Correlation (SRCC) below 1% and above 0.5, respectively, on a breadth of target downstream CV tasks with or without fine-tuning, outperforming a number of baselines. Moreover, AIO-P can directly transfer to new architectures not seen during training, accurately rank them and serve as an effective performance estimator when paired with an algorithm designed to preserve performance while reducing FLOPs.
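The two metrics quoted in the abstract, MAE and SRCC, can be illustrated with a short sketch. It computes both for a predictor's output against measured performance. The accuracy values are hypothetical and not from the paper, and the closed-form Spearman formula used here assumes no tied values.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and measured performance."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def ranks(values):
    """Ranks 1..n in ascending order; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracies (in %) for five architectures; not from the paper.
actual = [71.2, 74.8, 69.5, 76.1, 73.0]
predicted = [71.9, 74.1, 70.3, 75.2, 74.5]

mae = mean_absolute_error(predicted, actual)  # about 0.92 percentage points
srcc = spearman_no_ties(predicted, actual)    # about 0.9: one adjacent pair is swapped
```

The two numbers answer different questions. MAE measures how far each prediction is from the measured value, while SRCC measures only whether the predictor orders the architectures correctly, which is what matters when ranking candidates in a search.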
HASA-net: A non-intrusive hearing-aid speech assessment network
Chiang, Hsin-Tien, Wu, Yi-Chiao, Yu, Cheng, Toda, Tomoki, Wang, Hsin-Min, Hu, Yih-Chun, Tsao, Yu
Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and confirmed to provide promising performance. However, most DNN-based approaches are designed for normal-hearing listeners without considering hearing-loss factors. In this study, we propose a DNN-based hearing aid speech assessment network (HASA-Net), formed by a bidirectional long short-term memory (BLSTM) model, to predict speech quality and intelligibility scores simultaneously according to input speech signals and specified hearing-loss patterns. To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids. Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated with two well-known intrusive hearing-aid evaluation metrics, hearing aid speech quality index (HASQI) and hearing aid speech perception index (HASPI), respectively.
Are We Making Real Progress in Simulated Environments? Measuring the Sim2Real Gap in Embodied Visual Navigation
Kadian, Abhishek, Truong, Joanne, Gokaslan, Aaron, Clegg, Alexander, Wijmans, Erik, Lee, Stefan, Savva, Manolis, Chernova, Sonia, Batra, Dhruv
Does progress in simulation translate to progress in robotics? Specifically, if method A outperforms method B in simulation, how likely is the trend to hold in reality on a robot? We examine this question for embodied (PointGoal) navigation, developing engineering tools and a research paradigm for evaluating a simulator by its sim2real predictivity, revealing surprising findings about prior work. First, we develop Habitat-PyRobot Bridge (HaPy), a library for seamless execution of identical code on a simulated agent and a physical robot. Habitat-to-Locobot transfer with HaPy involves just one line change in config, essentially treating reality as just another simulator! Second, we investigate sim2real predictivity of Habitat-Sim for PointGoal navigation. We 3D-scan a physical lab space to create a virtualized replica, and run parallel tests of 9 different models in reality and simulation. We present a new metric called Sim-vs-Real Correlation Coefficient (SRCC) to quantify sim2real predictivity. Our analysis reveals several important findings. We find that SRCC for Habitat as used for the CVPR19 challenge is low (0.18 for the success metric), which suggests that performance improvements for this simulator-based challenge would not transfer well to a physical robot. We find that this gap is largely due to AI agents learning to 'cheat' by exploiting simulator imperfections: specifically, the way Habitat allows for 'sliding' along walls on collision. Essentially, the virtual robot is capable of cutting corners, leading to unrealistic shortcuts through non-navigable spaces. Naturally, such exploits do not work in the real world where the robot stops on contact with walls. Our experiments show that it is possible to optimize simulation parameters to enable robots trained in imperfect simulators to generalize learned skills to reality (e.g. improving $SRCC_{Succ}$ from 0.18 to 0.844).
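The abstract does not spell out how the Sim-vs-Real Correlation Coefficient is computed. The sketch below assumes it is a Pearson correlation over paired per-model scores, and adds a pairwise ordering check that mirrors the opening question of the abstract (if A beats B in simulation, does A also beat B in reality?). All function names and scores are hypothetical.

```python
from math import sqrt


def sim_vs_real_correlation(sim_scores, real_scores):
    """Pearson correlation between paired simulation and real-world scores.

    Assumption: one score per model, listed in the same model order.
    """
    n = len(sim_scores)
    mean_sim = sum(sim_scores) / n
    mean_real = sum(real_scores) / n
    cov = sum((s - mean_sim) * (r - mean_real) for s, r in zip(sim_scores, real_scores))
    var_sim = sum((s - mean_sim) ** 2 for s in sim_scores)
    var_real = sum((r - mean_real) ** 2 for r in real_scores)
    if var_sim == 0 or var_real == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_sim * var_real)


def pairwise_agreement(sim_scores, real_scores):
    """Fraction of model pairs ranked the same way in simulation and reality."""
    n = len(sim_scores)
    agree = 0
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            # Same sign of the two differences means the ordering carried over.
            if (sim_scores[i] - sim_scores[j]) * (real_scores[i] - real_scores[j]) > 0:
                agree += 1
    return agree / total


# Hypothetical success rates for three models; not from the paper.
sim_success = [0.9, 0.8, 0.5]
real_success = [0.6, 0.7, 0.2]
agreement = pairwise_agreement(sim_success, real_success)  # 2 of 3 pairs agree
```

A correlation near 1 would mean improvements measured in simulation carry over to the robot. A value near 0, such as the 0.18 reported above, would mean simulation results say little about real-world ranking.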
A Measure of Similarity in Textual Data Using Spearman's Rank Correlation Coefficient
Arsov, Nino, Dukovski, Milan, Evkoski, Blagoja, Cvetkovski, Stefan
In the last decade, many diverse advances have occurred in the field of information extraction from data. Information extraction in its simplest form takes place in computing environments, where structured data can be extracted through a series of queries. The continuous expansion of quantities of data has therefore provided an opportunity for knowledge extraction (KE) from a textual document (TD). A typical problem of this kind is the extraction of common characteristics and knowledge from a group of TDs, with the possibility to group such similar TDs in a process known as clustering. In this paper we present a technique for such KE among a group of TDs related to the common characteristics and meaning of their content. Our technique is based on Spearman's Rank Correlation Coefficient (SRCC), which the conducted experiments show to be a comprehensive measure for achieving high-quality KE.
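To make the idea concrete, here is a minimal sketch of Spearman's rank correlation used as a similarity score between two texts. It assumes each document is represented by its word counts over the shared vocabulary, which is an illustrative choice and not necessarily the representation used in the paper. Ties receive average ranks.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def text_similarity(doc_a, doc_b):
    """SRCC between the word-count vectors of two whitespace-tokenized texts."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return spearman([counts_a[w] for w in vocab], [counts_b[w] for w in vocab])
```

Under this representation, two documents that use the same words with the same relative frequency order score close to 1, and documents with opposite usage patterns score close to -1. A document in which every vocabulary word occurs equally often has no rank variation, so the score is undefined and the sketch raises an error.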