Goto

Collaborating Authors

 Performance Analysis


Partial Identification Approach to Counterfactual Fairness Assessment

arXiv.org Artificial Intelligence

The wide adoption of AI decision-making systems in critical domains such as criminal justice, loan approval, and hiring processes has heightened concerns about algorithmic fairness. As we often only have access to the output of algorithms without insights into their internal mechanisms, it was natural to examine how decisions would alter when auxiliary sensitive attributes (such as race) change. This led the research community to come up with counterfactual fairness measures, but how to evaluate the measure from available data remains a challenging task. In many practical applications, the target counterfactual measure is not identifiable, i.e., it cannot be uniquely determined from the combination of quantitative data and qualitative knowledge. This paper addresses this challenge using partial identification, which derives informative bounds over counterfactual fairness measures from observational data. We introduce a Bayesian approach to bound unknown counterfactual fairness measures with high confidence. We demonstrate our algorithm on the COMPAS dataset, examining fairness in recidivism risk scores with respect to race, age, and sex. Our results reveal a positive (spurious) effect on the COMPAS score when changing race to African-American (from all others) and a negative (direct causal) effect when transitioning from young to old age.


Optimizing What Matters: AUC-Driven Learning for Robust Neural Retrieval

arXiv.org Artificial Intelligence

Dual-encoder retrievers depend on the principle that relevant documents should score higher than irrelevant ones for a given query. Yet the dominant Noise Contrastive Estimation (NCE) objective, which underpins Contrastive Loss, optimizes a softened ranking surrogate that we rigorously prove is fundamentally oblivious to score separation quality and unrelated to AUC. This mismatch leads to poor calibration and suboptimal performance in downstream tasks like retrieval-augmented generation (RAG). To address this fundamental limitation, we introduce the MW loss, a new training objective that maximizes the Mann-Whitney U statistic, which is mathematically equivalent to the Area under the ROC Curve (AUC). MW loss encourages each positive-negative pair to be correctly ranked by minimizing binary cross entropy over score differences. We provide theoretical guarantees that MW loss directly upper-bounds the AoC, better aligning optimization with retrieval goals. We further promote ROC curves and AUC as natural threshold free diagnostics for evaluating retriever calibration and ranking quality. Empirically, retrievers trained with MW loss consistently outperform contrastive counterparts in AUC and standard retrieval metrics. Our experiments show that MW loss is an empirically superior alternative to Contrastive Loss, yielding better-calibrated and more discriminative retrievers for high-stakes applications like RAG.


Revealing the temporal dynamics of antibiotic anomalies in the infant gut microbiome with neural jump ODEs

arXiv.org Artificial Intelligence

Detecting anomalies in irregularly sampled multi-variate time-series is challenging, especially in data-scarce settings. Here we introduce an anomaly detection framework for irregularly sampled time-series that leverages neural jump ordinary differential equations (NJODEs). The method infers conditional mean and variance trajectories in a fully path dependent way and computes anomaly scores. On synthetic data containing jump, drift, diffusion, and noise anomalies, the framework accurately identifies diverse deviations. Applied to infant gut microbiome trajectories, it delineates the magnitude and persistence of antibiotic-induced disruptions: revealing prolonged anomalies after second antibiotic courses, extended duration treatments, and exposures during the second year of life. We further demonstrate the predictive capabilities of the inferred anomaly scores in accurately predicting antibiotic events and outperforming diversity-based baselines. Our approach accommodates unevenly spaced longitudinal observations, adjusts for static and dynamic covariates, and provides a foundation for inferring microbial anomalies induced by perturbations, offering a translational opportunity to optimize intervention regimens by minimizing microbial disruptions.


Survey of AI-Powered Approaches for Osteoporosis Diagnosis in Medical Imaging

arXiv.org Artificial Intelligence

Osteoporosis silently erodes skeletal integrity worldwide; however, early detection through imaging can prevent most fragility fractures. Artificial intelligence (AI) methods now mine routine Dual-energy X-ray Absorptiometry (DXA), X-ray, Computed Tomography (CT), and Magnetic Resonance Imaging (MRI) scans for subtle, clinically actionable markers, but the literature is fragmented. This survey unifies the field through a tri-axial framework that couples imaging modalities with clinical tasks and AI methodologies (classical machine learning, convolutional neural networks (CNNs), transformers, self-supervised learning, and explainable AI). Following a concise clinical and technical primer, we detail our Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-guided search strategy, introduce the taxonomy via a roadmap figure, and synthesize cross-study insights on data scarcity, external validation, and interpretability. By identifying emerging trends, open challenges, and actionable research directions, this review provides AI scientists, medical imaging researchers, and musculoskeletal clinicians with a clear compass to accelerate rigorous, patient-centered innovation in osteoporosis care. The project page of this survey can also be found on Github.


A Recall-First CNN for Sleep Apnea Screening from Snoring Audio

arXiv.org Artificial Intelligence

Sleep apnea is a serious sleep-related breathing disorder that is common and can impact health if left untreated. Currently the traditional method for screening and diagnosis is overnight polysomnography. Polysomnography is expensive and takes a lot of time, and is not practical for screening large groups of people. In this paper, we explored a more accessible option, using respiratory audio recordings to spot signs of apnea.We utilized 18 audio files.The approach involved converting breathing sounds into spectrograms, balancing the dataset by oversampling apnea segments, and applying class weights to reduce bias toward the majority class. The model reached a recall of 90.55 for apnea detection. Intentionally, prioritizing catching apnea events over general accuracy. Despite low precision,the high recall suggests potential as a low-cost screening tool that could be used at home or in basic clinical setups, potentially helping identify at-risk individuals much earlier.


WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities

arXiv.org Artificial Intelligence

Electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach for analyzing brain signals. However, the complex nature of brain activity introduces critical challenges: EEG signals simultaneously encode both cognitive processes and intrinsic neural states, creating a mismatch in EEG paired-data modality that hinders effective cross-modal representation learning. Through a pivot investigation, we uncover complementary relationships between these modalities. Leveraging this insight, we propose mapping EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. To fully enable conversational capabilities, we further introduce WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations across four downstream tasks, thereby offering valuable insights for both neuroscience research and the development of general-purpose EEG models.


Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

arXiv.org Artificial Intelligence

Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO and RUOD datasets to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.


How Safe Will I Be Given What I Saw? Calibrated Prediction of Safety Chances for Image-Controlled Autonomy

arXiv.org Artificial Intelligence

Autonomous robots that rely on deep neural network controllers pose critical challenges for safety prediction, especially under partial observability and distribution shift. Traditional model-based verification techniques are limited in scalability and require access to low-dimensional state models, while model-free methods often lack reliability guarantees. This paper addresses these limitations by introducing a framework for calibrated safety prediction in end-to-end vision-controlled systems, where neither the state-transition model nor the observation model is accessible. Building on the foundation of world models, we leverage variational autoencoders and recurrent predictors to forecast future latent trajectories from raw image sequences and estimate the probability of satisfying safety properties. We distinguish between monolithic and composite prediction pipelines and introduce a calibration mechanism to quantify prediction confidence. In long-horizon predictions from high-dimensional observations, the forecasted inputs to the safety evaluator can deviate significantly from the training distribution due to compounding prediction errors and changing environmental conditions, leading to miscalibrated risk estimates. To address this, we incorporate unsupervised domain adaptation to ensure robustness of safety evaluation under distribution shift in predictions without requiring manual labels. Our formulation provides theoretical calibration guarantees and supports practical evaluation across long prediction horizons. Experimental results on three benchmarks show that our UDA-equipped evaluators maintain high accuracy and substantially lower false positive rates under distribution shift. Similarly, world model-based composite predictors outperform their monolithic counterparts on long-horizon tasks, and our conformal calibration provides reliable statistical bounds.


Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review

arXiv.org Artificial Intelligence

ASR has achieved remarkable global progress, yet African low-resource languages remain rigorously underrepresented, producing barriers to digital inclusion across the continent with more than +2000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models or metrics for African languages, while excluding non-African, duplicates, and low-quality studies (score <3/5). We screen 71 out of 2,062 records and we record a total of 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of research provided reproducible materials, and dataset licensing is not clear. Self-supervised and transfer learning techniques are promising, but are hindered by limited pre-training data, inadequate coverage of dialects, and the availability of resources. Most of the researchers use Word Error Rate (WER), with very minimal use of linguistically informed scores such as Character Error Rate (CER) or Diacritic Error Rate (DER), and thus with limited application in tonal and morphologically rich languages. The existing evidence on ASR systems is inconsistent, hindered by issues like dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advancements indicates a pathway for improvement. Sustainable development for this area will also include stakeholder partnership, creation of ethically well-balanced datasets, use of lightweight modelling techniques, and active benchmarking.


Breaking the Euclidean Barrier: Hyperboloid-Based Biological Sequence Analysis

arXiv.org Artificial Intelligence

Genomic sequence analysis plays a crucial role in various scientific and medical domains. Traditional machine-learning approaches often struggle to capture the complex relationships and hierarchical structures of sequence data when working in high-dimensional Euclidean spaces. This limitation hinders accurate sequence classification and similarity measurement. To address these challenges, this research proposes a method to transform the feature representation of biological sequences into the hyperboloid space. By applying a transformation, the sequences are mapped onto the hyperboloid, preserving their inherent structural information. Once the sequences are represented in the hyperboloid space, a kernel matrix is computed based on the hyperboloid features. The kernel matrix captures the pairwise similarities between sequences, enabling more effective analysis of biological sequence relationships. This approach leverages the inner product of the hyperboloid feature vectors to measure the similarity between pairs of sequences. The experimental evaluation of the proposed approach demonstrates its efficacy in capturing important sequence correlations and improving classification accuracy.