Goto

Collaborating Authors

 youden



HyperCore: Coreset Selection under Noise via Hypersphere Models

Moser, Brian B., Shanbhag, Arundhati S., Nauen, Tobias C., Frolov, Stanislav, Raue, Federico, Folz, Joachim, Dengel, Andreas

arXiv.org Artificial Intelligence

The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.


VietBinoculars: A Zero-Shot Approach for Detecting Vietnamese LLM-Generated Text

Nguyen, Trieu Hai, Akilesh, Sivaswamy

arXiv.org Artificial Intelligence

The rapid development research of Large Language Models (LLMs) based on transformer architectures raises key challenges, one of them being the task of distinguishing between human-written text and LLM-generated text. As LLM-generated textual content, becomes increasingly complex over time, and resembles human writing, traditional detection methods are proving less effective, especially as the number and diversity of LLMs continue to grow with new models and versions being released at a rapid pace. This study proposes VietBinoculars, an adaptation of the Binoculars method with optimized global thresholds, to enhance the detection of Vietnamese LLM-generated text. We have constructed new Vietnamese AI-generated datasets to determine the optimal thresholds for VietBinoculars and to enable benchmarking. The results from our experiments show results show that VietBinoculars achieves over 99\% in all two domains of accuracy, F1-score, and AUC on multiple out-of-domain datasets. It outperforms the original Binoculars model, traditional detection methods, and other state-of-the-art approaches, including commercial tools such as ZeroGPT and DetectGPT, especially under specially modified prompting strategies.


Semantic Similarity-Informed Bayesian Borrowing for Quantitative Signal Detection of Adverse Events

Haguinet, François, Painter, Jeffery L, Powell, Gregory E, Callegaro, Andrea, Bate, Andrew

arXiv.org Artificial Intelligence

We present a Bayesian dynamic borrowing (BDB) approach to enhance the quantitative identification of adverse events (AEs) in spontaneous reporting systems (SRSs). The method embeds a robust meta-analytic predictive (MAP) prior with a Bayesian hierarchical model and incorporates semantic similarity measures (SSMs) to enable weighted information sharing from clinically similar MedDRA Preferred Terms (PTs) to the target PT. This continuous similarity-based borrowing overcomes limitations of rigid hierarchical grouping in current disproportionality analysis (DPA). Using data from the FDA Adverse Event Reporting System (FAERS) between 2015 and 2019, we evaluate our approach -- termed IC SSM -- against traditional Information Component (IC) analysis and IC with borrowing at the MedDRA high-level group term level (IC HLGT). A reference set (PVLens), derived from FDA product label update, enabled prospective evaluation of method performance in identifying AEs prior to official labeling. The IC SSM approach demonstrated higher sensitivity (1332/2337=0.570, Youden's J=0.246) than traditional IC (Se=0.501, J=0.250) and IC HLGT (Se=0.556, J=0.225), consistently identifying more true positives and doing so on average 5 months sooner than traditional IC. Despite a marginally lower aggregate F1-score and Youden's index, IC SSM showed higher performance in early post-marketing periods or when the detection threshold was raised, providing more stable and relevant alerts than IC HLGT and traditional IC. These findings support the use of SSM-informed Bayesian borrowing as a scalable and context-aware enhancement to traditional DPA methods, with potential for validation across other datasets and exploration of additional similarity metrics and Bayesian strategies using case-level data.


A Cross-lingual Natural Language Processing Framework for Infodemic Management

Pal, Ridam, Pandey, Rohan, Gautam, Vaibhav, Bhagat, Kanav, Sethi, Tavpritesh

arXiv.org Artificial Intelligence

The COVID-19 pandemic has put immense pressure on health systems which are further strained due to the misinformation surrounding it. Under such a situation, providing the right information at the right time is crucial. There is a growing demand for the management of information spread using Artificial Intelligence. Hence, we have exploited the potential of Natural Language Processing for identifying relevant information that needs to be disseminated amongst the masses. In this work, we present a novel Cross-lingual Natural Language Processing framework to provide relevant information by matching daily news with trusted guidelines from the World Health Organization. The proposed pipeline deploys various techniques of NLP such as summarizers, word embeddings, and similarity metrics to provide users with news articles along with a corresponding healthcare guideline. A total of 36 models were evaluated and a combination of LexRank based summarizer on Word2Vec embedding with Word Mover distance metric outperformed all other models. This novel open-source approach can be used as a template for proactive dissemination of relevant healthcare information in the midst of misinformation spread associated with epidemics.