Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
Svetlana Maslenkova, Clement Christophe, Marco AF Pimentel, Tathagata Raha, Muhammad Umar Salman, Ahmed Al Mahrooqi, Avani Gupta, Shadab Khan, Ronnie Rajan, Praveenkumar Kanithi
Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
Supplementary Material
Here we elaborate on the details of using SNFs as a variational approximation of the posterior distribution of a variational autoencoder (VAE) [21], as presented in our last results section. All experiments were run using PyTorch 1.2 on GTX 1080 Ti cards. The NSF block consists of two subsequent NSF layers with intermediate swap layers. "Biased data" is defined by running local Metropolis MC in each of the two wells. "Unbiased data" is produced by running Metropolis MC with a large proposal step (standard ...). The other settings are the same as in Table 1.
Neural simulation-based inference of the Higgs trilinear self-coupling via off-shell Higgs production
Aishik Ghosh, Maximilian Griese, Ulrich Haisch, Tae Hyoun Park
One of the forthcoming major challenges in particle physics is the experimental determination of the Higgs trilinear self-coupling. While efforts have largely focused on on-shell double- and single-Higgs production in proton-proton collisions, off-shell Higgs production has also been proposed as a valuable complementary probe. In this article, we design a hybrid neural simulation-based inference (NSBI) approach to construct a likelihood of the Higgs signal incorporating modifications from the Standard Model effective field theory (SMEFT), relevant background processes, and quantum interference effects. It leverages the training efficiency of matrix-element-enhanced techniques, which are vital for robust SMEFT applications, while also incorporating the practical advantages of classification-based methods for effective background estimates. We demonstrate that our NSBI approach achieves sensitivity close to the theoretical optimum and provide expected constraints for the high-luminosity upgrade of the Large Hadron Collider. While we primarily concentrate on the Higgs trilinear self-coupling, we also consider constraints on other SMEFT operators that affect off-shell Higgs production.
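The classification-based methods the abstract leverages for background estimates rest on the standard likelihood-ratio trick: a classifier trained to separate samples drawn under two hypotheses recovers their density ratio through its logit. The sketch below is a minimal numpy illustration of that trick on two Gaussians (our own toy, not the authors' hybrid NSBI pipeline); names like `log_ratio` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x0 = rng.normal(0.0, 1.0, n)  # samples under hypothesis 0 (e.g. background)
x1 = rng.normal(1.0, 1.0, n)  # samples under hypothesis 1 (e.g. signal)
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Fit a logistic-regression classifier by full-batch gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((p - y) * x)
    b -= 0.1 * np.mean(p - y)

def log_ratio(xq):
    # The classifier's logit estimates log p1(x)/p0(x);
    # for these two Gaussians the true value is x - 0.5.
    return w * xq + b
```

For N(1,1) versus N(0,1) the exact log-ratio is linear in x, so the fitted weight and bias should land near 1 and -0.5, and the same recipe carries over when the "classifier" is a neural network and the hypotheses are simulated event samples.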
The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities
Yongwei Che, Benjamin Eysenbach
While internet-scale data often comes in pairs (e.g., audio/image, image/text), we often want to perform inference over modality pairs unseen together in the training data (e.g., audio/text). Empirically, this can often be addressed by learning multiple contrastive embedding spaces between existing modality pairs, implicitly hoping that unseen modality pairs will end up aligned. This theoretical paper proves that this hope is well founded under certain assumptions. Starting from the proper Bayesian approach of integrating out intermediate modalities, we show that directly comparing the representations of data from unpaired modalities can recover the same likelihood ratio. Our analysis builds on prior work on the geometry and probabilistic interpretation of contrastive representations, showing how these representations can answer many of the same queries as probabilistic graphical models. Our analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for handling language ambiguity in reinforcement learning. Our numerical experiments study the importance of our assumptions and demonstrate these new applications. Much of the appeal of contrastive learning is that it offers a "plug-n-play" approach for swapping one modality for another. Because representations from different modalities are trained to be aligned when they represent the same object, the hope is that (say) a language representation and an image representation of the same scene can be used as substitutes. This property is practically appealing for at least two reasons. First, it allows us to make use of pre-trained models: if you have a model that needs (say) language input and you have access to a pre-trained image-language contrastive model, you might simply train your model on the pre-trained image representations and hope that it continues to work when you swap in the language representations.
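The plug-n-play hope described above can be illustrated numerically without any training: if each modality's encoder produces a (noisy, normalized) view of a shared latent, then two modalities that were never paired still score their matching objects highest under a dot-product comparison. This is a minimal sketch of our own (it stands in for learned contrastive encoders; the `encode` function and its noise model are assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, dim, noise = 8, 32, 0.05

# One shared latent vector per object.
z = rng.normal(size=(n_objects, dim))

def encode(latent):
    # Stand-in for a trained per-modality encoder: a noisy,
    # unit-normalized view of the shared latent.
    v = latent + noise * rng.normal(size=latent.shape)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

audio, image, text = encode(z), encode(z), encode(z)

# Audio and text were never "paired", yet comparing their
# representations directly still identifies the matching object:
# the similarity matrix peaks on the diagonal.
sim = audio @ text.T
matches = sim.argmax(axis=1)
```

In the toy, alignment is built in by construction; the paper's contribution is showing when contrastive training on the observed pairs (audio/image, image/text) yields this property for the unseen pair, and that the resulting scores recover a likelihood ratio.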
De-mark: Watermark Removal in Large Language Models
Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
Watermarking techniques offer a promising way to identify machine-generated content by embedding covert information into the text generated by language models (LMs). However, the robustness of these watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to effectively remove n-gram-based watermarks. Our method uses a novel querying strategy, termed random selection probing, to assess the strength of the watermark and identify the red-green list underlying the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
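For context, the n-gram watermarks De-mark targets partition the vocabulary into a pseudorandom "green" list keyed on the preceding token(s), bias generation toward green tokens, and detect the watermark by a z-score on the green-token fraction. The sketch below is our own toy of that scheme (not the De-mark method itself; the vocabulary size, seeding scheme, and a watermarker that always picks green tokens are simplifying assumptions):

```python
import numpy as np

VOCAB = 1000
GAMMA = 0.5  # fraction of the vocabulary that is "green" at each step

def green_list(prev_token, vocab=VOCAB, gamma=GAMMA):
    # Pseudorandomly partition the vocabulary, seeded by the previous token.
    rng = np.random.default_rng(prev_token)
    perm = rng.permutation(vocab)
    return set(perm[: int(gamma * vocab)].tolist())

def z_score(tokens, gamma=GAMMA):
    # Count tokens that fall in the green list keyed by their predecessor,
    # then standardize against the null (unwatermarked) expectation.
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / np.sqrt(gamma * (1 - gamma) * n)

rng = np.random.default_rng(0)
# Watermarked text: always sample the next token from the green list.
wm = [int(rng.integers(VOCAB))]
for _ in range(50):
    wm.append(int(rng.choice(sorted(green_list(wm[-1])))))
# Unwatermarked text: uniformly random tokens.
plain = [int(t) for t in rng.integers(VOCAB, size=51)]

z_wm, z_plain = z_score(wm), z_score(plain)
```

An attacker who can reconstruct the red-green partition (which is what De-mark's random selection probing estimates via queries) can then rewrite text to push the green fraction back toward the unwatermarked baseline, driving the detector's z-score below threshold.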