Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking

Neural Information Processing Systems

Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets ($\sim$136K samples, over 400 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health.
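Benchmarks like the one described above typically evaluate a frozen pretrained encoder by fitting a lightweight classifier (a "linear probe") on its embeddings. A minimal sketch of that protocol, using synthetic Gaussian features as stand-ins for encoder outputs (the class structure, dimensions, and data below are illustrative assumptions, not OPERA's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings that a frozen pretrained acoustic
# encoder might produce for two classes of respiratory recordings.
class_a = rng.normal(0.0, 1.0, size=(200, 64))
class_b = rng.normal(0.5, 1.0, size=(200, 64))
X = np.vstack([class_a, class_b])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear probe: the encoder stays frozen; only this classifier is fit.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)
```

The appeal of this setup is that downstream task performance then reflects the quality of the pretrained representation rather than the capacity of the task head.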


Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Neural Information Processing Systems

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
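The core XDC recipe, clustering one modality and using the resulting assignments to supervise the other, can be illustrated on toy paired features. The cluster count, feature dimensions, and linear predictor below are illustrative simplifications, not the paper's deep networks:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy paired features: each "clip" has an audio and a video embedding,
# both reflecting one of three shared latent semantic groups.
latent = rng.integers(0, 3, size=300)
audio_centers = rng.normal(size=(3, 16))
video_centers = rng.normal(size=(3, 16))
audio = audio_centers[latent] + 0.1 * rng.normal(size=(300, 16))
video = video_centers[latent] + 0.1 * rng.normal(size=(300, 16))

# Step 1: unsupervised clustering in one modality (audio).
audio_clusters = KMeans(n_clusters=3, n_init=10,
                        random_state=0).fit_predict(audio)

# Step 2: the cluster assignments act as pseudo-labels for a predictor
# in the other modality (video) -- the cross-modal supervision of XDC.
video_clf = LogisticRegression(max_iter=1000).fit(video, audio_clusters)
pseudo_label_acc = video_clf.score(video, audio_clusters)
```

Because the two modalities share latent semantics but not features, the video model cannot trivially copy the audio clustering; it must learn the semantic structure from its own inputs.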



Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Yang, Zhijian, DSouza, Noel, Megyeri, Istvan, Xu, Xiaojian, Shandiz, Amin Honarmandi, Haddadpour, Farzin, Koos, Krisztian, Rusko, Laszlo, Valeriano, Emanuele, Swaninathan, Bharadwaj, Wu, Lei, Bhatia, Parminder, Kass-Hout, Taha, Bas, Erhan

arXiv.org Artificial Intelligence

Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.
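The modular design described above, a frozen pretrained encoder with small tunable task decoders, can be sketched in miniature. The random-projection "encoder", toy regression target, and gradient-descent head below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for a frozen pretrained encoder: a fixed nonlinear map
# whose weights are never updated during downstream tuning.
W_frozen = rng.normal(size=(32, 8))

def encode(x):
    return np.tanh(x @ W_frozen / np.sqrt(32))

# Lightweight task-specific decoder: a linear head fit by gradient
# descent on a toy regression target, leaving the encoder untouched.
X = rng.normal(size=(100, 32))
y = X[:, 0]
Z = encode(X)

w = np.zeros(8)
for _ in range(500):
    grad = Z.T @ (Z @ w - y) / len(y)
    w -= 0.1 * grad

baseline_mse = np.mean(y ** 2)           # error of predicting zero
decoder_mse = np.mean((Z @ w - y) ** 2)  # error of the tuned head
```

Only the 8 decoder weights are trained here; in the paper's setting this is what keeps per-task computational overhead low.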


Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms

Khattak, Gul Rukh, Patlatzoglou, Konstantinos, Barker, Joseph, Pastika, Libor, Zeidaabadi, Boroumand, El-Medany, Ahmed, Aggour, Hesham, Liang, Yixiu, Ribeiro, Antonio H., Annis, Jeffrey, Ribeiro, Antonio Luiz Pinho, Ge, Junbo, Kramer, Daniel B., Waks, Jonathan W., Brittain, Evan, Peters, Nicholas, Ng, Fu Siong, Sau, Arunashis

arXiv.org Artificial Intelligence

Department of Cardiology, Imperial College Healthcare NHS Trust, London, United Kingdom. Disclosures: JWW and DBK were previously on the advisory board for Heartcor Solutions LLC, for whom they remain independent consultants. JWW reports research funding from Anumana and is a consultant for HeartBeam Inc. FSN reports speaker fees from GE Healthcare and is on the advisory board for AstraZeneca. The remaining authors have no conflicts to declare. Heart and Lung Institute, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN. Abstract: Contrastive learning is a widely adopted self-supervised pretraining strategy, yet its dependence on cohort composition remains underexplored. We systematically assess how cohort demographics, health status, and population diversity influence downstream performance on prediction tasks, also including two additional cohorts from another continent (Europe). We find that downstream performance depends on the distributional properties of the pretraining cohort, including demographics and health status. Moreover, while pretraining with a multi-centre, demographically diverse cohort improves in-distribution accuracy, it reduces out-of-distribution (OOD) generalisation of our contrastive approach by encoding cohort-specific artifacts.
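Contrastive pretraining of the kind studied here typically optimizes an InfoNCE-style objective that pulls paired views together and pushes unrelated samples apart. A numpy sketch, with illustrative dimensions and temperature (both assumptions, not the paper's settings):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss for paired views z1[i] <-> z2[i] (numpy sketch)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on diagonal

rng = np.random.default_rng(3)
anchor = rng.normal(size=(64, 32))
positive = anchor + 0.05 * rng.normal(size=(64, 32))  # augmented view
random_pair = rng.normal(size=(64, 32))               # unrelated view

aligned_loss = info_nce(anchor, positive)
mismatched_loss = info_nce(anchor, random_pair)
```

The loss is low when paired views embed close together, which is exactly the mechanism through which cohort-specific artifacts, if they survive augmentation, get encoded into the representation.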


Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

Zhao, Rosie, Meterez, Alexandru, Kakade, Sham, Pehlevan, Cengiz, Jelassi, Samy, Malach, Eran

arXiv.org Artificial Intelligence

Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning models, recent work has demonstrated that RL fine-tuning consistently improves performance, even in smaller-scale models; however, the underlying mechanisms driving these improvements are not well-understood. Understanding the effects of RL fine-tuning requires disentangling its interaction with pretraining data composition, hyperparameters, and model scale, but such problems are exacerbated by the lack of transparency regarding the training data used in many existing models. In this work, we present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch on different mixtures of fully open datasets. We investigate the effects of various RL fine-tuning algorithms (PPO, GRPO, and Expert Iteration) across models of different scales. Our study reveals that RL algorithms consistently converge towards a dominant output distribution, amplifying patterns in the pretraining data. We also find that models of different scales trained on the same data mixture will converge to distinct output distributions, suggesting that there are scale-dependent biases in model generalization. Moreover, we find that RL post-training on simpler questions can lead to performance gains on harder ones, indicating that certain reasoning capabilities generalize across tasks. Our findings show that small-scale proxies in controlled settings can elicit interesting insights regarding the role of RL in shaping language model behavior.
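The amplification effect described above can be reproduced in miniature with an exact policy-gradient update on a categorical "policy". The four output styles, the slight pretraining bias toward style 0, and the reward values are invented for illustration; the point is only that among equally rewarded behaviors, RL increases the share of the one that was already dominant:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy "pretrained" policy over 4 output styles; styles 0 and 1 both
# solve the task (reward 1), styles 2 and 3 do not (reward 0).
# Style 0 starts slightly more probable, mimicking a pretraining bias.
logits = np.array([0.5, 0.0, 0.0, 0.0])
rewards = np.array([1.0, 1.0, 0.0, 0.0])

p0 = softmax(logits)
initial_ratio = p0[0] / p0[1]

# Exact policy-gradient ascent on expected reward (no sampling noise):
# grad of E[r] w.r.t. softmax logits is p * (r - baseline).
lr = 1.0
for _ in range(200):
    p = softmax(logits)
    baseline = p @ rewards
    logits += lr * p * (rewards - baseline)

p_final = softmax(logits)
final_ratio = p_final[0] / p_final[1]
```

Even though styles 0 and 1 earn identical reward, the gap between their logits grows at every step, so fine-tuning concentrates mass on the style the pretraining distribution already favored.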



Transferring Graph Neural Networks for Soft Sensor Modeling using Process Topologies

Theisen, Maximilian F., Meesters, Gabrie M. H., Schweidtmann, Artur M.

arXiv.org Artificial Intelligence

Data-driven soft sensors help in process operations by providing real-time estimates of otherwise hard-to-measure process quantities, e.g., viscosities or product concentrations. Currently, soft sensors need to be developed individually per plant. Using transfer learning, machine learning-based soft sensors could be reused and fine-tuned across plants and applications. However, transferring data-driven soft sensor models is often not possible in practice, because the fixed input structure of standard soft sensor models prohibits transfer if, e.g., the sensor information is not identical in all plants. We propose a topology-aware graph neural network approach for transfer learning of soft sensor models across multiple plants. In our method, plants are modeled as graphs: unit operations are nodes, streams are edges, and sensors are embedded as attributes. Our approach brings two advantages for transfer learning: first, we include not only sensor data but also crucial information on the plant topology; second, the graph neural network algorithm is flexible with respect to its sensor inputs. This allows us to model data from different plants with different sensor networks. We test the transfer learning capabilities of our modeling approach on ammonia synthesis loops with different process topologies. We build a soft sensor predicting the ammonia concentration in the product. After training on data from one process, we successfully transfer our soft sensor model to a previously unseen process with a different topology. Our approach promises to extend data-driven soft sensors to settings that leverage data from multiple plants.
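The plant-as-graph encoding described above can be sketched directly: units become nodes, streams become directed edges, and each node's feature is derived from whatever sensors that unit happens to have, so plants with different instrumentation map into the same feature space. The mini-plant, sensor values, and single mean-aggregation message-passing step below are hypothetical, not from the paper:

```python
import numpy as np

# Hypothetical mini-plant: unit operations as nodes, streams as
# directed edges, sensor readings as node attributes.
sensors = {
    "reactor":    [350.0, 12.5],        # e.g. temperature, pressure
    "separator":  [300.0],              # fewer sensors -- still fine
    "compressor": [15.0, 340.0, 0.8],
}
nodes = list(sensors)
index = {name: i for i, name in enumerate(nodes)}
edges = [("reactor", "separator"), ("separator", "compressor"),
         ("compressor", "reactor")]    # recycle stream closes the loop

# Node feature = mean of that unit's available sensor readings, so a
# variable number of sensors per unit poses no structural problem.
h = np.array([[np.mean(sensors[n])] for n in nodes])  # shape (3, 1)

# One mean-aggregation message-passing step over the plant topology.
msgs = np.zeros_like(h)
counts = np.zeros(len(nodes))
for src, dst in edges:
    msgs[index[dst]] += h[index[src]]
    counts[index[dst]] += 1
h_next = 0.5 * h + 0.5 * msgs / counts[:, None]
```

Because the update only sums over whatever edges and sensors exist, the same model can be applied unchanged to a plant with a different topology, which is the property the transfer-learning result relies on.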


Promoting cross-modal representations to improve multimodal foundation models for physiological signals

Fang, Ching, Sandino, Christopher, Mahasseni, Behrooz, Minxha, Juri, Pouransari, Hadi, Azemi, Erdrin, Moin, Ali, Zippi, Ellen

arXiv.org Artificial Intelligence

Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration and it is unclear which pretraining strategies are most effective given the diversity of physiological signals. This is partly due to challenges in multimodal health data: obtaining data across many patients is difficult and costly, there is a lot of inter-subject variability, and modalities are often heterogeneously informative across downstream tasks. Here, we explore these challenges in the PhysioNet 2018 dataset. We use a masked autoencoding objective to pretrain a multimodal model. We show that the model learns representations that can be linearly probed for a diverse set of downstream tasks. We hypothesize that cross-modal reconstruction objectives are important for successful multimodal training, as they encourage the model to integrate information across modalities. We demonstrate that modality dropout in the input space improves performance across downstream tasks. We also find that late-fusion models pretrained with contrastive learning objectives are less effective across multiple tasks. Finally, we analyze the model's representations, showing that attention weights become more cross-modal and temporally aligned with our pretraining strategy. The learned embeddings also become more distributed in terms of the modalities encoded by each unit. Overall, our work demonstrates the utility of multimodal foundation models with health data, even across diverse physiological data sources. We further argue that explicit methods for inducing cross-modality may enhance multimodal pretraining strategies.
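The input-space modality dropout highlighted above is straightforward to sketch: whole modalities are zeroed at random during pretraining so the model must reconstruct them from the survivors. The batch shape, modality names, and dropout rate below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy batch: 4 recordings x 3 modalities (e.g. EEG, ECG, SpO2)
# x 16 timesteps. Names and sizes are illustrative only.
batch = rng.normal(size=(4, 3, 16))

def modality_dropout(x, p=0.5, rng=rng):
    """Zero out whole modalities at random, keeping at least one."""
    keep = rng.random((x.shape[0], x.shape[1])) > p
    # Guarantee at least one modality survives per sample.
    dead = ~keep.any(axis=1)
    keep[dead, rng.integers(0, x.shape[1], size=dead.sum())] = True
    return x * keep[:, :, None], keep

dropped, keep_mask = modality_dropout(batch)
```

Masking entire modalities (rather than random timesteps) is what forces cross-modal reconstruction: the only route to recovering a dropped signal is through the modalities that remain.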