I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
Burden, John, Prunty, Jonathan, Slater, Ben, Tehenan, Matthieu, Davis, Greg, Cheke, Lucy
Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms -- originally developed to study human perception -- to test whether MLLMs exhibit the ``pop-out'' effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
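As an illustration of the pop-out criterion described above, the sketch below fits a least-squares line to accuracy as a function of distractor set size: a slope near zero is consistent with detection that does not depend on set size, while a clearly negative slope suggests a capacity limit. The function name and the numbers in the usage note are hypothetical, and the choice of accuracy as the dependent measure is an assumption, not a detail taken from the paper.

```python
def least_squares_slope(set_sizes, accuracies):
    """Slope of the ordinary least-squares line through (set_size, accuracy) points."""
    n = len(set_sizes)
    if n != len(accuracies) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(set_sizes) / n
    mean_y = sum(accuracies) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, accuracies))
    return sxy / sxx
```

For example, with made-up accuracies that stay flat at 0.95 across set sizes 4, 8, 16 and 32, the slope is zero, which is the pattern a pop-out effect would produce.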
Predicting Anthropometric Body Composition Variables Using 3D Optical Imaging and Machine Learning
Agrahari, Gyaneshwar, Bist, Kiran, Pandey, Monika, Kapita, Jacob, James, Zachary, Knox, Jackson, Heymsfield, Steven, Ramirez, Sophia, Wolenski, Peter, Drenska, Nadejda
Accurate prediction of anthropometric body composition variables, such as Appendicular Lean Mass (ALM), Body Fat Percentage (BFP), and Bone Mineral Density (BMD), is essential for early diagnosis of several chronic diseases. Currently, researchers rely on Dual-Energy X-ray Absorptiometry (DXA) scans to measure these metrics; however, DXA scans are costly and time-consuming. This work proposes an alternative to DXA scans by applying statistical and machine learning models on biomarkers (height, volume, left calf circumference, etc.) obtained from 3D optical images. The dataset consists of 847 patients and was sourced from Pennington Biomedical Research Center. Extracting patients' data in healthcare faces many technical challenges and legal restrictions. However, most supervised machine learning algorithms are inherently data-intensive, requiring a large amount of training data. To overcome these limitations, we implemented a semi-supervised model, the $p$-Laplacian regression model. This paper is the first to demonstrate the application of a $p$-Laplacian model for regression. Our $p$-Laplacian model yielded errors of $\sim13\%$ for ALM, $\sim10\%$ for BMD, and $\sim20\%$ for BFP when the training data accounted for 10 percent of all data. Among the supervised algorithms we implemented, Support Vector Regression (SVR) performed the best for ALM and BMD, yielding errors of $\sim 8\%$ for both, while Least Squares SVR performed the best for BFP with $\sim 11\%$ error when trained on 80 percent of the data. Our findings position the $p$-Laplacian model as a promising tool for healthcare applications, particularly in a data-constrained environment.
UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
Bourigault, Emmanuelle, Jamaludin, Amir, Hamdi, Abdullah
In medical imaging, the primary challenge is collecting large-scale labeled data due to privacy concerns, logistics, and high labeling costs. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs, all based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (referred to as UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by demonstrating zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from similar domains (e.g., abdominal MRI). To further mitigate the effect of noisy labels, we propose a novel method called Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on the Swin-UNetr architecture, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumor challenge (with a 0.4% improvement) and the BTCV abdominal CT scan benchmark (with a 1.3% improvement). The pre-trained models and the code are available at https://emmanuelleb985.github.io/ukbob , and the filtered labels will be made available with the UK Biobank.
Transformers Use Causal World Models in Maze-Solving Tasks
Spies, Alex F., Edwards, William, Ivanitskiy, Michael I., Skapars, Adrians, Räuker, Tilman, Inoue, Katsumi, Russo, Alessandra, Shanahan, Murray
Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop surprisingly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as ``World Models'' (WMs). In this work, we discover such WMs in transformers trained on maze tasks. In particular, by employing Sparse Autoencoders (SAEs) and analysing attention patterns, we examine the construction of WMs and demonstrate consistency between the circuit analysis and the SAE feature-based analysis. We intervene upon the isolated features to confirm their causal role and, in doing so, find asymmetries between certain types of interventions. Surprisingly, we find that models are able to reason with respect to a greater number of active features than they see during training, even if attempting to specify these in the input token sequence would lead the model to fail. Furthermore, we observe that varying positional encodings can alter how WMs are encoded in a model's residual stream. By analyzing the causal role of these WMs in a toy domain we hope to make progress toward an understanding of emergent structure in the representations acquired by Transformers, leading to the development of more interpretable and controllable AI systems.
How Susceptible are Large Language Models to Ideological Manipulation?
Chen, Kai, He, Zihao, Yan, Jun, Shi, Taiwei, Lerman, Kristina
Large Language Models (LLMs) possess the potential to exert substantial influence on public perceptions and interactions with information. This raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. In this work, we investigate how effectively LLMs can learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small amount of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones. The ease with which LLMs' ideologies can be skewed underscores the risks associated with intentionally poisoned training data by malicious actors or inadvertently introduced biases by data annotators. It also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on LLMs.
MRSegmentator: Robust Multi-Modality Segmentation of 40 Classes in MRI and CT Sequences
Häntze, Hartmut, Xu, Lina, Dorfner, Felix J., Donle, Leonhard, Truhn, Daniel, Aerts, Hugo, Prokop, Mathias, van Ginneken, Bram, Hering, Alessa, Adams, Lisa C., Bressem, Keno K.
Purpose: To introduce a deep learning model capable of multi-organ segmentation in MRI scans, offering a solution to the current limitations in MRI analysis due to challenges in resolution, standardized intensity values, and variability in sequences. Materials and Methods: The model was trained on 1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans, and 1,228 CT scans, leveraging cross-modality transfer learning from CT segmentation models. A human-in-the-loop annotation workflow was employed to efficiently create high-quality segmentations. The model's performance was evaluated on the NAKO and AMOS22 datasets, containing 600 and 60 MRI examinations, respectively. Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation accuracy. The model will be open sourced. Results: The model showcased high accuracy in segmenting well-defined organs, achieving Dice Similarity Coefficient (DSC) scores of 0.97 for the right and left lungs, and 0.95 for the heart. It also demonstrated robustness in organs like the liver (DSC: 0.96) and kidneys (DSC: 0.95 left, 0.95 right), which present more variability. However, segmentation of smaller and complex structures such as the portal and splenic veins (DSC: 0.54) and adrenal glands (DSC: 0.65 left, 0.61 right) revealed the need for further model optimization. Conclusion: The proposed model is a robust tool for accurate segmentation of 40 anatomical structures in MRI and CT images. By leveraging cross-modality learning and interactive annotation, the model achieves strong performance and generalizability across diverse datasets, making it a valuable resource for researchers and clinicians. It is open source and can be downloaded from https://github.com/hhaentze/MRSegmentator.
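For readers unfamiliar with the Dice Similarity Coefficient reported above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), for two binary masks. It is not taken from the MRSegmentator code base; the function name, the flat-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks.

    Both masks are flat sequences of 0/1 (or False/True) values of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A predicted mask with two foreground voxels that overlaps a one-voxel reference in exactly one voxel scores 2 * 1 / (2 + 1), roughly 0.67; identical masks score 1.0 and disjoint masks score 0.0.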
COPD-FlowNet: Elevating Non-invasive COPD Diagnosis with CFD Simulations
Tyagi, Aryan, Rao, Aryaman, Rao, Shubhanshu, Singh, Raj Kumar
Chronic Obstructive Pulmonary Disorder (COPD) is a prevalent respiratory disease that significantly impacts the quality of life of affected individuals. This paper presents COPDFlowNet, a novel deep-learning framework that leverages a custom Generative Adversarial Network (GAN) to generate synthetic Computational Fluid Dynamics (CFD) velocity flow field images specific to the trachea of COPD patients. These synthetic images serve as a valuable resource for data augmentation and model training. Additionally, COPDFlowNet incorporates a custom Convolutional Neural Network (CNN) architecture to predict the location of the obstruction site.
Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging
Arasteh, Soroosh Tayebi, Ziller, Alexander, Kuhl, Christiane, Makowski, Marcus, Nebelung, Sven, Braren, Rickmer, Rueckert, Daniel, Truhn, Daniel, Kaissis, Georgios
Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure its protection are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. Prior work indicates that DP has negative implications on model accuracy and fairness, which are unacceptable in medicine and represent a main barrier to the widespread use of privacy-preserving techniques. In this work, we evaluated the effect of privacy-preserving training of AI models for chest radiograph diagnosis regarding accuracy and fairness compared to non-private training. For this, we used a large dataset (N=193,311) of high quality clinical chest radiographs, which were retrospectively collected and manually labeled by experienced radiologists. We then compared non-private deep convolutional neural networks (CNNs) and privacy-preserving (DP) models with respect to privacy-utility trade-offs measured as area under the receiver-operator-characteristic curve (AUROC), and privacy-fairness trade-offs, measured as Pearson's r or Statistical Parity Difference. We found that the non-private CNNs achieved an average AUROC score of 0.90 +- 0.04 over all labels, whereas the DP CNNs with a privacy budget of epsilon=7.89 resulted in an AUROC of 0.87 +- 0.04, i.e., a mere 2.6% performance decrease compared to non-private training. Furthermore, we found the privacy-preserving training not to amplify discrimination against age, sex or co-morbidity. Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
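The Statistical Parity Difference mentioned above is commonly defined as the gap in positive-prediction rates between two groups, P(ŷ = 1 | group a) − P(ŷ = 1 | group b). The sketch below implements that common definition under stated assumptions; the function name, argument layout, and group labels are hypothetical, and the paper's exact computation may differ.

```python
def statistical_parity_difference(predictions, groups, group_a, group_b):
    """Positive-prediction rate of group_a minus that of group_b.

    predictions: sequence of 0/1 model outputs.
    groups: sequence of group labels, aligned with predictions.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must be aligned")

    def positive_rate(label):
        members = [p for p, g in zip(predictions, groups) if g == label]
        if not members:
            raise ValueError(f"no samples for group {label!r}")
        return sum(1 for p in members if p) / len(members)

    return positive_rate(group_a) - positive_rate(group_b)
```

A value of zero means both groups receive positive predictions at the same rate; the sign indicates which group is favoured.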
A New Robust Scalable Singular Value Decomposition Algorithm for Video Surveillance Background Modelling
Roy, Subhrajyoty, Basu, Ayanendranath, Ghosh, Abhik
A basic algorithmic task in automated video surveillance is to separate background and foreground objects. Camera tampering, noisy videos, low frame rate, etc., pose difficulties in solving the problem. A general approach which classifies the tampered frames, and performs subsequent analysis on the remaining frames after discarding the tampered ones, results in loss of information. We propose a robust singular value decomposition (SVD) approach based on the density power divergence to perform background separation robustly even in the presence of tampered frames. We also provide theoretical results and perform simulations to validate the superiority of the proposed method over the few existing robust SVD methods. Finally, we indicate several other use-cases of the proposed method to show its general applicability to a large range of problems.