Navigating the MIL Trade-Off: Flexible Pooling for Whole Slide Image Classification
Multiple Instance Learning (MIL) is a standard weakly supervised approach for Whole Slide Image (WSI) classification, where performance hinges on both feature representation and MIL pooling strategies. Recent research has predominantly focused on Transformer-based architectures adapted for WSIs. However, we argue that this trend faces a fundamental limitation: data scarcity. In typical settings, Transformer models yield only marginal gains without access to large-scale datasets--resources that are virtually inaccessible to all but a few well-funded research labs. Motivated by this, we revisit simple, non-attention MIL with unsupervised slide features and analyze temperature-$\beta$-controlled log-sum-exp (LSE) pooling.
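As a minimal sketch of what temperature-controlled LSE pooling can look like, the snippet below assumes the common mean-normalized form $\mathrm{LSE}_\beta(s) = \frac{1}{\beta}\log\big(\frac{1}{n}\sum_{i=1}^{n} e^{\beta s_i}\big)$ over per-instance scores $s_1, \dots, s_n$. The abstract does not state the exact formula, so this form, the function name `lse_pool`, and the example scores are illustrative assumptions rather than the authors' implementation. Under this form, small $\beta$ behaves like mean pooling and large $\beta$ like max pooling.

```python
import numpy as np


def lse_pool(scores, beta):
    """Mean-normalized log-sum-exp pooling over instance scores.

    Assumed form (not taken from the paper): (1/beta) * log(mean(exp(beta * s))).
    """
    s = np.asarray(scores, dtype=float)
    if beta <= 0:
        raise ValueError("beta must be positive")
    z = beta * s
    m = z.max()  # shift by the max so exp() cannot overflow
    return (m + np.log(np.mean(np.exp(z - m)))) / beta


# Hypothetical per-instance scores for one bag (e.g. one slide).
scores = np.array([0.1, 0.2, 0.9, 0.3])

near_mean = lse_pool(scores, 1e-6)  # small beta: close to the mean of the scores
mid_value = lse_pool(scores, 1.0)   # intermediate beta: between mean and max
near_max = lse_pool(scores, 1e3)    # large beta: close to the max of the scores
```

A single scalar $\beta$ therefore interpolates between the two classical non-attention pooling operators, which is what makes it a convenient knob to analyze.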
Details
A.1 Difference between the performance of two joint policies In Section 3.1, the difference between the performance of two joint policies is expressed as follows: The proof is a multi-agent version of the proof in (Kakade and Langford, 2002). Now we provide the mathematical detail formally. A.2 Approximation that matches the true value to first order In Section 3.1, we claim that Jπ( π) matches J( π) to first order. Intuitively, this means that a sufficiently small update of the joint policy which improves Jπ( π) will also improve J( π). Now we prove it formally.
Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning
Harvey, Ethan, Loevlie, Dennis Johan, Hughes, Michael C.
Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still do not achieve the best possible performance when trained with ten thousand training samples, each containing many instances.
GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image Analysis
Quan, Peiran, Gu, Zifan, Zhao, Zhuo, Zhou, Qin, Yang, Donghan M., Rong, Ruichen, Xie, Yang, Xiao, Guanghua
Foundation models (FMs) have transformed computational pathology by providing powerful, general-purpose feature extractors. However, adapting and benchmarking individual FMs for specific diagnostic tasks is often time-consuming and resource-intensive, especially given their scale and diversity. To address this challenge, we introduce Group-Aggregative Selection Multi-Instance Learning (GAS-MIL), a flexible ensemble framework that seamlessly integrates features from multiple FMs, preserving their complementary strengths without requiring manual feature selection or extensive task-specific fine-tuning. Across classification tasks in three cancer datasets -- prostate (PANDA), ovarian (UBC-OCEAN), and breast (TCGA-BrCa) -- GAS-MIL consistently achieves superior or on-par performance relative to individual FMs and established MIL methods, demonstrating its robustness and generalizability. By enabling efficient integration of heterogeneous FMs, GAS-MIL streamlines model deployment for pathology and provides a scalable foundation for future multimodal and precision oncology applications.
An ablation study over different model architectures (Table (a)) shows that the chosen
FB15k's lack of hierarchy offers no advantage to hyperbolic embeddings, but its large number MuRP does not also set out to include MTL, but we hope to address this in future work. We will include all recommendations, e.g. However, we agree that it is important to compare models across a range of dimensionalities. Note that for MuRP with biases replaced by (transformed) norms, performance reduces (e.g. Multi-relational transforms and Justification for architecture: See "Architecture ablation study".
A Transformer-Based Conditional GAN with Multiple Instance Learning for UAV Signal Detection and Classification
Liu, Haochen, Bi, Jia, Wang, Xiaomin, Yang, Xin, Wang, Ling
Unmanned Aerial Vehicles (UAVs) are increasingly used in surveillance, logistics, agriculture, disaster management, and military operations. Accurate detection and classification of UAV flight states, such as hovering, cruising, ascending, or transitioning, is essential for safe and effective operations. However, conventional time series classification (TSC) methods often lack robustness and generalization for dynamic UAV environments, while state-of-the-art (SOTA) models like Transformers and LSTM-based architectures typically require large datasets and entail high computational costs, especially with high-dimensional data streams. This paper proposes a novel framework that integrates a Transformer-based Generative Adversarial Network (GAN) with Multiple Instance Locally Explainable Learning (MILET) to address these challenges in UAV flight state classification. The Transformer encoder captures long-range temporal dependencies and complex telemetry dynamics, while the GAN module augments limited datasets with realistic synthetic samples. MIL is incorporated to focus attention on the most discriminative input segments, reducing noise and computational overhead. Experimental results show that the proposed method achieves an accuracy of 96.5% on the DroneDetect dataset and 98.6% on the DroneRF dataset, outperforming other SOTA approaches. The framework also demonstrates strong computational efficiency and robust generalization across diverse UAV platforms and flight states, highlighting its potential for real-time deployment in resource-constrained environments.
Multi-instance Learning as Downstream Task of Self-Supervised Learning-based Pre-trained Model
Matsuishi, Koki, Okita, Tsuyoshi
In deep multi-instance learning, the number of applicable instances depends on the data set. In histopathology images, deep multi-instance learners usually assume there are hundreds to thousands of instances in a bag. However, when the number of instances in a bag increases to 256 in brain hematoma CT, learning becomes extremely difficult. To overcome this problem, we propose using a model pre-trained with self-supervised learning, with the multi-instance learner as its downstream task. With this method, even when the original target task suffers from the spurious correlation problem, we show improvements of 5% to 13% in accuracy and 40% to 55% in the F1 measure for the hypodensity marker classification of brain hematoma CT.
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Benhammou, Yassir, Tiberio, Alessandro, Trautmann, Gabriel, Kalyan, Suman
MILS (Multimodal Iterative LLM Solver) is a recently published framework that claims "LLMs can see and hear without any training" by leveraging an iterative, LLM-CLIP based approach for zero-shot image captioning. While this MILS approach demonstrates good performance, our investigation reveals that this success comes at a hidden, substantial computational cost due to its expensive multi-step refinement process. In contrast, alternative models such as BLIP-2 and GPT-4V achieve competitive results through a streamlined, single-pass approach. We hypothesize that the significant overhead inherent in MILS's iterative process may undermine its practical benefits, thereby challenging the narrative that zero-shot performance can be attained without incurring heavy resource demands. This work is the first to expose and quantify the trade-offs between output quality and computational cost in MILS, providing critical insights for the design of more efficient multimodal models.