Kanan, Christopher
A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization
Harun, Md Yousuf, Kanan, Christopher
To adapt to real-world data streams, continual learning (CL) systems must rapidly learn new concepts while preserving and utilizing prior knowledge. When it comes to adding new information to continually-trained deep neural networks (DNNs), classifier weights for newly encountered categories are typically initialized randomly, leading to high initial training loss (spikes) and instability. Consequently, achieving optimal convergence and accuracy requires prolonged training, increasing computational costs. Inspired by Neural Collapse (NC), we propose a weight initialization strategy to improve learning efficiency in CL. In DNNs trained with mean-squared-error, NC gives rise to a Least-Square (LS) classifier in the last layer, whose weights can be analytically derived from learned features. Our method mitigates initial loss spikes and accelerates adaptation to new tasks. We evaluate our approach in large-scale CL settings, demonstrating faster adaptation and improved CL performance.

Deep learning models excel in static environments where the data follows an independent and identically distributed (IID) assumption. However, in real-world scenarios, data distributions shift over time (non-IID), and new data arrives sequentially. Conventional deep neural networks (DNNs) struggle under such conditions, often requiring periodic re-training from scratch, which is not only computationally expensive but also contributes significantly to the carbon footprint of AI (Schwartz et al., 2020). Despite frequent retraining from scratch, real-world models still suffer up to 40% accuracy drops (Mallick et al., 2022). Continual learning (CL) aims to address this inefficiency by enabling models to learn from evolving data streams while preserving previously acquired knowledge (Parisi et al., 2019). CL is a promising solution to model decay, where predictive performance deteriorates over time due to concept drift--a shift in the meaning or distribution of target variables (Tsymbal, 2004; Gama et al., 2014; Lu et al., 2018).
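The analytic initialization can be illustrated with a short sketch. Assuming the LS classifier is the ridge-regularized least-squares fit of penultimate-layer features to one-hot targets (the function name, ridge value, and toy data below are our own illustration, not the paper's code), new-class weights can be derived without any gradient steps:

```python
import numpy as np

def ls_classifier_init(feats, labels, num_classes, ridge=1e-4):
    """Analytically derive least-squares classifier weights from features.

    feats:  (n, d) array of penultimate-layer embeddings for the new classes.
    labels: (n,) integer class ids.
    Returns a (num_classes, d) weight matrix for a linear head.
    """
    n, d = feats.shape
    Y = np.eye(num_classes)[labels]              # (n, c) one-hot targets
    G = feats.T @ feats + ridge * np.eye(d)      # (d, d) regularized Gram matrix
    W = np.linalg.solve(G, feats.T @ Y)          # (d, c) ridge least-squares solution
    return W.T                                   # (c, d) classifier weights

# Hypothetical usage with random stand-in features
rng = np.random.default_rng(0)
feats = rng.normal(size=(512, 128))
labels = rng.integers(0, 10, size=512)
W_init = ls_classifier_init(feats, labels, num_classes=10)
print(W_init.shape)  # (10, 128)
```

Initializing the new rows of the classifier this way, rather than randomly, avoids the initial loss spike the abstract describes.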
Controlling Neural Collapse Enhances Out-of-Distribution Detection and Transfer Learning
Harun, Md Yousuf, Gallardo, Jhair, Kanan, Christopher
Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related to these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex Equiangular Tight Frame (ETF) projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers.

Figure 1: There is a close inverse relationship between OOD detection and generalization with respect to the degree of representation collapse in DNN layers; the plot illustrates this relationship for a pretrained VGG17.
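The fixed Simplex ETF projector mentioned above has a standard closed-form construction in the NC literature; a minimal sketch follows (the function name and the random orthonormal basis are our choices, and how the paper attaches the projector to specific layers is not shown here):

```python
import numpy as np

def simplex_etf(feat_dim, num_classes, seed=0):
    """Construct a fixed Simplex Equiangular Tight Frame (ETF) projector.

    Returns a (feat_dim, num_classes) matrix whose columns have unit norm
    and maximally separated pairwise inner products of -1/(num_classes - 1).
    """
    assert feat_dim >= num_classes
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(feat_dim, num_classes))
    U, _ = np.linalg.qr(A)                       # (d, c) orthonormal columns
    c = num_classes
    center = np.eye(c) - np.ones((c, c)) / c     # remove the global mean direction
    return np.sqrt(c / (c - 1)) * U @ center     # classic simplex ETF formula

M = simplex_etf(feat_dim=128, num_classes=10)
G = M.T @ M   # diagonal ~ 1.0, off-diagonal ~ -1/9
print(np.round(G[:3, :3], 2))
```

Freezing such a matrix as a layer's projector is one way to enforce NC-like geometry, matching the role the abstract assigns to the ETF.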
INSIGHT: Explainable Weakly-Supervised Medical Image Analysis
Zhang, Wenbo, Chen, Junyu, Kanan, Christopher
Whole-slide pathology images (WSIs) are often processed by extracting embeddings from local regions, and then an aggregator makes predictions from this set. However, current methods require post-hoc visualization techniques (e.g., Grad-CAM) and often fail to localize small yet clinically crucial details. To address these limitations, we introduce INSIGHT, a novel weakly-supervised aggregator that integrates heatmap generation as an inductive bias. Starting from pre-trained feature maps, INSIGHT employs a detection module with small convolutional kernels to capture fine details and a context module with a broad receptive field to suppress local false positives.

Processing such pathology image data end-to-end with deep neural networks is computationally infeasible. Instead, pipelines rely on aggregators, which synthesize local embeddings extracted from tiles (WSIs) or slices (volumes) into global predictions [5, 6, 23]. While this divide-and-conquer strategy is efficient, current methods often discard spatial information during feature aggregation and depend on post-hoc visualization tools, such as Grad-CAM [33], to generate interpretable heatmaps. These visualizations are prone to missing clinically significant features and introduce additional complexity.
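A rough sketch of how a detection-plus-context aggregator along these lines might be wired is below; the layer sizes, dilations, gating, and pooling are all our guesses from the abstract, not the INSIGHT implementation:

```python
import torch
import torch.nn as nn

class HeatmapAggregator(nn.Module):
    """Illustrative INSIGHT-style aggregator (structure guessed from the abstract).

    A detection branch with small kernels scores fine details, a context
    branch with a broad receptive field (dilated convs) gates out local
    false positives, and the internal heatmap is pooled into a prediction.
    """
    def __init__(self, embed_dim=768):
        super().__init__()
        self.detect = nn.Conv2d(embed_dim, 1, kernel_size=1)   # small-kernel detector
        self.context = nn.Sequential(                          # broad receptive field
            nn.Conv2d(embed_dim, 64, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=3, padding=8, dilation=8),
        )

    def forward(self, feat_map):                   # (B, C, H, W) pre-trained features
        heatmap = self.detect(feat_map) * torch.sigmoid(self.context(feat_map))
        score = heatmap.flatten(1).logsumexp(dim=1)  # smooth-max pooling to a slide score
        return score, heatmap                      # heatmap doubles as the explanation

model = HeatmapAggregator()
score, heatmap = model(torch.randn(2, 768, 32, 32))
print(score.shape, heatmap.shape)  # torch.Size([2]) torch.Size([2, 1, 32, 32])
```

Because the heatmap is computed inside the forward pass, the explanation is a built-in inductive bias rather than a post-hoc visualization.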
Improving Multimodal Large Language Models Using Continual Learning
Srivastava, Shikhar, Harun, Md Yousuf, Shrestha, Robik, Kanan, Christopher
Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks, effectively preserving linguistic skills while acquiring new multimodal capabilities.

Figure 1: Summary results of the best CL methods we evaluated for training LLaVA 1.5 compared to the unimodal base LLM and the original version of LLaVA 1.5. All results are with Pythia 2.8B as the base LLM. The best method has almost the same vision-language (VL) accuracy while providing a large increase in linguistic performance on 1 NLG and 4 NLU tasks by 8% and 2% (absolute), respectively.
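The abstract does not name the best-performing CL method, so as a purely illustrative stand-in, here is a sketch of one generic mitigation: mixing replayed language-only examples into each multimodal fine-tuning batch (the function name, replay_frac, and the batching scheme are all assumptions of ours):

```python
import random

def build_rehearsal_batches(vl_examples, text_examples,
                            replay_frac=0.25, batch_size=16, seed=0):
    """Toy sketch of experience replay for MLLM fine-tuning.

    Each vision-language batch reserves a fraction of its slots for
    replayed language-only examples, so linguistic skills keep getting
    gradient signal during multimodal training.
    """
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac)
    batches = []
    for i in range(0, len(vl_examples), batch_size - n_replay):
        batch = vl_examples[i : i + batch_size - n_replay]
        batch += rng.sample(text_examples, min(n_replay, len(text_examples)))
        rng.shuffle(batch)
        batches.append(batch)
    return batches

batches = build_rehearsal_batches([f"vl_{i}" for i in range(48)],
                                  [f"nl_{i}" for i in range(100)])
print(len(batches), len(batches[0]))  # 4 16
```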
What Variables Affect Out-Of-Distribution Generalization in Pretrained Models?
Harun, Md Yousuf, Lee, Kyungbok, Gallardo, Jhair, Krishnan, Giri, Kanan, Christopher
Embeddings produced by pre-trained deep neural networks (DNNs) are widely used; however, their efficacy for downstream tasks varies considerably. We study the factors influencing out-of-distribution (OOD) generalization of pre-trained DNN embeddings through the lens of the tunnel effect hypothesis, which suggests deeper DNN layers compress representations and hinder OOD performance. Contrary to earlier work, we find the tunnel effect is not universal. Based on 10,584 linear probes, we study the conditions that mitigate the tunnel effect by varying DNN architecture, training dataset, image resolution, and augmentations. We quantify each variable's impact using a novel SHAP analysis. Our results emphasize the danger of generalizing findings from toy datasets to broader contexts.
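Each of the 10,584 linear probes boils down to the standard protocol sketched below: fit a linear classifier on one layer's embeddings for a downstream (OOD) dataset and score a held-out split, then sweep over depth to see where tunnel layers begin compressing away OOD-relevant information. The random arrays here merely stand in for real per-layer embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def layer_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Linear-probe protocol for one DNN layer: fit a linear classifier on
    that layer's embeddings of the downstream dataset's train split and
    report accuracy on its held-out split."""
    probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return probe.score(test_feats, test_labels)

# Hypothetical embeddings standing in for features extracted at one layer
rng = np.random.default_rng(0)
train_x, test_x = rng.normal(size=(200, 64)), rng.normal(size=(100, 64))
train_y, test_y = rng.integers(0, 5, 200), rng.integers(0, 5, 100)
print(layer_probe_accuracy(train_x, train_y, test_x, test_y))
```

Plotting this accuracy layer by layer is what reveals (or fails to reveal) a tunnel in a given architecture/dataset/augmentation configuration.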
PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology
Shaikovski, George, Casson, Adam, Severson, Kristen, Zimmermann, Eric, Wang, Yi Kan, Kunz, Jeremy D., Retamero, Juan A., Oakley, Gerard, Klimstra, David, Kanan, Christopher, Hanna, Matthew, Zelechowski, Michal, Viret, Julian, Tenenholtz, Neil, Hall, James, Fusi, Nicolo, Yousfi, Razik, Hamilton, Peter, Moye, William A., Vorontsov, Eugene, Liu, Siqi, Fuchs, Thomas J.
Foundation models in computational pathology promise to unlock the development of new clinical decision support systems and models for precision medicine. However, there is a mismatch between most clinical analysis, which is defined at the level of one or more whole slide images, and foundation models to date, which process the thousands of image tiles contained in a whole slide image separately. The requirement to train a network to aggregate information across a large number of tiles in multiple whole slide images limits these models' impact. In this work, we present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and leverages clinical report text for pre-training. Using the tile embeddings, PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use. Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching and surpassing that of a supervised aggregator model. Using the slide embeddings with linear classifiers, PRISM surpasses supervised aggregator models. Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields label-efficient training for biomarker prediction, a task that typically suffers from low availability of training data; an aggregator initialized with PRISM and trained on as little as 10% of the training data can outperform a supervised baseline that uses all of the data.
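Zero-shot use of slide embeddings with text prompts typically follows the generic CLIP-style recipe sketched below; PRISM's actual prompt format and scoring may differ, and the embeddings here are random stand-ins:

```python
import numpy as np

def zero_shot_classify(slide_emb, prompt_embs, class_names):
    """CLIP-style zero-shot classification: choose the class whose
    text-prompt embedding has the highest cosine similarity with the
    slide-level embedding."""
    s = slide_emb / np.linalg.norm(slide_emb)
    P = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = P @ s
    return class_names[int(np.argmax(sims))], sims

rng = np.random.default_rng(0)
slide = rng.normal(size=512)          # slide-level embedding from the slide encoder
prompts = rng.normal(size=(2, 512))   # e.g., embeddings of "benign" / "carcinoma" prompts
print(zero_shot_classify(slide, prompts, ["benign", "carcinoma"])[0])
```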
Virchow: A Million-Slide Digital Pathology Foundation Model
Vorontsov, Eugene, Bozkurt, Alican, Casson, Adam, Shaikovski, George, Zelechowski, Michal, Liu, Siqi, Severson, Kristen, Zimmermann, Eric, Hall, James, Tenenholtz, Neil, Fusi, Nicolo, Mathieu, Philippe, van Eck, Alexander, Lee, Donghun, Viret, Julian, Robert, Eric, Wang, Yi Kan, Kunz, Jeremy D., Lee, Matthew C. H., Bernhard, Jan, Godrich, Ran A., Oakley, Gerard, Millar, Ewan, Hanna, Matthew, Retamero, Juan, Moye, William A., Yousfi, Razik, Kanan, Christopher, Klimstra, David, Rothrock, Brandon, Fuchs, Thomas J.
The use of artificial intelligence to enable precision medicine and decision support systems through the analysis of pathology images has the potential to revolutionize the diagnosis and treatment of cancer. Such applications will depend on models' abilities to capture the diverse patterns observed in pathology images. To address this challenge, we present Virchow, a foundation model for computational pathology. Using self-supervised learning empowered by the DINOv2 algorithm, Virchow is a vision transformer model with 632 million parameters trained on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue and specimen types, which is orders of magnitude more data than previous works. The Virchow model enables the development of a pan-cancer detection system with 0.949 overall specimen-level AUC across 17 different cancer types, while also achieving 0.937 AUC on 7 rare cancer types. The Virchow model sets the state-of-the-art on the internal and external image tile level benchmarks and slide level biomarker prediction tasks. The gains in performance highlight the importance of training on massive pathology image datasets, suggesting scaling up the data and network architecture can improve the accuracy for many high-impact computational pathology applications where limited amounts of training data are available.
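Downstream systems consume Virchow as a tile-embedding extractor; a schematic of that pipeline is below, with a tiny stand-in encoder in place of the real 632M-parameter ViT (the tile size and batch size are arbitrary choices of ours):

```python
import torch
import torch.nn as nn

# Stand-in encoder; the real Virchow model is a 632M-parameter vision
# transformer trained with DINOv2, which we do not reproduce here.
encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))

def embed_tiles(tiles, batch_size=64):
    """Embed WSI tiles in batches; pan-cancer detection and biomarker
    models then operate on these tile embeddings."""
    embs = []
    with torch.no_grad():
        for i in range(0, len(tiles), batch_size):
            embs.append(encoder(tiles[i : i + batch_size]))
    return torch.cat(embs)

tiles = torch.randn(130, 3, 224, 224)   # hypothetical 224x224 H&E tiles
print(embed_tiles(tiles).shape)          # torch.Size([130, 256])
```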
DMC4ML: Data Movement Complexity for Machine Learning
Ding, Chen, Kanan, Christopher, McKellips, Dylan, Ozawa, Toranosuke, Shahmirza, Arian, Smith, Wesley
Machine learning places the greatest demand on today's computing. This paper analyzes three machine learning algorithms: transformers, spatial convolution, and FFT. The analysis is novel in three aspects. First, it measures the cost of memory access on an abstract memory hierarchy, instead of traditional time or space complexity. Second, the analysis is asymptotic and identifies the primary sources of the memory cost. Finally, the result is symbolic, so it can be used to select algorithmic parameters, such as the group size in grouped query attention for any dimension size and number of heads, or the batch size in batched convolution for any image size and kernel size.
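To illustrate what a symbolic cost result buys, here is a toy example in the same spirit: express the memory-access cost as a formula in the algorithmic parameter, then solve for the minimizer once, valid for all problem sizes. The cost expression below is invented for illustration and is not the formula derived in the paper:

```python
from sympy import symbols, diff, solve

# n = sequence length, d = head dimension, h = number of heads,
# g = group size in grouped query attention (the parameter to choose).
n, d, h, g = symbols("n d h g", positive=True)

# Hypothetical traffic model: KV-cache movement shrinks as groups share
# keys/values, while a per-group overhead term grows with g.
cost = n * d * h / g + n * d * g

g_opt = solve(diff(cost, g), g)
print(g_opt)  # [sqrt(h)]: the optimal group size for this toy model, for any n and d
```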
BloomVQA: Assessing Hierarchical Multi-modal Comprehension
Gong, Yunye, Shrestha, Robik, Claypoole, Jared, Cogswell, Michael, Ray, Arijit, Kanan, Christopher, Divakaran, Ajay
We propose a novel VQA dataset, based on picture stories designed for educating young children, that aims to facilitate comprehensive evaluation and characterization of vision-language models on comprehension tasks. Unlike current VQA datasets that often focus on fact-based memorization and simple reasoning tasks without principled scientific grounding, we collect data containing tasks reflecting different levels of comprehension and underlying cognitive processes, as laid out in Bloom's Taxonomy, a classic framework widely adopted in education research. The proposed BloomVQA dataset can be mapped to a hierarchical graph-based representation of visual stories, enabling automatic data augmentation and novel measures characterizing model consistency across the underlying taxonomy. We demonstrate graded evaluation and reliability analysis based on our proposed consistency metrics on state-of-the-art vision-language models. Our results suggest that, while current models achieve the most gain on low-level comprehension tasks, they generally fall short on high-level tasks requiring more advanced comprehension and cognitive skills, with a 38.0% drop in VQA accuracy observed between the lowest- and highest-level tasks. Furthermore, current models show consistency patterns misaligned with human comprehension in various scenarios, suggesting emergent structures of model behaviors.
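One way such a taxonomy-aware consistency measure could be computed is sketched below; the specific definition (requiring all lower-level prerequisites to be correct whenever a higher-level answer is) is our own illustrative choice, not BloomVQA's exact metric:

```python
def hierarchical_consistency(results):
    """Toy consistency measure over Bloom's Taxonomy levels.

    Of the (story, level) tasks a model answers correctly at a higher
    level, return the fraction whose lower-level tasks for the same story
    were also all answered correctly.

    results: dict mapping (story_id, level) -> bool (correct or not).
    """
    levels = sorted({lvl for _, lvl in results})
    stories = {sid for sid, _ in results}
    consistent, total = 0, 0
    for sid in stories:
        for lvl in levels[1:]:
            if results.get((sid, lvl)):
                total += 1
                if all(results.get((sid, l), False) for l in levels if l < lvl):
                    consistent += 1
    return consistent / max(total, 1)

demo = {("s1", 1): True, ("s1", 2): True, ("s2", 1): False, ("s2", 2): True}
print(hierarchical_consistency(demo))  # 0.5: s2 got level 2 right but missed level 1
```

A model that answers high-level questions correctly without mastering the underlying low-level ones would score high on accuracy but low on this kind of consistency, which is the misalignment the abstract highlights.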
Continual Learning: Applications and the Road Forward
Verwimp, Eli, Aljundi, Rahaf, Ben-David, Shai, Bethge, Matthias, Cossu, Andrea, Gepperth, Alexander, Hayes, Tyler L., Hüllermeier, Eyke, Kanan, Christopher, Kudithipudi, Dhireesha, Lampert, Christoph H., Mundt, Martin, Pascanu, Razvan, Popescu, Adrian, Tolias, Andreas S., van de Weijer, Joost, Liu, Bing, Lomonaco, Vincenzo, Tuytelaars, Tinne, van de Ven, Gido M.
Continual learning is a sub-field of machine learning that aims to let models learn continuously from new data, accumulating knowledge without forgetting what was learned in the past. In this work, we take a step back, and ask: "Why should one care about continual learning in the first place?". We set the stage by surveying recent continual learning papers published at three major machine learning conferences, and show that memory-constrained settings dominate the field. Then, we discuss five open problems in machine learning, and even though they seem unrelated to continual learning at first sight, we show that continual learning will inevitably be part of their solution. These problems are model editing, personalization, on-device learning, faster (re-)training, and reinforcement learning. Finally, by comparing the desiderata of these unsolved problems with the current assumptions in continual learning, we highlight and discuss four future directions for continual learning research. We hope that this work offers an interesting perspective on the future of continual learning, while displaying its potential value and the paths we have to pursue in order to make it successful. This work is the result of the many discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, in March 2023.