Green, Bradley
Capabilities of Gemini Models in Medicine
Saab, Khaled, Tu, Tao, Weng, Wei-Hung, Tanno, Ryutaro, Stutz, David, Wulczyn, Ellery, Zhang, Fan, Strother, Tim, Park, Chunjong, Vedadi, Elahe, Chaves, Juanma Zambrano, Hu, Szu-Yeu, Schaekermann, Mike, Kamath, Aishwarya, Cheng, Yong, Barrett, David G. T., Cheung, Cathy, Mustafa, Basil, Palepu, Anil, McDuff, Daniel, Hou, Le, Golany, Tomer, Liu, Luyang, Alayrac, Jean-baptiste, Houlsby, Neil, Tomasev, Nenad, Freyberg, Jan, Lau, Charles, Kemp, Jonas, Lai, Jeremy, Azizi, Shekoofeh, Kanada, Kimberly, Man, SiWai, Kulkarni, Kavita, Sun, Ruoxi, Shakeri, Siamak, He, Luheng, Caine, Ben, Webson, Albert, Latysheva, Natasha, Johnson, Melvin, Mansfield, Philip, Lu, Jian, Rivlin, Ehud, Anderson, Jesper, Green, Bradley, Wong, Renee, Krause, Jonathan, Shlens, Jonathon, Dominowska, Ewa, Eslami, S. M. Ali, Chou, Katherine, Cui, Claire, Vinyals, Oriol, Kavukcuoglu, Koray, Manyika, James, Dean, Jeff, Hassabis, Demis, Matias, Yossi, Webster, Dale, Barral, Joelle, Corrado, Greg, Semturs, Christopher, Mahdavi, S. Sara, Gottweis, Juraj, Karthikesalingam, Alan, Natarajan, Vivek
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpassing the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
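The uncertainty-guided search strategy mentioned above can be pictured as a sample-then-escalate loop: draw several candidate answers, and only invoke web search when the candidates disagree. The sketch below illustrates that control flow only; the sample counts, the agreement threshold, and the llm / search callables are placeholders for illustration, not Med-Gemini's actual implementation.

    from collections import Counter
    from typing import Callable

    def uncertainty_guided_answer(
        question: str,
        llm: Callable[[str], str],          # stochastic LLM call: prompt -> answer text
        search: Callable[[str], str],       # retrieval call: query -> evidence text
        n_samples: int = 5,
        agreement_threshold: float = 0.6,
    ) -> str:
        # Step 1: sample several independent answers and measure how much they agree.
        drafts = [llm(question) for _ in range(n_samples)]
        top_answer, top_count = Counter(drafts).most_common(1)[0]
        # High agreement -> low uncertainty: return the majority answer without retrieval.
        if top_count / n_samples >= agreement_threshold:
            return top_answer
        # Step 2: high uncertainty -> fetch evidence and re-answer with it in context.
        evidence = search(question)
        revised = [llm(f"{evidence}\n\nQuestion: {question}") for _ in range(n_samples)]
        return Counter(revised).most_common(1)[0][0]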
Augmentations vs Algorithms: What Works in Self-Supervised Learning
Morningstar, Warren, Bijamov, Alex, Duvarney, Chris, Friedman, Luke, Kalibhat, Neha, Liu, Luyang, Mansfield, Philip, Rojas-Gomez, Renan, Singhal, Karan, Green, Bradley, Prakash, Sushant
We study the relative effects of data augmentations, pretraining algorithms, and model architectures in Self-Supervised Learning (SSL). While the recent literature in this space leaves the impression that the pretraining algorithm is of critical importance to performance, understanding its effect is complicated by the difficulty in making objective and direct comparisons between methods. We propose a new framework which unifies many seemingly disparate SSL methods into a single shared template. Using this framework, we identify aspects in which methods differ and observe that in addition to changing the pretraining algorithm, many works also use new data augmentations or more powerful model architectures. We compare several popular SSL methods using our framework and find that many algorithmic additions, such as prediction networks or new losses, have a minor impact on downstream task performance (often less than 1%), while enhanced augmentation techniques offer more significant performance improvements (2-4%). Our findings challenge the premise that SSL is being driven primarily by algorithmic improvements, and suggest instead a bitter lesson for SSL: that augmentation diversity and data/model scale are more critical contributors to recent advances in self-supervised learning.
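To make the "single shared template" concrete, the sketch below factors one pretraining step into the pieces the paper compares: the augmentation pipeline, the encoder/projector architecture, an optional algorithmic add-on such as a prediction network, and the loss. The InfoNCE-style loss and the function names are illustrative choices under that template, not the framework's actual code.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
        # One possible choice of loss in the template: contrast matched views against the batch.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature                   # (N, N) similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
        return F.cross_entropy(logits, labels)

    def ssl_step(batch, augment, encoder, projector, predictor, loss_fn):
        # Shared template: augment twice -> encode -> project -> (optionally) predict -> loss.
        v1, v2 = augment(batch), augment(batch)              # augmentation pipeline (often the biggest lever)
        z1, z2 = projector(encoder(v1)), projector(encoder(v2))
        if predictor is not None:                            # algorithmic add-on (e.g. a prediction network)
            z1 = predictor(z1)
        return loss_fn(z1, z2)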
User-LLM: Efficient LLM Contextualization with User Embeddings
Ning, Lin, Liu, Luyang, Wu, Jiaxing, Wu, Neo, Berlowitz, Devora, Prakash, Sushant, Green, Bradley, O'Banion, Shawn, Xie, Jun
Large language models (LLMs) have revolutionized natural language processing. However, effectively incorporating complex and potentially noisy user interaction data remains a challenge. To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. These embeddings, distilled from diverse user interactions using self-supervised pretraining, capture latent user preferences and their evolution over time. We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. Our comprehensive experiments on MovieLens, Amazon Review, and Google Local Review datasets demonstrate significant performance gains across various tasks. Notably, our approach outperforms text-prompt-based contextualization on long sequence tasks and tasks that require deep user understanding while being computationally efficient. We further incorporate Perceiver layers to streamline the integration between user encoders and LLMs, reducing computational demands.
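As a rough illustration of how cross-attention can fold user embeddings into an LLM, the module below projects a sequence of user-interaction embeddings into the model's hidden size and lets the token states attend to them through a residual connection. The class name, layer sizes, and residual wiring are assumptions made for this sketch, not the User-LLM architecture itself.

    import torch
    import torch.nn as nn

    class UserCrossAttention(nn.Module):
        # Inject user-embedding vectors into LLM hidden states via cross-attention (illustrative).
        def __init__(self, hidden_dim: int, user_dim: int, n_heads: int = 8):
            super().__init__()
            self.user_proj = nn.Linear(user_dim, hidden_dim)   # map user embeddings into model space
            self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, token_states: torch.Tensor, user_embeddings: torch.Tensor) -> torch.Tensor:
            # token_states: (batch, seq_len, hidden_dim); user_embeddings: (batch, n_events, user_dim)
            user_kv = self.user_proj(user_embeddings)
            attended, _ = self.attn(query=token_states, key=user_kv, value=user_kv)
            return self.norm(token_states + attended)          # residual back into the LLM stack

    # Example shapes: 2 sequences of 16 tokens attending over 10 user-interaction embeddings.
    layer = UserCrossAttention(hidden_dim=256, user_dim=64)
    out = layer(torch.randn(2, 16, 256), torch.randn(2, 10, 64))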
Towards Generalist Biomedical AI
Tu, Tao, Azizi, Shekoofeh, Driess, Danny, Schaekermann, Mike, Amin, Mohamed, Chang, Pi-Chuan, Carroll, Andrew, Lau, Chuck, Tanno, Ryutaro, Ktena, Ira, Mustafa, Basil, Chowdhery, Aakanksha, Liu, Yun, Kornblith, Simon, Fleet, David, Mansfield, Philip, Prakash, Sushant, Wong, Renee, Virmani, Sunny, Semturs, Christopher, Mahdavi, S Sara, Green, Bradley, Dominowska, Ewa, Agüera y Arcas, Blaise, Barral, Joelle, Webster, Dale, Corrado, Greg S., Matias, Yossi, Singhal, Karan, Florence, Pete, Karthikesalingam, Alan, Natarajan, Vivek
Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret this data at scale can potentially enable impactful applications ranging from scientific discovery to care delivery. To enable the development of these models, we first curate MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system. Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. We also report examples of zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and emergent zero-shot medical reasoning. To further probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales. In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. While considerable work is needed to validate these models in real-world use cases, our results represent a milestone towards the development of generalist biomedical AI systems.
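One common way to let a single set of weights cover several modalities, and a plausible reading of the setup described above, is to project non-text features into the language model's token-embedding space and interleave them with text tokens. The sketch below shows only that pattern, with placeholder names and dimensions rather than Med-PaLM M's actual encoders.

    import torch
    import torch.nn as nn

    class MultimodalToTokens(nn.Module):
        # Map non-text features to "soft tokens" so one transformer handles every task (illustrative).
        def __init__(self, image_feat_dim: int, lm_embed_dim: int):
            super().__init__()
            self.image_proj = nn.Linear(image_feat_dim, lm_embed_dim)

        def forward(self, text_token_embeds: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
            # text_token_embeds: (batch, T, lm_embed_dim); image_features: (batch, P, image_feat_dim)
            image_tokens = self.image_proj(image_features)
            # Prepend the projected image "tokens"; the shared LM decodes the combined sequence.
            return torch.cat([image_tokens, text_token_embeds], dim=1)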
Towards Expert-Level Medical Question Answering with Large Language Models
Singhal, Karan, Tu, Tao, Gottweis, Juraj, Sayres, Rory, Wulczyn, Ellery, Hou, Le, Clark, Kevin, Pfohl, Stephen, Cole-Lewis, Heather, Neal, Darlene, Schaekermann, Mike, Wang, Amy, Amin, Mohamed, Lachgar, Sami, Mansfield, Philip, Prakash, Sushant, Green, Bradley, Dominowska, Ewa, Agüera y Arcas, Blaise, Tomasev, Nenad, Liu, Yun, Wong, Renee, Semturs, Christopher, Mahdavi, S. Sara, Barral, Joelle, Webster, Dale, Corrado, Greg S., Matias, Yossi, Azizi, Shekoofeh, Karthikesalingam, Alan, Natarajan, Vivek
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
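The ensemble refinement prompting strategy can be sketched as two stages: sample several reasoning chains, then prompt the model again with those chains in context and aggregate the refined answers. The prompt wording, sample counts, and majority-vote aggregation below are placeholders for illustration, not the paper's exact recipe.

    from collections import Counter
    from typing import Callable

    def ensemble_refinement(question: str, generate: Callable[[str], str],
                            n_drafts: int = 11, n_refinements: int = 3) -> str:
        # Stage 1: sample diverse chain-of-thought drafts with a stochastic LLM call.
        drafts = [generate(f"Question: {question}\nExplain your reasoning step by step, then answer.")
                  for _ in range(n_drafts)]
        context = "\n\n".join(f"Candidate reasoning {i + 1}:\n{d}" for i, d in enumerate(drafts))
        # Stage 2: condition on the drafts and ask for a consolidated answer, several times.
        refine_prompt = (f"Question: {question}\n\n{context}\n\n"
                         "Considering the candidate reasonings above, give a single final answer.")
        refined = [generate(refine_prompt) for _ in range(n_refinements)]
        # Simple majority vote over the refined answers.
        return Counter(refined).most_common(1)[0][0]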
Federated Training of Dual Encoding Models on Small Non-IID Client Datasets
Vemulapalli, Raviteja, Morningstar, Warren Richard, Mansfield, Philip Andrew, Eichner, Hubert, Singhal, Karan, Afkanpour, Arash, Green, Bradley
Dual encoding models that encode a pair of inputs are widely used for representation learning. Many approaches train dual encoding models by maximizing agreement between pairs of encodings on centralized training data. However, in many scenarios, datasets are inherently decentralized across many clients (user devices or organizations) due to privacy concerns, motivating federated learning. In this work, we focus on federated training of dual encoding models on decentralized data composed of many small, non-IID (independent and identically distributed) client datasets. We show that existing approaches that work well in centralized settings perform poorly when naively adapted to this setting using federated averaging. We observe that we can simulate large-batch loss computation on individual clients for loss functions that are based on encoding statistics. Based on this insight, we propose a novel federated training approach, Distributed Cross Correlation Optimization (DCCO), which trains dual encoding models using encoding statistics aggregated across clients, without sharing individual data samples. Our experimental results on two datasets demonstrate that the proposed DCCO approach outperforms federated variants of existing approaches by a large margin.
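The key point that only encoding statistics, never raw samples, leave a client can be illustrated with a cross-correlation style objective: each client contributes a small d x d matrix and a sample count, and the server forms the loss from the aggregate. The standardization assumptions and the Barlow Twins-style penalty below are simplifications for the sketch, not DCCO's exact formulation.

    import torch

    def client_statistics(z1: torch.Tensor, z2: torch.Tensor):
        # Per-client sufficient statistics: only a (d x d) matrix and a count leave the client.
        # z1, z2 are the client's two sets of (n, d) encodings, assumed standardized per dimension.
        return z1.t() @ z2, z1.size(0)

    def aggregated_cross_correlation_loss(per_client_stats, off_diag_weight: float = 5e-3) -> torch.Tensor:
        # Server side: pool statistics across clients to approximate a large-batch
        # cross-correlation matrix, then apply a Barlow Twins-style penalty.
        total = sum(n for _, n in per_client_stats)
        corr = sum(m for m, _ in per_client_stats) / total
        diag = torch.diagonal(corr)
        on_diag = (diag - 1.0).pow(2).sum()                 # matched dimensions should correlate
        off_diag = (corr - torch.diag(diag)).pow(2).sum()   # unmatched dimensions should not
        return on_diag + off_diag_weight * off_diag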
Contrastive Learning for Label-Efficient Semantic Segmentation
Zhao, Xiangyun, Vemulapalli, Raviteja, Mansfield, Philip, Gong, Boqing, Green, Bradley, Shapira, Lior, Wu, Ying
Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise class label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability, thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with label-based contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited.
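A minimal version of a pixel-wise, class-label-based contrastive loss is sketched below: pixel embeddings that share a class label are treated as positives and all others as negatives. The pixel sampling, temperature, and ignore-label handling are simplifications; the paper's actual loss and training schedule differ in detail.

    import torch
    import torch.nn.functional as F

    def pixelwise_label_contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                                         temperature: float = 0.1, ignore_index: int = 255) -> torch.Tensor:
        # features: (N, D) per-pixel embeddings, already sampled and flattened; labels: (N,) class ids.
        keep = labels != ignore_index
        z, y = F.normalize(features[keep], dim=1), labels[keep]
        n = z.size(0)
        sim = z @ z.t() / temperature
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        # Positives: other pixels with the same class label; negatives: everything else.
        pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
        log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float('-inf')), dim=1, keepdim=True)
        # Average log-probability of positives per anchor; anchors with no positive are skipped.
        pos_counts = pos_mask.sum(dim=1)
        valid = pos_counts > 0
        loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
        return loss.mean()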