Lee, Justin
Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM
Alnumay, Yazeed, Barbet, Alexandre, Bialas, Anna, Darling, William, Desai, Shaan, Devassy, Joan, Duffy, Kyle, Howe, Stephanie, Lasche, Olivia, Lee, Justin, Shrinivason, Anirudh, Tracey, Jennifer
Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small, 7B-parameter, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
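To make the synthesis-and-refinement idea concrete, the sketch below shows a generic synthetic-data loop with a human-in-the-loop filter of the kind the abstract describes; generate_candidates, the score field, and the acceptance threshold are hypothetical stand-ins, not the paper's actual pipeline.

import random

def generate_candidates(seed_prompts, n_per_seed=4):
    # Stand-in for a teacher LLM producing synthetic Arabic training pairs
    # with an automatic quality score attached to each sample.
    return [
        {"prompt": p, "response": f"synthetic answer {i} for: {p}", "score": random.random()}
        for p in seed_prompts
        for i in range(n_per_seed)
    ]

def human_in_the_loop_filter(candidates, threshold=0.7):
    # Keep high-scoring samples automatically; route borderline ones to
    # human annotators for review before they enter the training corpus.
    accepted, for_review = [], []
    for c in candidates:
        (accepted if c["score"] >= threshold else for_review).append(c)
    return accepted, for_review

seeds = ["اشرح مفهوم التعلم الآلي"]  # "Explain the concept of machine learning"
accepted, for_review = human_in_the_loop_filter(generate_candidates(seeds))
print(len(accepted), "accepted;", len(for_review), "sent to annotators")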
ContriMix: Unsupervised disentanglement of content and attribute for domain generalization in microscopy image analysis
Nguyen, Tan H., Juyal, Dinkar, Li, Jin, Prakash, Aaditya, Nofallah, Shima, Shah, Chintan, Gullapally, Sai Chowdary, Yu, Limin, Griffin, Michael, Sampat, Anand, Abel, John, Lee, Justin, Taylor-Weiner, Amaro
Domain generalization is critical for real-world applications of machine learning to microscopy images, including histopathology and fluorescence imaging. Artifacts in these modalities arise through a complex combination of factors relating to tissue collection and laboratory processing, as well as factors intrinsic to patient samples. In fluorescence imaging, these artifacts stem from variations across experimental batches. The complexity and subtlety of these artifacts make the enumeration of data domains intractable. Therefore, augmentation-based methods of domain generalization that require domain identifiers and manual fine-tuning are inadequate in this setting. To overcome this challenge, we introduce ContriMix, a domain generalization technique that learns to generate synthetic images by disentangling and permuting the biological content ("content") and technical variations ("attributes") in microscopy images. ContriMix does not rely on domain identifiers or handcrafted augmentations and makes no assumptions about the input characteristics of images. We assess the performance of ContriMix on two pathology datasets addressing patch classification and whole slide image label prediction tasks, respectively (Camelyon17-WILDS and RCC subtyping), and one fluorescence microscopy dataset (RxRx1-WILDS). Without any access to domain identifiers at train or test time, ContriMix performs similarly to or better than current state-of-the-art methods on all of these datasets, motivating its usage for microscopy image analysis in real-world settings where domain information is hard to come by. The code for ContriMix can be found at https://gitlab.com/huutan86/contrimix
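As a rough illustration of the disentangle-and-permute idea, the following PyTorch sketch encodes a batch into per-pixel content maps and global attribute vectors, then shuffles the attributes across the batch before decoding synthetic images; the architecture, dimensions, and module names are illustrative assumptions, not the released ContriMix implementation (see the GitLab link above for that).

import torch
import torch.nn as nn

class ContriMixSketch(nn.Module):
    def __init__(self, channels=3, content_dim=16, attr_dim=8):
        super().__init__()
        # Content encoder: per-pixel maps of biological content.
        self.content_enc = nn.Conv2d(channels, content_dim, kernel_size=3, padding=1)
        # Attribute encoder: one global vector of technical variation
        # (e.g., stain, scanner, batch effects) per image.
        self.attr_enc = nn.Sequential(
            nn.Conv2d(channels, attr_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Decoder renders content under a (possibly permuted) attribute vector.
        self.decoder = nn.Conv2d(content_dim + attr_dim, channels, kernel_size=3, padding=1)

    def forward(self, x):
        content = self.content_enc(x)   # (B, content_dim, H, W)
        attr = self.attr_enc(x)         # (B, attr_dim)
        # Permute attributes across the batch so each image is re-rendered
        # with another image's technical characteristics.
        perm = torch.randperm(x.size(0))
        attr_maps = attr[perm, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        synthetic = self.decoder(torch.cat([content, attr_maps], dim=1))
        return synthetic, content, attr

model = ContriMixSketch()
batch = torch.randn(4, 3, 64, 64)   # stand-in for microscopy patches
synthetic, _, _ = model(batch)      # augmented views for a downstream classifier

Note that this captures only the permutation mechanism; the actual method also trains with reconstruction and consistency objectives so that content and attributes separate cleanly.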
The Importance of Prompt Tuning for Automated Neuron Explanations
Lee, Justin, Oikarinen, Tuomas, Chatha, Arjun, Chang, Keng-Chi, Chen, Yilan, Weng, Tsui-Wei
Recent advances have greatly increased the capabilities of large language models (LLMs), but our understanding of the models and their safety has not progressed as fast. In this paper we aim to understand LLMs more deeply by studying their individual neurons. We build upon previous work showing that large language models such as GPT-4 can be useful in explaining what each neuron in a language model does. Specifically, we analyze the effect of the prompt used to generate explanations and show that reformatting the explanation prompt in a more natural way can significantly improve neuron explanation quality and greatly reduce computational cost. We demonstrate the effects of our new prompts in three different ways, incorporating both automated and human evaluations.
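In the spirit of the reformatting idea, here is a minimal sketch that renders a neuron's top-activating tokens as natural-language text before asking an explainer model for a summary; the prompt wording, the example tokens, and the commented-out query_llm call are illustrative assumptions, not the paper's exact prompts.

def build_explanation_prompt(token_activations):
    # Render (token, activation) pairs as readable text rather than a raw
    # tabular dump, the kind of "more natural" formatting the paper studies.
    lines = [f'"{tok}" (activation {act:.2f})' for tok, act in token_activations]
    return (
        "The following tokens most strongly activate one neuron in a language model:\n"
        + "\n".join(lines)
        + "\nIn one sentence, what concept does this neuron respond to?"
    )

top_tokens = [("dog", 9.1), ("puppy", 8.4), ("terrier", 7.9)]  # hypothetical records
prompt = build_explanation_prompt(top_tokens)
# explanation = query_llm(prompt)  # hypothetical call to GPT-4 or another explainer
print(prompt)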
Synthetic DOmain-Targeted Augmentation (S-DOTA) Improves Model Generalization in Digital Pathology
Gullapally, Sai Chowdary, Zhang, Yibo, Mittal, Nitin Kumar, Kartik, Deeksha, Srinivasan, Sandhya, Rose, Kevin, Shenker, Daniel, Juyal, Dinkar, Padigela, Harshith, Biju, Raymond, Minden, Victor, Maheshwari, Chirag, Thibault, Marc, Goldstein, Zvi, Novak, Luke, Chandra, Nidhi, Lee, Justin, Prakash, Aaditya, Shah, Chintan, Abel, John, Fahy, Darren, Taylor-Weiner, Amaro, Sampat, Anand
Machine learning algorithms have the potential to improve patient outcomes in digital pathology. However, generalization of these tools is currently limited by sensitivity to variations in tissue preparation, staining procedures, and scanning equipment that lead to domain shift in digitized slides. To overcome this limitation and improve model generalization, we studied the effectiveness of two Synthetic DOmain-Targeted Augmentation (S-DOTA) methods, namely CycleGAN-enabled Scanner Transform (ST) and targeted Stain Vector Augmentation (SVA), and compared them against the International Color Consortium (ICC) profile-based color calibration (ICC Cal) method and a baseline method using traditional brightness, color, and noise augmentations. We evaluated the ability of these techniques to improve model generalization across various tasks and settings: four models, two model types (tissue segmentation and cell classification), two loss functions, six labs, six scanners, and three indications (hepatocellular carcinoma (HCC), nonalcoholic steatohepatitis (NASH), and prostate adenocarcinoma). We compared these methods based on the macro-averaged F1 scores on in-distribution (ID) and out-of-distribution (OOD) test sets across multiple domains, and found that S-DOTA methods (i.e., ST and SVA) led to significant improvements over ICC Cal and baseline on OOD data while maintaining comparable performance on ID data. Thus, we demonstrate that S-DOTA may help address generalization gaps due to domain shift in real-world applications.
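For intuition on the SVA component, the sketch below decomposes an RGB patch into stain concentrations in optical density space, jitters the stain vectors, and recomposes the image to simulate lab-to-lab staining variation; the nominal H&E stain matrix, the jitter scale, and the simple least-squares decomposition are illustrative assumptions, not the paper's fitted, domain-targeted values.

import numpy as np

# Nominal H&E stain vectors (rows: hematoxylin, eosin) in RGB optical density.
STAIN_MATRIX = np.array([[0.65, 0.70, 0.29],
                         [0.07, 0.99, 0.11]])

def stain_vector_augment(rgb, sigma=0.05, rng=None):
    # Decompose an RGB patch into stain concentrations, perturb the stain
    # vectors, and recompose, emulating a target domain's staining.
    if rng is None:
        rng = np.random.default_rng()
    od = -np.log(np.clip(rgb.astype(np.float64) / 255.0, 1e-6, 1.0))  # to optical density
    pixels = od.reshape(-1, 3)
    # Least-squares stain concentrations under the nominal stain matrix.
    conc, *_ = np.linalg.lstsq(STAIN_MATRIX.T, pixels.T, rcond=None)
    jittered = STAIN_MATRIX * (1.0 + rng.normal(0.0, sigma, STAIN_MATRIX.shape))
    od_aug = (jittered.T @ conc).T.reshape(od.shape)
    return (np.exp(-od_aug) * 255.0).clip(0, 255).astype(np.uint8)

patch = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # stand-in patch
augmented = stain_vector_augment(patch)

In the targeted setting the paper describes, the perturbed vectors would be drawn toward stain statistics measured in a specific target lab or scanner domain rather than jittered randomly as here.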