MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models

Huang, Kaichen, Huo, Jiahao, Yan, Yibo, Wang, Kun, Yue, Yutao, Hu, Xuming

arXiv.org Artificial Intelligence

In recent years, multimodal large language models (MLLMs) have advanced significantly, integrating more modalities into diverse applications. However, the lack of explainability remains a major barrier to their use in scenarios requiring decision transparency. Current neuron-level explanation paradigms mainly focus on knowledge localization or language- and domain-specific analyses, leaving multimodality largely unexplored. To tackle these challenges, we propose MINER, a transferable framework for mining modality-specific neurons (MSNs) in MLLMs, which comprises four stages: modality separation, importance score calculation, importance score aggregation, and modality-specific neuron selection. Extensive experiments across six benchmarks and two representative MLLMs show that (I) deactivating ONLY 2% of MSNs significantly reduces MLLM performance (from 0.56 to 0.24 for Qwen2-VL, from 0.69 to 0.31 for Qwen2-Audio), (II) different modalities mainly converge in the lower layers, (III) MSNs influence how key information from various modalities converges to the last token, and (IV) there are two intriguing phenomena worth further investigation, i.e., semantic probing and semantic telomeres. The source code is available at this URL.

However, the black-box nature of MLLMs presents challenges, particularly in fields like medical studies (González-Alday et al., 2023), where interpretability is essential. Understanding the decision-making process is vital, making explainability a central focus of ongoing research (Tjoa & Guan, 2020; Zhao et al., 2024). Numerous studies have sought to understand how knowledge is stored in models (Sukhbaatar et al., 2019; Dai et al., 2021; Meng et al., 2022a; Chen et al., 2024a) and how this information influences decision-making (Geva et al., 2020; Petroni et al., 2019). For example, Dai et al. (2021) and Geva et al. (2020) investigate knowledge storage mechanisms, while Wendler et al. (2024) and Zhang et al. (2024) provide insights into layer-level explainability.
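The four-stage pipeline described in the abstract can be sketched in a deliberately simplified form. Everything here is an illustrative assumption, not the paper's implementation: the function name `modality_specific_neurons`, the mean-absolute-activation importance score, and the normalization used for aggregation.

```python
import numpy as np

def modality_specific_neurons(acts_by_modality, top_frac=0.02):
    """Toy MSN selection: pick neurons whose importance concentrates
    on one modality.

    acts_by_modality: dict of modality name -> (tokens, neurons) array
    of activations collected after modality separation (stage 1).
    Returns: dict of modality -> indices of its most specific neurons.
    """
    names = list(acts_by_modality)
    # Stage 2: importance score per (modality, neuron) as mean |activation|.
    scores = np.stack([np.abs(acts_by_modality[m]).mean(axis=0) for m in names])
    # Stage 3: aggregate -- normalize scores across modalities per neuron.
    probs = scores / (scores.sum(axis=0, keepdims=True) + 1e-12)
    # Stage 4: per modality, keep the top fraction of neurons by how
    # strongly their importance concentrates on that modality.
    k = max(1, int(top_frac * scores.shape[1]))
    return {m: np.argsort(probs[i])[-k:] for i, m in enumerate(names)}

# Synthetic activations: neurons 0-49 respond strongly to "image",
# neurons 50-99 to "audio".
rng = np.random.default_rng(0)
acts = {
    "image": rng.normal(size=(64, 100)) * np.where(np.arange(100) < 50, 3.0, 1.0),
    "audio": rng.normal(size=(64, 100)) * np.where(np.arange(100) >= 50, 3.0, 1.0),
}
msn = modality_specific_neurons(acts, top_frac=0.02)
```

In this synthetic setup, selecting the top 2% recovers neurons from the correct half for each modality; deactivating such neurons (zeroing their outputs) is what the reported performance drops correspond to.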


Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training

Wang, Yixuan, Luo, Xianzhen, Wei, Fuxuan, Liu, Yijun, Zhu, Qingfu, Zhang, Xuanyu, Yang, Qing, Xu, Dongliang, Che, Wanxiang

arXiv.org Artificial Intelligence

Existing speculative decoding methods typically require additional model structure and training processes to assist the model in generating draft tokens. This makes migrating such acceleration methods to new models more costly and more demanding of device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x without compromising model performance. The MSN model also achieves acceleration ratios comparable to SOTA models with additional model structure on Spec-Bench.
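The plain Jacobi iteration underlying this decoding strategy (without the paper's tree-based retrieval augmentation) can be sketched with a toy greedy model. The function name `jacobi_decode` and the toy transition rule are assumptions for illustration only:

```python
def jacobi_decode(next_token, prompt, draft_len=8, max_iters=50):
    """Greedy Jacobi iteration: re-predict all draft positions from the
    previous iteration's tokens until the draft reaches a fixed point,
    which equals the autoregressive greedy output.

    next_token(seq) -> greedy next token for the given prefix.
    """
    draft = [0] * draft_len          # arbitrary initial guesses
    for _ in range(max_iters):
        seq = list(prompt) + draft
        # One Jacobi step: position i is re-predicted from the tokens
        # to its left as they stood in the *previous* iteration.
        new = [next_token(seq[: len(prompt) + i]) for i in range(draft_len)]
        if new == draft:             # fixed point reached
            return draft
        draft = new
    return draft

# Toy "model": the next token is (last token + 1) mod 10.
nxt = lambda seq: (seq[-1] + 1) % 10
out = jacobi_decode(nxt, prompt=[3], draft_len=5)
```

In a real LLM the inner comprehension is a single batched forward pass, so each iteration can verify and accept several tokens at once; denoising-style training like MSN is what makes those parallel predictions converge in few iterations.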


Predicting Generalization of AI Colonoscopy Models to Unseen Data

Shor, Joel, McNeil, Carson, Intrator, Yotam, Ledsam, Joseph R, Yamano, Hiro-o, Tsurumaru, Daisuke, Kayama, Hiroki, Hamabe, Atsushi, Ando, Koji, Ota, Mitsuhiko, Ogino, Haruei, Nakase, Hiroshi, Kobayashi, Kaho, Miyo, Masaaki, Oki, Eiji, Takemasa, Ichiro, Rivlin, Ehud, Goldenberg, Roman

arXiv.org Artificial Intelligence

$\textbf{Background}$: Generalizability of AI colonoscopy algorithms is important for wider adoption in clinical practice. However, current techniques for evaluating performance on unseen data require expensive and time-intensive labels. $\textbf{Methods}$: We use a "Masked Siamese Network" (MSN) to identify novel phenomena in unseen data and predict polyp detector performance. MSN is trained to predict masked-out regions of polyp images, without any labels. We test MSN's ability to be trained on data only from Israel and detect unseen techniques, narrow-band imaging (NBI) and chromoendoscopy (CE), on colonoscopes from Japan (354 videos, 128 hours). We also test MSN's ability to predict performance of Computer Aided Detection (CADe) of polyps on colonoscopies from both countries, even though MSN is not trained on data from Japan. $\textbf{Results}$: MSN correctly identifies NBI and CE as less similar to Israel whitelight than Japan whitelight (bootstrapped z-test, |z| > 496, p < 10^-8 for both) using the label-free Fréchet distance. MSN detects NBI with 99% accuracy, predicts CE better than our heuristic (90% vs 79% accuracy) despite being trained only on whitelight, and is the only method that is robust to noisy labels. MSN predicts CADe polyp detector performance on in-domain Israel and out-of-domain Japan colonoscopies (r=0.79, 0.37 respectively). With few examples of Japan detector performance to train on, MSN prediction of Japan performance improves (r=0.56). $\textbf{Conclusion}$: Our technique can identify distribution shifts in clinical data and can predict CADe detector performance on unseen data, without labels. Our self-supervised approach can aid in detecting when data in practice differs from training data, such as between hospitals, or when data has meaningfully shifted from training. MSN has potential for application to medical image domains beyond colonoscopy.
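The label-free Fréchet distance used here is, in its common FID-style form, the Fréchet distance between Gaussians fit to two sets of feature embeddings. A minimal sketch, assuming generic feature arrays rather than the paper's actual MSN embeddings:

```python
import numpy as np
from scipy import linalg

def frechet_distance(x, y):
    """Frechet distance between Gaussians fit to two embedding sets.

    x, y: (samples, dims) arrays of image features. No labels needed:
    only the feature distributions are compared.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Matrix square root of the covariance product; sqrtm may return a
    # complex result with a tiny imaginary part, which we discard.
    covmean = linalg.sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

# Same-distribution pair vs. a shifted distribution.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 8))
b = rng.normal(0.0, 1.0, size=(500, 8))   # matches a
c = rng.normal(3.0, 1.0, size=(500, 8))   # distribution shift
```

A larger distance between deployment-time features and training features (as for NBI/CE vs. whitelight above) flags a distribution shift without any ground-truth labels.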


Obituary That Called Late NBA Player 'Useless' Sparks Firestorm

Huffington Post - Tech news and opinion

Social media users hurled criticism at Microsoft this week for what many thought was an AI-generated obituary for NBA player Brandon Hunter on its website MSN. The controversy began after the obituary -- which had a headline that read "Brandon Hunter useless at 42" written by "Editor" -- appeared on the Microsoft-owned platform after Hunter's death on Tuesday. The obituary goes on to refer to the former Boston Celtics and Orlando Magic player having been "handed away on the age of 42" and claimed he "performed in 67 video games over two seasons and achieved a career-high of 17 factors in a recreation in opposition to the Milwaukee Bucks in 2004." The post appeared to follow a similar format to a story on TMZ Sports, Futurism noted, "albeit with altered punctuation and a use of synonyms so liberal that the result is essentially incomprehensible." You can compare both the obituary containing the error and the TMZ Sports story here.


Self-Supervised Learning for Endoscopic Video Analysis

Hirsch, Roy, Caron, Mathilde, Cohen, Regev, Livne, Amir, Shapiro, Ron, Golany, Tomer, Goldenberg, Roman, Freedman, Daniel, Rivlin, Ehud

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) has led to important breakthroughs in computer vision by allowing learning from large amounts of unlabeled data. As such, it might have a pivotal role to play in biomedicine, where annotating data requires highly specialized expertise. Yet there are many healthcare domains for which SSL has not been extensively explored. One such domain is endoscopy, a class of minimally invasive procedures commonly used to detect and treat infections, chronic inflammatory diseases, or cancer. In this work, we study the use of a leading SSL framework, namely Masked Siamese Networks (MSNs), for endoscopic video analysis such as colonoscopy and laparoscopy. To fully exploit the power of SSL, we create sizable unlabeled endoscopic video datasets for training MSNs. These strong image representations serve as a foundation for secondary training with limited annotated datasets, resulting in state-of-the-art performance on endoscopic benchmarks like surgical phase recognition during laparoscopy and colonoscopic polyp characterization. Additionally, we achieve a 50% reduction in annotated data size without sacrificing performance. Thus, our work provides evidence that SSL can dramatically reduce the need for annotated data in endoscopy.


Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

Shekhar, Shashank, Bordes, Florian, Vincent, Pascal, Morcos, Ari

arXiv.org Artificial Intelligence

Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the learned representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that fine-tuning re-organizes the information to be more similar to pre-trained joint embedding models.


Weighted Ensemble Self-Supervised Learning

Ruan, Yangjun, Singh, Saurabh, Morningstar, Warren, Alemi, Alexander A., Ioffe, Sergey, Fischer, Ian, Dillon, Joshua V.

arXiv.org Artificial Intelligence

Ensembling has proven to be a powerful technique for boosting model performance, uncertainty estimation, and robustness in supervised learning. Advances in self-supervised learning (SSL) enable leveraging large unlabeled corpora for state-of-the-art few-shot and supervised learning performance. In this paper, we explore how ensemble methods can improve recent SSL techniques by developing a framework that permits data-dependent weighted cross-entropy losses. We refrain from ensembling the representation backbone; this choice yields an efficient ensemble method that incurs a small training cost and requires no architectural changes or computational overhead for downstream evaluation. The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both in multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. We explore several weighting schemes and find that those which increase the diversity of ensemble heads lead to better downstream evaluation results. Thorough experiments yield improved prior-art baselines which our method still surpasses; e.g., our overall improvement with MSN ViT-B/16 is 3.9 p.p. for 1-shot learning. These successes have encouraged increasingly advanced SSL techniques (e.g., Grill et al., 2020; Zbontar et al., 2021; He et al., 2022). Perhaps surprisingly, however, a simple and otherwise common idea has received limited consideration: ensembling. Ensembling combines predictions from multiple trained models and has proven effective at improving model accuracy (Hansen & Salamon, 1990; Perrone & Cooper, 1992) and capturing predictive uncertainty in supervised learning (Lakshminarayanan et al., 2017; Ovadia et al., 2019). Ensembling in the SSL regime is nuanced, however; since the goal is to learn useful representations from unlabeled data, it is less obvious where and how to ensemble. We explore these questions in this work.
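A minimal sketch of a data-dependent weighted cross-entropy over ensemble heads. The specific weighting scheme here (a per-example softmax over head losses) is an illustrative assumption; the paper explores several schemes rather than this exact one:

```python
import numpy as np

def weighted_ensemble_loss(logits, targets, tau=1.0):
    """Data-dependent weighted cross-entropy over ensemble heads.

    logits: (heads, batch, classes); targets: (batch,) int labels.
    Each head's per-example loss is weighted by a softmax over head
    losses for that example; tau > 0 up-weights the worst heads, one
    simple way to push heads toward diverse solutions.
    """
    h, b, c = logits.shape
    # Per-head, per-example cross-entropy (stabilized log-softmax).
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -logp[:, np.arange(b), targets]          # (heads, batch)
    # Data-dependent weights: softmax over heads, per example.
    w = np.exp(tau * ce)
    w = w / w.sum(axis=0, keepdims=True)
    return float((w * ce).sum(axis=0).mean())
```

Since only small projection heads are ensembled (not the backbone), this adds little training cost, matching the efficiency argument in the abstract. With identical heads the weights reduce to uniform averaging, recovering the ordinary cross-entropy.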


The Hidden Uniform Cluster Prior in Self-Supervised Learning

Assran, Mahmoud, Balestriero, Randall, Duval, Quentin, Bordes, Florian, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, Ballas, Nicolas

arXiv.org Artificial Intelligence

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior toward features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.
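Methods like SwAV and MSN enforce the uniform cluster prior via Sinkhorn normalization toward a uniform column marginal; supporting an arbitrary prior amounts to changing that target marginal. A simplified sketch (omitting temperatures and the exact MSN formulation), with the power-law exponent chosen arbitrarily for illustration:

```python
import numpy as np

def sinkhorn_with_prior(scores, cluster_prior, n_iters=50):
    """Sinkhorn normalization of assignment probabilities toward an
    arbitrary cluster marginal (uniform in SwAV/MSN; power-law here).

    scores: (batch, clusters) similarity logits.
    cluster_prior: (clusters,) target cluster marginal, sums to 1.
    """
    q = np.exp(scores)
    q /= q.sum()
    for _ in range(n_iters):
        # Match column marginals to the prior, row marginals to 1/batch.
        q *= (cluster_prior / q.sum(axis=0))[None, :]
        q *= (1.0 / q.shape[0] / q.sum(axis=1))[:, None]
    return q * q.shape[0]   # rows sum to ~1: per-sample assignments

rng = np.random.default_rng(0)
scores = rng.normal(size=(256, 8))
# Power-law prior over cluster sizes instead of the uniform 1/K.
prior = 1.0 / np.arange(1, 9) ** 1.5
prior /= prior.sum()
q = sinkhorn_with_prior(scores, prior)
```

With `cluster_prior` set to `np.full(K, 1/K)` this reduces to the standard uniform-clustering target; a power-law prior instead lets large and small clusters coexist, which is the behavior desired on class-imbalanced data.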


Artificial Intelligence Is Poised to Take More Than Unskilled Jobs

#artificialintelligence

Recently, Microsoft announced that it was terminating dozens of journalists and editorial workers at its Microsoft News and MSN organizations. Instead, the company said, it will rely on artificial intelligence to curate and edit news and content that is presented on MSN.com, inside Microsoft's Edge browser, and in the company's Microsoft News apps. Explaining the decision, Microsoft issued a statement to the Verge. The statement reads: "Like all companies, we evaluate our business on a regular basis. This can result in increased investment in some places and, from time to time, re-deployment in others. These decisions are not the result of the current pandemic."