Collaborating Authors

 Yu, Ruiji


Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training

arXiv.org Artificial Intelligence

Vision Mamba (e.g., Vim) has been successfully integrated into computer vision, and token reduction has yielded promising outcomes in Vision Transformers (ViTs). However, token reduction performs less effectively on Vision Mamba than on ViTs. Pruning informative tokens in Mamba causes a severe loss of key knowledge and degraded performance, making pruning a poor choice for improving Mamba's efficiency. Token merging, which preserves more token information than pruning, has demonstrated commendable performance in ViTs. Nevertheless, the performance of vanilla merging also degrades as the reduction ratio increases, failing to preserve the key knowledge in Mamba. Re-training the token-reduced model restores this performance by effectively rebuilding the key knowledge. Empirically, pruned Vims drop at most 0.9% accuracy on ImageNet-1K after recovery with our proposed framework R-MeeTo in our main evaluation. We show how simply and effectively fast recovery can be achieved at the minute level; in particular, we observe a 35.9% accuracy gain over 3 epochs of training on Vim-Ti. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy while achieving a 1.2x (up to 1.5x) inference speed-up.
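The token merging the abstract refers to can be illustrated with a minimal ToMe-style bipartite-matching sketch: tokens are split into two alternating sets, the most similar cross-set pairs are found by cosine similarity, and the top-r pairs are averaged together. This is a generic illustration of token merging, not the paper's exact R-MeeTo procedure; `merge_tokens` and its parameters are hypothetical names.

```python
import numpy as np

def merge_tokens(tokens, r):
    """Merge the r most similar token pairs by averaging.

    ToMe-style bipartite-matching sketch (hypothetical helper; not the
    exact R-MeeTo procedure). tokens: (N, d) array; returns (N - r, d).
    """
    a, b = tokens[0::2], tokens[1::2]                 # two alternating token sets
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                                   # cosine similarity, (|a|, |b|)
    best_b = sim.argmax(axis=1)                       # best partner in b for each a-token
    best_s = sim.max(axis=1)
    order = np.argsort(-best_s)                       # most similar a-tokens first
    merge_idx, keep_idx = order[:r], order[r:]
    merged_b = b.copy()
    for i in merge_idx:                               # fold each merged a-token into its partner
        j = best_b[i]
        merged_b[j] = (merged_b[j] + a[i]) / 2
    return np.concatenate([a[keep_idx], merged_b], axis=0)
```

After a reduction like this, the abstract's point is that a few epochs of re-training suffice to rebuild the knowledge the merged tokens carried.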


Pursuing Feature Separation based on Neural Collapse for Out-of-Distribution Detection

arXiv.org Artificial Intelligence

In the open world, deep neural networks (DNNs) encounter a diverse range of input images, including in-distribution (ID) data that shares the same distribution as the training data, and out-of-distribution (OOD) data, whose labels are disjoint from those of the ID cases. Given this complex input environment, a reliable network system must not only provide accurate predictions for ID data but also recognize unseen OOD data. This necessity gives rise to the critical problem of OOD detection [3, 31], which has garnered significant attention in recent years, particularly in safety-critical applications. A rich line of studies detects OOD samples by exploring the differences between ID and OOD data in terms of model outputs [13, 33], features [43, 57, 44], or gradients [15, 50]. However, it has been observed that models trained solely on ID data can make over-confident predictions on OOD data, and the features of OOD data can intermingle with those of ID data [13, 44]. To develop more effective detection algorithms, a category of works focuses on the utilization of auxiliary OOD datasets, which can significantly improve detection performance on unseen OOD data. One classical method, called Outlier Exposure (OE, [14]), employs a cross-entropy loss between the outputs of OOD data and uniformly distributed labels to fine-tune the model. Additionally, Energy [33] proposes using the energy function as its training loss and designs an energy gap between ID and OOD data. Building on these proposed losses, recent works have concentrated on improving the quality of auxiliary OOD datasets through data augmentation [48, 49, 55] or data sampling [35, 5, 19] algorithms to achieve better detection performance.
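The Outlier Exposure objective mentioned above can be sketched in a few lines: standard cross-entropy on ID data, plus cross-entropy between the model's OOD outputs and a uniform label distribution. This is a minimal numpy illustration of the loss shape; `oe_loss`, `lam`, and the toy logits are assumptions for the example, not OE's reference implementation.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the class axis
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def oe_loss(id_logits, id_labels, ood_logits, lam=0.5):
    """Outlier Exposure objective sketch (hypothetical helper).

    CE on ID data + lam * CE of OOD outputs against uniform labels;
    the uniform-target CE reduces to the mean of -log p(c | x_ood).
    """
    id_lsm = log_softmax(id_logits)
    ce_id = -id_lsm[np.arange(len(id_labels)), id_labels].mean()
    ce_ood = -log_softmax(ood_logits).mean()   # uniform-target cross-entropy
    return ce_id + lam * ce_ood
```

When the model is maximally uncertain on OOD inputs (uniform logits over K classes), the OOD term attains its minimum of log K, which is exactly the behavior OE fine-tuning pushes toward.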