Wang, Zedong
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
Li, Siyuan, Tian, Juanxi, Wang, Zedong, Zhang, Luyuan, Liu, Zicheng, Jin, Weiyang, Liu, Yang, Sun, Baigui, Li, Stan Z.
This paper delves into the interplay between vision backbones and optimizers, unveiling an inter-dependent phenomenon termed \textit{\textbf{b}ackbone-\textbf{o}ptimizer \textbf{c}oupling \textbf{b}ias} (BOCB). We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with the SGD family, while recent architectures like ViTs and ConvNeXt are tightly coupled with adaptive learning rate optimizers. We further show that BOCB can be introduced by both optimizers and certain backbone designs and may significantly impact the pre-training and downstream fine-tuning of vision models. Through in-depth empirical analysis, we summarize takeaways on recommended optimizers and insights into robust vision backbone architectures. We hope this work can inspire the community to question long-held assumptions about backbones and optimizers, stimulate further exploration, and thereby contribute to more robust vision systems. The source code and models are publicly available at https://bocb-ai.github.io/.
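The kind of study the abstract describes can be pictured as a grid sweep over backbone-optimizer pairs. Below is a minimal, self-contained sketch under illustrative assumptions (toy models, random data, two optimizer families), not the paper's actual benchmark code:

```python
import torch
import torch.nn as nn

# Toy stand-ins; a real study would use actual vision backbones and datasets.
BACKBONES = {
    "cnn_style": lambda: nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                       nn.Flatten(), nn.Linear(8 * 32 * 32, 10)),
    "mlp_style": lambda: nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                                       nn.GELU(), nn.Linear(64, 10)),
}

def make_optimizer(name, params):
    return (torch.optim.SGD(params, lr=0.1, momentum=0.9) if name == "sgd"
            else torch.optim.AdamW(params, lr=1e-3, weight_decay=0.05))

x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))
for arch, factory in BACKBONES.items():
    for opt_name in ("sgd", "adamw"):
        model, loss_fn = factory(), nn.CrossEntropyLoss()
        opt = make_optimizer(opt_name, model.parameters())
        for _ in range(5):                        # tiny training loop
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        # A large spread across optimizers for a fixed backbone signals coupling.
        print(arch, opt_name, float(loss))
```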
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences
Liu, Zicheng, Li, Siyuan, Wang, Li, Wang, Zedong, Liu, Yunfan, Li, Stan Z.
To mitigate the computational complexity of the self-attention mechanism on long sequences, linear attention utilizes computation tricks to achieve linear complexity, while state space models (SSMs) popularize a favorable practice of using a non-data-dependent memory pattern, i.e., emphasizing the near and neglecting the distant, to process sequences. Recent studies have shown the promise of combining the two as one. However, the efficiency of linear attention remains only theoretical in a causal setting, and SSMs require various designed constraints to operate effectively on specific data. Therefore, in order to unveil the true power of the hybrid design, the following two issues need to be addressed: (1) hardware-efficient implementation of linear attention and (2) stabilization of SSMs. To achieve this, we leverage the ideas of tiling and hierarchy to propose CHELA (short-long Convolutions with Hardware-Efficient Linear Attention), which replaces SSMs with short-long convolutions and implements linear attention in a divide-and-conquer manner. This approach enjoys the global abstraction of stable SSMs and the data-dependent selection of linear attention while maintaining true linear complexity. Our comprehensive experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
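The core combination lends itself to a compact sketch. Below is a minimal, non-causal illustration (not the paper's hardware-efficient kernel; the module name, kernel sizes, and feature map are assumptions): a short and a long depthwise convolution provide local/global mixing, and a kernelized linear attention computes phi(Q)(phi(K)^T V) in O(N) rather than softmax's O(N^2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortLongLinearAttention(nn.Module):
    def __init__(self, dim, short_k=3, long_k=63):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.short = nn.Conv1d(dim, dim, short_k, padding=short_k // 2, groups=dim)
        self.long = nn.Conv1d(dim, dim, long_k, padding=long_k // 2, groups=dim)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        conv_in = x.transpose(1, 2)            # depthwise convs over the sequence
        mixed = (self.short(conv_in) + self.long(conv_in)).transpose(1, 2)
        q, k, v = self.qkv(mixed).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1      # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)            # fixed-size summary
        z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        return torch.einsum("bnd,bde,bn->bne", q, kv, z)

out = ShortLongLinearAttention(32)(torch.randn(2, 128, 32))  # -> (2, 128, 32)
```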
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
Li, Siyuan, Wang, Zedong, Liu, Zicheng, Wu, Di, Tan, Cheng, Zheng, Jiangbin, Huang, Yufei, Li, Stan Z.
Similar to natural language models, pre-trained genome language models have been proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where codebooks of varying scales are designed in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of the learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
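A minimal sketch of residual vector quantization, in the spirit of the coarse-to-fine HRQ idea (shapes, level count, and codebook sizes are illustrative assumptions, not the paper's configuration): each level quantizes the residual left by the previous one.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, dim=64, codebook_size=256, levels=3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(levels))

    def forward(self, z):                        # z: (batch, seq_len, dim)
        residual, quantized, codes = z, 0.0, []
        for book in self.codebooks:
            d = torch.cdist(residual, book.weight.unsqueeze(0))  # dist to codes
            idx = d.argmin(dim=-1)               # nearest-code assignment
            q = book(idx)
            quantized = quantized + q
            residual = residual - q              # next level refines what is left
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)

emb = torch.randn(2, 100, 64)                    # e.g., encoded sequence features
q, ids = ResidualVQ()(emb)                       # ids: (2, 100, 3) per-level codes
```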
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory
Liu, Zicheng, Wang, Li, Li, Siyuan, Wang, Zedong, Lin, Haitao, Li, Stan Z.
Transformer models have been successful in various sequence processing tasks, but the self-attention mechanism's computational cost limits their practicality for long sequences. Although existing attention variants improve computational efficiency, they have a limited ability to abstract global information effectively due to their hand-crafted mixing strategies. On the other hand, state-space models (SSMs) are tailored for long sequences but cannot capture complicated local information. Therefore, combining them into a unified token mixer has become a trend in recent long-sequence models. However, linearized attention degrades performance significantly even when equipped with SSMs. To address this issue, we propose a new method called LongVQ. LongVQ uses the vector quantization (VQ) technique to compress the global abstraction into a fixed-length codebook, enabling linear-time computation of the attention matrix. This technique effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependencies. Our experiments on the Long Range Arena benchmark, autoregressive language modeling, and image and speech classification demonstrate the effectiveness of LongVQ. Our model achieves significant improvements over other sequence models, including variants of Transformers, Convolutions, and recent State Space Models.
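The complexity argument can be made concrete with a small sketch (illustrative, not the paper's implementation): attending to a fixed-size learned codebook instead of all N positions makes the attention matrix N x K with K constant, hence linear in sequence length.

```python
import torch
import torch.nn as nn

class CodebookAttention(nn.Module):
    def __init__(self, dim=64, num_codes=64):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))  # global memory
        self.q = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        scores = self.q(x) @ self.codes.t() / x.shape[-1] ** 0.5
        return torch.softmax(scores, -1) @ self.codes   # O(N * K), not O(N^2)

out = CodebookAttention()(torch.randn(2, 4096, 64))  # cost linear in 4096
```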
Switch EMA: A Free Lunch for Better Flatness and Sharpness
Li, Siyuan, Liu, Zicheng, Tian, Juanxi, Wang, Ge, Wang, Zedong, Jin, Weiyang, Wu, Di, Tan, Cheng, Lin, Tao, Liu, Yang, Sun, Baigui, Li, Stan Z.
From both theoretical and empirical aspects, we demonstrate that Switch EMA (SEMA) can help DNNs reach generalization optima that better trade off between flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments with discriminative, generative, and regression tasks on vision and language datasets, including image classification, self-supervised learning, object detection and segmentation, image generation, video prediction, attribute regression, and language modeling. Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training, improving performance and boosting convergence speed.
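One common reading of the Switch EMA idea, sketched minimally below (the epoch-level switching schedule and decay value are assumptions for illustration): maintain an exponential moving average of the weights and periodically copy it back into the live model, so training resumes from the flatter averaged point.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Standard EMA of parameters: e <- decay * e + (1 - decay) * p
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1 - decay)

@torch.no_grad()
def switch(model, ema_model):
    # The "switch": continue optimizing from the EMA weights.
    for p, e in zip(model.parameters(), ema_model.parameters()):
        p.copy_(e)

model = torch.nn.Linear(10, 2)
ema_model = copy.deepcopy(model)
for epoch in range(3):
    for _ in range(100):              # stand-in for the real training loop
        # optimizer.step() would go here
        ema_update(ema_model, model)
    switch(model, ema_model)          # once per epoch in this sketch
```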
Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
Li, Siyuan, Zhang, Luyuan, Wang, Zedong, Wu, Di, Wu, Lirong, Liu, Zicheng, Xia, Jun, Tan, Cheng, Liu, Yang, Sun, Baigui, Li, Stan Z.
As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and low dependence on labeled data. Among these varied self-supervised techniques, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training. This paradigm enables deep models to learn robust representations and has demonstrated exceptional performance in computer vision, natural language processing, and other modalities. In this survey, we present a comprehensive review of the masked modeling framework and its methodology. We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more. Then, we systematically investigate its wide-ranging applications across domains. Furthermore, we also explore the commonalities and differences between masked modeling methods in different fields. Toward the end of this paper, we conclude by discussing the limitations of current techniques and pointing out several potential avenues for advancing masked modeling research. A paper list project accompanying this survey is available at \url{https://github.com/Lupin1998/Awesome-MIM}.
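The recipe the survey covers reduces to a few lines: mask a random subset of tokens or patches and train the model to reconstruct them. A minimal sketch (toy Transformer encoder, illustrative mask ratio and loss; real methods vary in masking strategy and recovery target):

```python
import torch
import torch.nn as nn

dim, mask_ratio = 64, 0.6
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2)
mask_token = nn.Parameter(torch.zeros(dim))

tokens = torch.randn(2, 50, dim)                  # e.g., embedded image patches
mask = torch.rand(2, 50) < mask_ratio             # True = hidden from the model
corrupted = torch.where(mask.unsqueeze(-1), mask_token, tokens)
recon = encoder(corrupted)
loss = ((recon - tokens) ** 2)[mask].mean()       # reconstruct only masked parts
```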
OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning
Tan, Cheng, Li, Siyuan, Gao, Zhangyang, Guan, Wenfei, Wang, Zedong, Liu, Zicheng, Wu, Lirong, Li, Stan Z.
Spatio-temporal predictive learning is a learning paradigm that enables models to learn spatial and temporal patterns by predicting future frames from given past frames in an unsupervised manner. Despite remarkable progress in recent years, a lack of systematic understanding persists due to diverse settings, complex implementations, and difficult reproducibility. Without standardization, comparisons can be unfair and insights inconclusive. To address this dilemma, we propose OpenSTL, a comprehensive benchmark for spatio-temporal predictive learning that categorizes prevalent approaches into recurrent-based and recurrent-free models. OpenSTL provides a modular and extensible framework implementing various state-of-the-art methods. We conduct standard evaluations on datasets across various domains, including synthetic moving object trajectories, human motion, driving scenes, traffic flow, and weather forecasting. Based on our observations, we provide a detailed analysis of how model architecture and dataset properties affect spatio-temporal predictive learning performance. Surprisingly, we find that recurrent-free models achieve a better balance between efficiency and performance than recurrent models. Thus, we further extend the common MetaFormers to boost recurrent-free spatio-temporal predictive learning. We open-source the code and models at https://github.com/chengtan9907/OpenSTL.
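The recurrent-free setup highlighted above can be sketched as a single forward pass: stack past frames along channels and predict all future frames at once. The toy ConvNet below stands in for a MetaFormer-style translator (shapes and architecture are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_in, T_out, C, H, W = 10, 10, 1, 64, 64
model = nn.Sequential(
    nn.Conv2d(T_in * C, 64, 3, padding=1), nn.GELU(),
    nn.Conv2d(64, T_out * C, 3, padding=1))

past = torch.randn(4, T_in, C, H, W)               # e.g., Moving MNIST clips
future = torch.randn(4, T_out, C, H, W)
pred = model(past.flatten(1, 2)).view(4, T_out, C, H, W)
loss = F.mse_loss(pred, future)                    # one shot, no recurrence
```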
SemiReward: A General Reward Model for Semi-supervised Learning
Li, Siyuan, Jin, Weiyang, Wang, Zedong, Wu, Fang, Liu, Zicheng, Tan, Cheng, Li, Stan Z.
Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling. The main challenge is how to distinguish high-quality pseudo labels against confirmation bias. However, existing pseudo-label selection strategies are limited to pre-defined schemes or complex hand-crafted policies specially designed for classification, failing to achieve high-quality labels, fast convergence, and task versatility simultaneously. To this end, we propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and select high-quality pseudo labels, and is pluggable into mainstream SSL methods across a wide range of task types and scenarios. To mitigate confirmation bias, SemiReward is trained online in two stages with a generator model and a subsampling strategy. With classification and regression tasks on 13 standard SSL benchmarks spanning three modalities, extensive experiments verify that SemiReward achieves significant performance gains and faster convergence over Pseudo Label, FlexMatch, and Free/SoftMatch.
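The filtering step described above can be sketched compactly (the reward net, input features, and threshold below are toy assumptions, not the paper's architecture): a small network scores each (feature, pseudo-label) pair, and only high-reward pseudo labels enter training.

```python
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(10 + 10, 32), nn.ReLU(),
                           nn.Linear(32, 1), nn.Sigmoid())

def filter_pseudo_labels(feats, logits, threshold=0.9):
    pseudo = logits.softmax(-1)                       # soft pseudo labels
    score = reward_net(torch.cat([feats, pseudo], dim=-1)).squeeze(-1)
    keep = score > threshold                          # keep only high-reward ones
    return feats[keep], pseudo[keep].argmax(-1)

feats, logits = torch.randn(64, 10), torch.randn(64, 10)
x_kept, y_kept = filter_pseudo_labels(feats, logits)  # subset used for training
```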
Efficient Multi-order Gated Aggregation Network
Li, Siyuan, Wang, Zedong, Liu, Zicheng, Tan, Cheng, Lin, Haitao, Wu, Di, Chen, Zhiyuan, Zheng, Jiangbin, Li, Stan Z.
Since the recent success of Vision Transformers (ViTs), explorations toward ViT-style architectures have triggered the resurgence of ConvNets. In this work, we explore the representation ability of modern ConvNets from a novel view of multi-order game-theoretic interaction, which reflects inter-variable interaction effects w.r.t.~contexts of different scales based on game theory. Within the modern ConvNet framework, we tailor the two feature mixers with conceptually simple yet effective depthwise convolutions to facilitate middle-order information across spatial and channel spaces, respectively. In this light, a new family of pure ConvNet architectures, dubbed MogaNet, is proposed, which shows excellent scalability and attains competitive results among state-of-the-art models with more efficient use of parameters on ImageNet and multifarious typical vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D\&3D human pose estimation, and video prediction. Notably, MogaNet achieves 80.0\% and 87.8\% top-1 accuracy with 5.2M and 181M parameters on ImageNet, outperforming ParC-Net-S and ConvNeXt-L while saving 59\% FLOPs and 17M parameters. The source code is available at \url{https://github.com/Westlake-AI/MogaNet}.
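A minimal sketch of a gated multi-scale depthwise-convolution mixer in the spirit of the spatial aggregation described above (channel split, kernel sizes, and dilation are illustrative assumptions, not MogaNet's exact block):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAggregation(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, 1)
        self.dw5 = nn.Conv2d(dim // 2, dim // 2, 5, padding=2, groups=dim // 2)
        self.dw7 = nn.Conv2d(dim // 2, dim // 2, 7, padding=9, dilation=3,
                             groups=dim // 2)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                  # split channels across scales
        context = torch.cat([self.dw5(a), self.dw7(b)], dim=1)
        return self.proj(F.silu(self.gate(x)) * context)   # gated aggregation

y = GatedAggregation()(torch.randn(2, 64, 32, 32))  # spatial shape preserved
```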