sbm-transformer
DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging
Cheng, Huimin, Yu, Xiaowei, Wu, Shushan, Fang, Luyang, Cao, Chao, Zhang, Jing, Liu, Tianming, Zhu, Dajiang, Zhong, Wenxuan, Ma, Ping
Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent work such as the SBM-Transformer attempts to incorporate such structure through stochastic binary masking, it suffers from non-differentiability, training instability, and an inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.
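The abstract describes the DCMM term entering self-attention as an additive bias rather than a multiplicative mask. The sketch below illustrates that idea for a single head, assuming per-token membership vectors `pi`, a community affinity matrix `B`, and degree-correction factors `theta`; these names, the log-bias form, and the overall layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dcmm_attention(q, k, v, pi, B, theta):
    """Single-head self-attention with an additive DCMM bias (illustrative sketch).

    q, k, v : (n, d) query/key/value matrices
    pi      : (n, c) mixed-membership vectors (rows sum to 1)
    B       : (c, c) non-negative community affinity matrix
    theta   : (n,)   positive degree-correction factors
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5                    # standard scaled dot-product scores
    affinity = theta[:, None] * (pi @ B @ pi.T) * theta[None, :]   # DCMM edge propensities
    scores = scores + torch.log(affinity + 1e-9)                   # additive, fully differentiable bias
    return F.softmax(scores, dim=-1) @ v

# Toy usage with random tensors.
n, d, c = 6, 8, 3
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
pi = F.softmax(torch.randn(n, c), dim=-1)
B = torch.rand(c, c)
theta = torch.rand(n) + 0.5
out = dcmm_attention(q, k, v, pi, B, theta)                        # (n, d)
```

Because the bias is added to the attention logits rather than multiplied in as a sampled 0/1 mask, gradients reach `pi`, `B`, and `theta` directly, which is the differentiability advantage the abstract claims over binary masking.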
- North America > United States > Texas (0.04)
- North America > United States > Missouri (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost
Cho, Sungjun, Min, Seonwoo, Kim, Jinwoo, Lee, Moontae, Lee, Honglak, Hong, Seunghoon
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention with hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Each attention head then data-adaptively samples a bipartite graph, whose adjacency matrix is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to propagate gradients through the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations on the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation is available at https://github.com/sc782/SBM-Transformer.
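Below is a minimal sketch, not the released implementation linked above, of the two mechanisms the abstract describes: sampling a binary query-key mask from SBM edge probabilities and applying a straight-through estimator so gradients reach those probabilities. For clarity it materializes the dense n-by-n mask, so it does not reproduce the edge-linear cost; names such as `z_q`, `z_k`, and `B` are illustrative.

```python
import torch
import torch.nn.functional as F

def sbm_masked_attention(q, k, v, z_q, z_k, B):
    """Single-head attention masked by a graph sampled from a mixed-membership SBM (sketch).

    q, k, v : (n, d) query/key/value matrices
    z_q, z_k: (n, c) soft cluster memberships of queries and keys
    B       : (c, c) cluster-to-cluster connection probabilities in [0, 1]
    """
    # Edge probabilities of the bipartite query-key graph under the SBM.
    probs = (z_q @ B @ z_k.transpose(-2, -1)).clamp(0.0, 1.0)      # (n, n)

    # Discrete sample; the straight-through trick keeps the hard 0/1 values
    # in the forward pass while routing gradients to the probabilities.
    hard = torch.bernoulli(probs)
    mask = hard + probs - probs.detach()

    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(hard == 0, float("-inf"))          # attend only along sampled edges
    attn = torch.nan_to_num(F.softmax(scores, dim=-1))             # rows with no sampled edges -> zeros
    return (attn * mask) @ v                                       # multiplying by mask lets gradients reach probs

# Toy usage with random tensors.
n, d, c = 6, 8, 3
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
z_q = F.softmax(torch.randn(n, c), dim=-1)
z_k = F.softmax(torch.randn(n, c), dim=-1)
B = torch.rand(c, c)
out = sbm_masked_attention(q, k, v, z_q, z_k, B)                   # (n, d)
```

In the paper's setting, the cost savings come from computing scores only over the sampled edges rather than the full score matrix used in this dense illustration.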
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (4 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)