ieee cvpr
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
- Asia > China > Fujian Province > Xiamen (0.04)
- North America > Canada (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Vision (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Sensing and Signal Processing > Image Processing (0.69)
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N²C) to O(NnC) with n ≪ N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
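To make the complexity claim in this abstract concrete, the following sketch counts query-key interaction terms for dense attention (N × N × C) and for attention against n pooled tokens (N × n × C). The values N = 196 (a 14 × 14 patch grid) and C = 192 are assumptions chosen to resemble a DeiT-Tiny setting, and n = 16 is a purely hypothetical number of visual-contrast tokens; none of these figures are taken from the abstract.

```python
def interaction_cost(num_queries: int, num_keys: int, channels: int) -> int:
    """Count multiply-accumulate terms in a query-key product."""
    return num_queries * num_keys * channels


# Assumed values for illustration only (not stated in the abstract).
N = 196   # tokens: 14 x 14 patch grid
C = 192   # channel width
n = 16    # hypothetical number of pooled visual-contrast tokens

dense_cost = interaction_cost(N, N, C)      # every token against every token
contrast_cost = interaction_cost(N, n, C)   # every token against n pooled tokens
reduction = dense_cost / contrast_cost      # equals N / n

print(dense_cost, contrast_cost, reduction)
```

Under these assumed values the reduction factor is simply N / n, which is why the saving grows with image resolution when n stays fixed.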
- North America > United States (0.15)
- Asia > China (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Recursive Context Propagation Network for Semantic Scene Labeling
Abhishek Sharma, Oncel Tuzel, Ming-Yu Liu
We propose a deep feed-forward neural network architecture for pixel-wise semantic scene labeling. It uses a novel recursive neural network architecture for context propagation, referred to as rCPN. It first maps the local visual features into a semantic space followed by a bottom-up aggregation of local information into a global representation of the entire image. Then a top-down propagation of the aggregated information takes place that enhances the contextual information of each local feature. Therefore, the information from every location in the image is propagated to every other location. Experimental results on the Stanford Background and SIFT Flow datasets show that the proposed method outperforms previous approaches. It is also orders of magnitude faster than previous methods and takes only 0.07 seconds on a GPU for pixel-wise labeling of a 256 × 256 image starting from raw RGB pixel values, given the super-pixel mask, which takes an additional 0.3 seconds to compute using an off-the-shelf implementation.
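The bottom-up and top-down passes described in this abstract can be illustrated on a small binary tree. This is a toy sketch only: the learned networks of rCPN are replaced here by plain averaging, and the names `Node`, `bottom_up`, and `top_down` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    feature: Vector = field(default_factory=list)  # local or aggregated feature
    context: Vector = field(default_factory=list)  # context-enhanced feature


def average(a: Vector, b: Vector) -> Vector:
    # Stand-in for a learned combiner (assumption: the real model uses networks).
    return [(x + y) / 2 for x, y in zip(a, b)]


def bottom_up(node: Node) -> Vector:
    """Aggregate leaf features upward until the root summarises the whole tree."""
    if node.left is None or node.right is None:
        return node.feature
    node.feature = average(bottom_up(node.left), bottom_up(node.right))
    return node.feature


def top_down(node: Node, parent_context: Vector) -> None:
    """Push aggregated context back down so each leaf sees global information."""
    node.context = average(node.feature, parent_context)
    if node.left is not None and node.right is not None:
        top_down(node.left, node.context)
        top_down(node.right, node.context)


# Three leaves standing in for super-pixel features.
a, b, c = Node(feature=[0.0]), Node(feature=[4.0]), Node(feature=[8.0])
root = Node(left=Node(left=a, right=b), right=c)

bottom_up(root)
top_down(root, root.feature)
print(a.context, b.context, c.context)
```

After both passes, each leaf's `context` depends on every other leaf's feature, which is the property the abstract highlights.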
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
Prior2Posterior: Model Prior Correction for Long-Tailed Learning
S Divakar Bhat, Amit More, Mudit Soni, Surbhi Agrawal
Learning-based solutions for long-tailed recognition face difficulties in generalizing to balanced test datasets. Due to the imbalanced data prior, the learned a posteriori distribution is biased toward the most frequent (head) classes, leading to inferior performance on the least frequent (tail) classes. In general, performance can be improved by removing this bias, that is, by eliminating the effect of the imbalanced prior modeled using the number of samples per class (class frequencies). We first observe that the effective prior on the classes, learned by the model at the end of training, can differ from the empirical prior obtained from class frequencies. Thus, we propose a novel approach to accurately model the effective prior of a trained model using a posteriori probabilities. We propose to correct the imbalanced prior by adjusting the predicted a posteriori probabilities (Prior2Posterior: P2P) using the calculated prior in a post-hoc manner after training, and show that this can improve model performance. We present a theoretical analysis showing the optimality of our approach for models trained with naive cross-entropy loss as well as logit-adjusted loss. Our experiments show that the proposed approach achieves a new state of the art (SOTA) on several benchmark datasets from the long-tail literature in the category of logit adjustment methods. Further, the proposed approach can be used to inspect any existing method to capture the effective prior and remove any residual bias to improve its performance, post-hoc, without model retraining. We also show that by using the proposed post-hoc approach, the performance of many existing methods can be improved further.
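One way to picture the post-hoc correction described in this abstract is sketched below. It assumes the effective prior is estimated as the mean of the model's predicted class probabilities over a set of samples, and that the correction divides each prediction by that prior and renormalises. Both steps are illustrative assumptions, and the function names are hypothetical; the paper's exact estimator and adjustment may differ.

```python
from typing import List

Probabilities = List[float]


def estimate_effective_prior(posteriors: List[Probabilities]) -> Probabilities:
    """Average predicted class probabilities over a set of samples (assumed estimator)."""
    num_samples = len(posteriors)
    num_classes = len(posteriors[0])
    return [sum(p[k] for p in posteriors) / num_samples for k in range(num_classes)]


def correct_posterior(posterior: Probabilities, prior: Probabilities,
                      eps: float = 1e-12) -> Probabilities:
    """Divide out the prior and renormalise so the result sums to one."""
    scores = [p / max(q, eps) for p, q in zip(posterior, prior)]
    total = sum(scores)
    return [s / total for s in scores]


# Toy two-class predictions that lean toward class 0 (the "head" class).
predictions = [[0.7, 0.3], [0.9, 0.1], [0.8, 0.2]]
prior = estimate_effective_prior(predictions)
corrected = correct_posterior(predictions[0], prior)
print(prior, corrected)
```

In this toy case the first prediction favours class 0 before correction and class 1 after it, because the estimated prior already accounts for most of the model's preference for the head class. No retraining is involved; only the stored predictions are adjusted.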
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)