AITopics | patchify stem

ff1418e8cc993fe8abcfe3ce2003e5c5-Supplemental.pdf

Neural Information Processing Systems | Feb-12-2026, 02:07:07 GMT

The table (right) shows 100-epoch results using the best lr and wd values found at 50 epochs. ViT's patchify stem differs from the proposed convolutional stem in the type of convolution used and [...] We investigate these factors next. The focus of this paper is studying the large, positive impact of changing ViT's default [...] We use AdamW for all experiments. Figure 7 shows the results. The table (right) shows 100-epoch results using optimal lr and wd values chosen from the 50-epoch runs.

artificial intelligence, experiment, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Feb-12-2026, 02:07:04 GMT

This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks.

artificial intelligence, machine learning, vit, (19 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)


Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Dec-25-2025, 08:41:56 GMT

In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks.
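As a rough illustration of what a stride-p p×p convolution does to the input, the sketch below applies the standard convolution output-size formula to count the tokens a patchify stem would produce. The function names and the 224×224 input resolution are assumptions chosen for illustration; the snippet above does not state them.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def patchify_token_count(height: int, width: int, p: int = 16) -> int:
    """Number of tokens from a stride-p p x p convolution with no padding.

    Kernel size and stride are both p, so the windows tile the image
    without overlapping.
    """
    return conv_output_size(height, p, p) * conv_output_size(width, p, p)


# Assumed input resolution of 224 x 224 (hypothetical, for illustration only).
print(conv_output_size(224, 16, 16))   # 14 positions per axis
print(patchify_token_count(224, 224))  # 196 tokens in total
```

Under these assumptions a single layer reduces the spatial resolution by a factor of 16 per axis, which is the large-kernel, large-stride behaviour the abstract contrasts with typical convolutional layers.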

electronic proceedings, name change, vit model, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


ff1418e8cc993fe8abcfe3ce2003e5c5-Supplemental.pdf

Neural Information Processing Systems | Aug-19-2025, 02:50:45 GMT

The table (right) shows 100-epoch results using the best lr and wd values found at 50 epochs. ViT's patchify stem differs from the proposed convolutional stem in the type of convolution used and [...] We investigate these factors next. The focus of this paper is studying the large, positive impact of changing ViT's default [...] We use AdamW for all experiments. Figure 7 shows the results. The table (right) shows 100-epoch results using optimal lr and wd values chosen from the 50-epoch runs.

artificial intelligence, experiment, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Aug-19-2025, 02:50:42 GMT

Why is this the case?

artificial intelligence, convolutional stem, machine learning, (16 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)


Early Convolutions Help Transformers See Better

Neural Information Processing Systems | May-27-2025, 07:27:17 GMT

In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks.

better, convolutional stem, vit model, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Jan-19-2025, 15:28:49 GMT

In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks.

convolutional stem, neural network, vit model, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


The Power of Convolution in Vision Transformer

#artificialintelligence | May-27-2022, 00:09:54 GMT

It is well known today that Transformers are not only used for natural language processing but also play a vital role in computer vision applications in the form of vision transformers (ViT). In fact, their SOTA performance has demonstrated time and time again just how powerful they are. However, one major drawback of vision transformers is their reliance on huge amounts of data. Another major drawback is their below-average optimizability. It has been shown that vision transformers are particularly sensitive to the type of optimizer used (Adam vs. AdamW vs. SGD, etc.), the choice of learning hyperparameters, the depth of the network, the training schedule length, and so on. Researchers have indicated that this drawback results from the "patchify stem", the early visual processing layer, which is implemented as a convolution with a large kernel size and a large stride (both 16 by default).
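To make the "large kernel and stride" point concrete, here is a minimal pure-Python sketch of a patchify-style convolution on a single-channel image with one filter. The function name, the toy 4×4 image, and the all-ones 2×2 kernel are hypothetical choices for illustration; a real ViT stem uses p = 16, multiple input channels, and many learned filters.

```python
def patchify_conv(image, kernel, p):
    """Apply one p x p kernel with stride p to a 2D single-channel image.

    Because the stride equals the kernel size, each output value is
    computed from exactly one non-overlapping p x p patch.
    """
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - p + 1, p):
        row = []
        for left in range(0, w - p + 1, p):
            acc = 0.0
            for i in range(p):
                for j in range(p):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Toy 4 x 4 image holding the values 0..15, and an all-ones 2 x 2 kernel.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 1.0], [1.0, 1.0]]
tokens = patchify_conv(image, kernel, p=2)
print(tokens)  # [[10.0, 18.0], [42.0, 50.0]]
```

Each pixel contributes to exactly one output value, so neighbouring patches share no information inside this layer.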

convolution, patchify stem, vision transformer, (7 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Vision (1.00)
