AITopics | convolutional stem

convolutional stem

ff1418e8cc993fe8abcfe3ce2003e5c5-Supplemental.pdf

Neural Information Processing Systems | Feb-12-2026, 02:07:07 GMT

The table (right) shows 100-epoch results using the best lr and wd values found at 50 epochs. ViT's patchify stem differs from the proposed convolutional stem in the type of convolution used and … We investigate these factors next. The focus of this paper is studying the large, positive impact of changing ViT's default … We use AdamW for all experiments. Figure 7 shows the results. The table (right) shows 100-epoch results using optimal lr and wd values chosen from the 50-epoch runs.

artificial intelligence, experiment, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Feb-12-2026, 02:07:04 GMT

This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks.

artificial intelligence, machine learning, vit, (19 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Dec-25-2025, 08:41:56 GMT

ViT models are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks.
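To make the geometry of the patchify stem concrete, the following minimal sketch computes how many tokens a stride-p p×p convolution produces and compares that with a stack of small strided convolutions reaching the same resolution. The 224-pixel input, the four 3×3 stride-2 layers with padding 1, and all function names are illustrative assumptions, not details taken from the excerpt above.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length along one axis for a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def patchify_tokens(image_size=224, p=16):
    """Token count for a patchify stem: one stride-p, p x p convolution."""
    side = conv_out(image_size, kernel=p, stride=p)
    return side * side


def stacked_stem_side(image_size=224, num_layers=4):
    """Output side length for an assumed stack of 3x3, stride-2, padding-1 convolutions."""
    side = image_size
    for _ in range(num_layers):
        side = conv_out(side, kernel=3, stride=2, padding=1)
    return side


if __name__ == "__main__":
    print("patchify tokens:", patchify_tokens())        # 14 x 14 grid
    print("stacked stem side:", stacked_stem_side())    # also 14
```

Under these assumptions both stems hand the transformer a 14×14 grid of tokens; the difference lies in how the 16× downsampling is reached, in one large step or in several small ones.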

electronic proceedings, name change, vit model, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Early Convolutions Help Transformers See Better

Neural Information Processing Systems | Aug-19-2025, 02:50:42 GMT

Why is this the case?

artificial intelligence, convolutional stem, machine learning, (16 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Peng, ShengYun, Lee, Seongmin, Wang, Xiaojing, Balasubramaniyan, Rajarajeswari, Chau, Duen Horng

arXiv.org Artificial Intelligence | Nov-9-2023

Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/tsr-convstem to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
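The trade-off between receptive field and sequence length can be illustrated with standard convolution arithmetic. This is a rough sketch, not the authors' code: it assumes "RF ratio" means the receptive field divided by the image side length (the abstract does not define it), and the layer stacks, the 448-pixel image, and the function names are hypothetical.

```python
def receptive_field(layers):
    """Return (receptive field, total stride) for a stack of (kernel, stride) convolutions."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (kernel - 1) input-space steps
        jump *= stride             # spacing between adjacent outputs, measured in input pixels
    return rf, jump


def stem_summary(layers, image_size):
    """Assumed definitions: RF ratio = rf / image_size,
    sequence length = (image_size // total stride) ** 2 (image side divisible by the stride)."""
    rf, total_stride = receptive_field(layers)
    side = image_size // total_stride
    return {"rf": rf, "rf_ratio": rf / image_size, "seq_len": side * side}


if __name__ == "__main__":
    image_size = 448  # hypothetical table-image side length
    print("4 layers:", stem_summary([(3, 2)] * 4, image_size))
    print("3 layers:", stem_summary([(3, 2)] * 3, image_size))
```

In this toy setting, dropping one stride-2 layer quadruples the sequence length while roughly halving the receptive field, which is the kind of tension between "seeing" enough of the table and keeping enough context that the abstract describes.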

encoder, transformer, visual encoder, (13 more...)

arXiv.org Artificial Intelligence

2311.05565

Country: Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Ando, Angelika, Gidaris, Spyros, Bursuc, Andrei, Puy, Gilles, Boulch, Alexandre, Marlet, Renaud

arXiv.org Artificial Intelligence | Apr-25-2023

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs, but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate for ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. The code is available at https://github.com/valeoai/rangevit.

artificial intelligence, machine learning, point cloud, (16 more...)

arXiv.org Artificial Intelligence

2301.10222

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
Europe > Germany (0.04)
Asia > Singapore (0.04)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (0.65)
Automobiles & Trucks (0.50)
Information Technology > Robotics & Automation (0.41)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

The power of Convolution in Vision Transformer

#artificialintelligence | May-27-2022, 00:09:54 GMT

It is well known today that Transformers are used not only for natural language processing but also play a vital role in computer vision applications in the form of vision transformers (ViTs). In fact, their SOTA performance has demonstrated time and again just how powerful they are. However, one major drawback of vision transformers is their reliance on huge amounts of data. Another major drawback is their below-average optimizability. Vision transformers have been shown to be very sensitive to the type of optimizer used (Adam vs. AdamW vs. SGD, etc.), the choice of learning hyperparameters, the depth of the network, the training schedule length, and so on. Researchers have indicated that this drawback results from the "patchify stem", the early visual processing layer, which is implemented with a large kernel and stride (16 by default).

convolution, patchify stem, vision transformer, (7 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Vision (1.00)