AITopics | rethinking transformer

Supplementary Materials for NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning

Neural Information Processing Systems | Feb-17-2026, 00:38:15 GMT

Right: Normalized attention scores processed by two different normalization methods. Table 1: Performance of searched architectures using different NAS algorithms in DARTS [7] space on CIFAR-10 [5]. The inference latency was measured on a machine with GeForce RTX 3090 GPU. The batch size was set to 1.

Model          Encode (ms)   Infer (ms)   Total (ms)
NAR-Former     2.4784        17.4864      19.9648
NAR-Former V2  2.3722        5.2276       7.5998

[...] may be somewhat different. Due to the softmax, Eq. (5) focuses almost all attention on the current [...] Eq. (2) restricts attention to connected nodes by introducing the adjacency matrix.
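The snippet above says that attention is restricted to connected nodes through the adjacency matrix. The exact form of Eq. (2) is not visible in this excerpt, so the following is only a minimal sketch of a generic adjacency-masked softmax, not the paper's formulation. The function name `masked_attention_weights` and the toy graph are hypothetical, chosen for illustration.

```python
import math


def masked_attention_weights(scores, adjacency):
    """Row-wise softmax over `scores`, keeping only entries where `adjacency` is nonzero.

    Assumption: this is a generic masking scheme, not the paper's Eq. (2).
    """
    weights = []
    for score_row, adj_row in zip(scores, adjacency):
        allowed = [s for s, a in zip(score_row, adj_row) if a]
        if not allowed:
            # A node with no allowed neighbours attends to nothing.
            weights.append([0.0] * len(score_row))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if a else 0.0
                for s, a in zip(score_row, adj_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 3-node chain 0 -> 1 -> 2; each node also sees itself.
adjacency = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]
# Arbitrary raw attention scores (illustrative values only).
scores = [
    [2.0, 1.0, 0.5],
    [0.3, 0.3, 4.0],
    [1.0, 0.0, 0.0],
]
weights = masked_attention_weights(scores, adjacency)
```

In this toy case the large score 4.0 in the second row is ignored because node 1 is not connected to node 2, so the remaining two equal scores share the weight evenly.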

artificial intelligence, machine learning, prediction, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning

Neural Information Processing Systems | Dec-26-2025, 17:46:12 GMT

As more deep learning models are being applied in real-world applications, there is a growing need for modeling and learning the representations of neural networks themselves. An effective representation can be used to predict target attributes of networks without the need for actual training and deployment procedures, facilitating efficient network design and deployment. Recently, inspired by the success of Transformer, some Transformer-based representation learning frameworks have been proposed and achieved promising performance in handling cell-structured models. However, graph neural network (GNN) based approaches still dominate the field of learning representation for the entire network. In this paper, we revisit the Transformer and compare it with GNN to analyze their different architectural characteristics. We then propose a modified Transformer-based universal neural network representation learning model NAR-Former V2.

representation, rethinking transformer, transformer, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

Neural Information Processing Systems | Dec-25-2025, 00:48:25 GMT

Can Transformer perform 2D object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain 42.0 box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS. Code and pre-trained models are available at https://github.com/hustvl/YOLOS.

only look, rethinking transformer, transformer, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.87)

CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification

Neural Information Processing Systems | Dec-23-2025, 18:33:16 GMT

Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input with a subset of most relevant labels from millions of label choices. Recent approaches, such as XR-Transformer and LightXML, leverage a transformer instance to achieve state-of-the-art performance. However, in this process, these approaches need to make various trade-offs between performance and computational requirements. A major shortcoming, as compared to the Bi-LSTM based AttentionXML, is that they fail to keep separate feature representations for each resolution in a label tree. We thus propose CascadeXML, an end-to-end multi-resolution learning pipeline, which can harness the multi-layered architecture of a transformer model for attending to different label resolutions with separate feature representations. CascadeXML significantly outperforms all existing approaches with non-trivial gains obtained on benchmark datasets consisting of up to three million labels.

cascadexml, end-to-end multi-resolution training, rethinking transformer, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction

Yan, Bencheng; Lei, Yuejie; Zeng, Zhiyuan; Wang, Di; Lin, Kaiyi; Wang, Pengjie; Xu, Jian; Zheng, Bo

arXiv.org Artificial Intelligence | Nov-18-2025

Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns, a stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields F, not the total vocabulary size n >> F, leading to tighter generalization and, critically, observed power-law scaling in AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity: architectural coherence with data semantics.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.12081

Country: North America > United States (0.16)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis

Neural Information Processing Systems | May-27-2025, 14:01:00 GMT

Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for histopathology WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs Nyströmformer as an approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information.

artificial intelligence, machine learning, transformer, (8 more...)

Neural Information Processing Systems

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (0.40)
Information Technology > Artificial Intelligence > Machine Learning (0.40)
