contribution ratio
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Asia > China > Hong Kong (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider Transformer Models
Xin Lu, Yanyan Zhao, Bing Qin
Pre-trained language models have been proven to possess strong base capabilities: they not only excel at in-distribution language modeling but also show powerful abilities in out-of-distribution language modeling, transfer learning, and few-shot learning. Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on them. Specifically, our concern is: how does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found that the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities. FFN-Wider Transformers reduce the contribution ratio of this combination function, leading to a decline in base capabilities. We confirmed this through experiments and proposed a Combination Enhancement Architecture (CEA) to address the decline in base capabilities of such models. Significantly, we extended our explanation and CEA to Mixture of Experts (MoE) Transformers, which also alleviated their decline in base capabilities to some extent, showing that our work can offer useful guidance for architecture analysis, architecture improvement, and architecture design.
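The abstract does not define how the contribution ratio is measured, so the following is only a minimal PyTorch sketch of one plausible proxy: the share of the attention sublayer's output norm relative to the combined attention and FFN output norms in a Transformer block. The `Block` module, the norm-based proxy, and the `ffn_mult` comparison are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical proxy for the "contribution ratio" of Multi-Head Attention:
# the share of the attention sublayer's output norm relative to the combined
# attention and FFN output norms. NOT the paper's actual metric.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4, ffn_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)   # combination across positions
        x = x + attn_out
        ffn_out = self.ffn(self.ln2(x))    # position-wise transformation
        x = x + ffn_out
        a, f = attn_out.norm(), ffn_out.norm()
        return x, (a / (a + f)).item()     # attention's share of sublayer output

x = torch.randn(2, 16, 64)                 # (batch, seq_len, d_model)
for mult in (4, 16):                       # standard vs "FFN-Wider" block
    _, ratio = Block(ffn_mult=mult)(x)
    print(f"ffn_mult={mult}: attention contribution ratio ~ {ratio:.2f}")
```

Under this proxy, widening the FFN tends to shrink attention's share of the sublayer outputs, which is the qualitative effect the abstract attributes to FFN-Wider architectures.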
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
Dataset Structural Index: Leveraging a machine's perspective towards visual data
But when it came to visual datasets, the field immediately stepped towards the algorithmic side. One fundamental reason was the amount of information that needs to be extracted from an image. With the introduction of convolutional networks and transfer learning [1], [2], [3], however, it became possible to convert an image or a visual object into a feature vector without losing too much information about the entity under concern. This defined a way to use feature maps to compare and distinguish one visual object from another [4]. There has been a lot of work using these feature-vector conversions in systems like content-based image retrieval [5], and in using feature vectors as representations of different scenarios [6], [7]. It is critical to understand that there is a difference between the way a machine looks at the data and the way we do: there are scenarios in which the interpretation through features differs from human interpretation. The Dataset Structural Index (DSI) is there to bridge that gap and understand the machine's perspective before molding it to shape better architectures and, in turn, better model performance. I think two concepts could be linked together to understand a machine's viewpoint while working with visual data. A minimal sketch of the feature-vector conversion follows.
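As a concrete illustration of the conversion described above, here is a short sketch assuming torchvision's pretrained ResNet-18 as a stand-in for the transfer-learning backbones cited as [1]-[3]: it maps images to normalized feature vectors and compares them by cosine similarity, the basic operation behind content-based retrieval systems like [5]. The file names are placeholders.

```python
# Sketch: image -> feature vector via a pretrained CNN, then cosine similarity.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.fc = torch.nn.Identity()     # drop the classifier head; keep 512-d features
model.eval()
preprocess = weights.transforms()  # the resize/crop/normalize pipeline the weights expect

def embed(path: str) -> torch.Tensor:
    """Map an image file to an L2-normalized 512-d feature vector."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = model(img).squeeze(0)
    return feat / feat.norm()

# Higher cosine similarity means the two images are "closer" from the
# machine's perspective. 'query.jpg' and 'candidate.jpg' are placeholders.
sim = embed("query.jpg") @ embed("candidate.jpg")
print(f"cosine similarity: {sim.item():.3f}")
```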
- Europe > United Kingdom > England > Staffordshire (0.04)
- Oceania > New Zealand > South Island > Marlborough District > Blenheim (0.04)
- North America > United States > Virginia (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.75)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models
Makoto Aoshima, Kazuyoshi Yata
We consider classifiers for high-dimensional data under the strongly spiked eigenvalue (SSE) model. We first show that high-dimensional data often follow the SSE model. We then consider a distance-based classifier using eigenstructures for the SSE model, applying the noise-reduction methodology to the estimation of the eigenvalues and eigenvectors in the SSE model. We create a new distance-based classifier by transforming data from the SSE model to the non-SSE model. We present simulation studies and discuss the performance of the new classifier. Finally, we demonstrate the new classifier on microarray data sets.
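To make the notion of a distance-based classifier concrete, here is a minimal NumPy sketch of the generic family such work builds on: assign each test point to the class with the smaller bias-corrected squared Euclidean distance to the class mean. The SSE-specific data transformation and the noise-reduction eigenvalue estimation that the paper contributes are not reproduced here.

```python
# Generic two-class distance-based classifier for high dimension, low
# sample size (p >> n). Illustrative baseline only, not the paper's method.
import numpy as np

def distance_classifier(X1, X2, X_test):
    """X1, X2: (n_i, p) training samples for classes 1 and 2; X_test: (m, p)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Bias correction: the squared distance to an estimated mean inflates by
    # tr(Sigma_i)/n_i, which is non-negligible when p >> n_i.
    c1 = np.trace(np.cov(X1, rowvar=False)) / len(X1)
    c2 = np.trace(np.cov(X2, rowvar=False)) / len(X2)
    d1 = ((X_test - mu1) ** 2).sum(axis=1) - c1
    d2 = ((X_test - mu2) ** 2).sum(axis=1) - c2
    return np.where(d1 < d2, 1, 2)

rng = np.random.default_rng(0)
p = 1000                                   # high dimension, small samples
X1 = rng.normal(0.0, 1.0, (20, p))
X2 = rng.normal(0.3, 1.0, (20, p))
X_test = np.vstack([rng.normal(0.0, 1.0, (10, p)),
                    rng.normal(0.3, 1.0, (10, p))])
print(distance_classifier(X1, X2, X_test))  # expect mostly 1s, then mostly 2s
```

Classifiers of this type degrade when a few eigenvalues of the covariance dominate; handling that strongly spiked case is precisely what the paper's transformation from the SSE to the non-SSE model addresses.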