language modeling
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
- North America > United States > New York (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (5 more...)
- Leisure & Entertainment (0.68)
- Media (0.46)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Symbolic Discovery of Optimization Algorithms Xiangning Chen 1 2 Chen Liang 1 Da Huang 1 Esteban Real
It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (7 more...)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (2 more...)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- (11 more...)
Decoupled Context Processing for Context Augmented Language Modeling Zonglin Li
Language models can be augmented with a context retriever to incorporate knowledge from large external databases. By leveraging retrieved context, the neural network does not have to memorize the massive amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and mod-ularity.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Data Science (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider and MoE Transformers
Pre-trained language models have been proven to possess strong base capabilities, which not only excel in in-distribution language modeling but also show powerful abilities in out-of-distribution language modeling, transfer learning and few-shot learning. Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities.