Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model

Neural Information Processing Systems

By replacing the linear operation with our approximated one in transformers, we can achieve up to 2.7× peak memory reduction with almost no accuracy drop and enable up to a 6.4× larger batch size.


Appendix A: Extended Related Work and Discussion

Neural Information Processing Systems

Although these methods are "parameter-efficient", they actually cannot reduce the memory usage of activations. … The solution to the above convex problem is the distribution defined in Equation (3). … Var[f(j) | j ∈ D \ C] (Equation (16)). By combining the above two inequalities, we have … Algorithm 2 (for ease of illustration, the sequence length is ignored; a cache is used for saving the norm of the output gradient Z). … E.2 More Experimental Speed Analysis: in Table 3, "Fwd", "Bwd", and "F-B" are the latency (ms) of the forward pass, the backward pass, and the combined forward-backward pass. We give the detailed hyper-parameter settings in this section; the computational infrastructure information is given in Table 4.
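For context, the distribution that this excerpt's Equation (3) refers to appears to be the standard minimum-variance choice in column-row sampling for approximate matrix multiplication: each column-row pair is kept with probability proportional to the product of the corresponding column and row norms. The sketch below states that standard result with generic matrices A and B and a pair budget k; the exact notation of the paper's Equation (3) is an assumption here.

```latex
% Standard minimum-variance column-row sampling for AB
% (notation A, B, k assumed for illustration):
p_i = \frac{\lVert A_{:,i}\rVert_2 \,\lVert B_{i,:}\rVert_2}
           {\sum_{j}\lVert A_{:,j}\rVert_2 \,\lVert B_{j,:}\rVert_2},
\qquad
AB \approx \sum_{t=1}^{k} \frac{1}{k\,p_{i_t}}\, A_{:,i_t}\, B_{i_t,:},
\quad i_t \sim p .
```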



Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model

Zirui Liu, Guanchu Wang, Shaochen Zhong, Zhaozhuo Xu, Daochen Zha, Ruixiang Tang, Zhimeng Jiang, Kaixiong Zhou, Vipin Chaudhary, Shuai Xu, Xia Hu

arXiv.org Artificial Intelligence

With the rapid growth in model size, fine-tuning large pre-trained language models has become increasingly difficult due to their extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, neural networks are usually trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased and has reasonable variance. Following this motivation, we propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance, which only requires storing the sub-sampled activations for calculating the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones. By replacing the linear operation with our approximated one in transformers, we achieve up to 2.7× peak memory reduction with almost no accuracy drop and enable up to a 6.4× larger batch size. Under the same hardware, WTA-CRS enables better downstream task performance by applying larger models and/or training faster with larger batch sizes.
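To make the abstract's idea concrete, here is a minimal NumPy sketch of the winner-take-all column-row sampling intuition: the highest-probability column-row pairs are kept deterministically, and the remaining budget is spent sampling the complement with importance weights so the product estimate stays unbiased. The function name wta_crs_matmul, the c_frac split between the deterministic and sampled budgets, and the standalone-matmul framing are assumptions for illustration; the paper applies its estimator inside transformer linear layers so that only sub-sampled activations need to be cached, which is where the memory saving comes from.

```python
import numpy as np

def wta_crs_matmul(A, B, k, c_frac=0.5, rng=None):
    """Approximate A @ B with a budget of k column-row pairs.

    Illustrative sketch (not the paper's exact interface): the top
    pairs by probability form the deterministic "winner" set C, and
    the rest of the budget samples from the complement D \\ C with
    renormalized probabilities, reweighted to keep the estimate unbiased.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]

    # Standard CRS pair probabilities: p_i proportional to ||A[:, i]|| * ||B[i, :]||.
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()

    c = int(c_frac * k)
    top = np.argsort(p)[::-1][:c]            # winner-take-all set C
    rest = np.setdiff1d(np.arange(n), top)   # complement D \ C

    out = A[:, top] @ B[top, :]              # exact contribution of C

    m = k - c
    if m > 0 and rest.size > 0:
        q = p[rest] / p[rest].sum()          # renormalized probabilities on D \ C
        picks = rng.choice(rest.size, size=m, replace=True, p=q)
        idx = rest[picks]
        # Importance-sampling weights 1 / (m * q_i) keep E[out] = A @ B.
        w = 1.0 / (m * q[picks])
        out += (A[:, idx] * w) @ B[idx, :]
    return out


# Quick check on random matrices: the relative error shrinks as the
# pair budget k grows toward the full inner dimension.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 512))
B = rng.standard_normal((512, 64))
exact = A @ B
approx = wta_crs_matmul(A, B, k=256, rng=rng)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

Keeping the largest-probability pairs deterministically removes their contribution from the random part of the estimator, which is why this family can achieve lower variance than plain column-row sampling at the same budget.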