We Don't Need No Adam, All We Need Is EVE: On The Variance of Dual Learning Rate And Beyond
arXiv.org Artificial Intelligence
Deep learning has become a pivotal technology across domains including natural language processing, computer vision, speech recognition, and medical diagnostics [1, 2]. Deep neural networks (DNNs), characterized by multiple hidden layers, have shown unparalleled success in learning complex patterns from large-scale data. However, training these models requires tuning millions or even billions of parameters, which presents significant optimisation challenges [3-7]. A large body of research has focused on optimisation techniques that improve the convergence speed, stability, and generalisation capability of deep models. Conventional techniques such as Stochastic Gradient Descent (SGD) [8] and its variants, including Momentum [9], Adagrad [10], RMSprop [11], and Adam [12], have been widely used.
Aug-21-2023
- Country:
  - North America
    - United States > California
      - San Diego County > San Diego (0.04)
    - Canada > Ontario
      - Toronto (0.14)
  - Europe
    - Russia (0.04)
    - France (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
  - Asia
    - Russia (0.04)
    - Middle East > Lebanon (0.04)
  - Africa > Middle East
    - Tunisia > Ben Arous Governorate > Ben Arous (0.04)
- Genre:
  - Research Report (0.64)
- Industry:
  - Health & Medicine (0.48)
- Technology: