These embodied agents are typically trainedtabula rasain isolated worlds with limited complexity and diversity. Although highly performant, theyare specialist models that do not generalize beyond a narrowsetoftasks.
Most popular benchmarks for comparing LLMs rely on alimited set ofprompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility ofresults onleaderboards. Manyrecent worksempirically verify prompt sensitivity and advocate for changes in LLM evaluation.
Machine learned models are increasingly entering wider ranges ofdomains inour lives, driving a constantly increasing number of important systems. Large scale systems can be trained in highly parallel and distributed training environments, with a large amount of randomness in training the models.
Problem (1) has been comprehensively investigated in the literature [Duchi et al., 2011, Kingma and Ba, 2015, Loshchilov and Hutter, 2017], and it is well-known that the classical stochastic gradient descent (SGD) achieves a convergence rate of