Efficient Exploration for LLMs
Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
arXiv.org Artificial Intelligence
Large language models demonstrate remarkable capabilities after learning from enormous volumes of text data (Anil et al., 2023; Hoffmann et al., 2022; OpenAI, 2023). Yet, reinforcement learning from human feedback (RLHF) greatly improves their behavior even after only tens of thousands of interactions (Bai et al., 2022; Glaese et al., 2022; Ouyang et al., 2022; Stiennon et al., 2020). The uptake of chatbots affords opportunities to gather increasing volumes of human feedback, with each engagement eliciting expressions of satisfaction or preference (OpenAI, 2022). It is natural to wonder what new capabilities may emerge with this growing source of data. Superhuman ingenuity remains an alluring possibility. With increasing volumes, more can be inferred from human feedback.
February 1, 2024