
Collaborating Authors

 llm



Google DeepMind wants to know if chatbots are just virtue signaling

MIT Technology Review

Google DeepMind is calling for the moral behavior of large language models (such as what they do when called on to act as companions, therapists, medical advisors, and so on) to be scrutinized with the same kind of rigor as their ability to code or do math. As LLMs improve, people are asking them to play more and more sensitive roles in their lives. Agents are starting to take actions on people's behalf. LLMs may be able to influence human decision-making. And yet nobody knows how trustworthy this technology really is at such tasks. With coding and math, you have clear-cut, correct answers that you can check, William Isaac, a research scientist at Google DeepMind, told me when I met him and Julia Haas, a fellow research scientist at the firm, for an exclusive preview of their work, which is published in today. That's not the case for moral questions, which typically have a range of acceptable answers: "Morality is an important capability but hard to evaluate," says Isaac. "In the moral domain, there's no right and wrong," adds Haas.




1 Details about the observation formats. Figure 1: Example of the observation of WebShop. The observation of WebShop is simplified based on the text_rich

Neural Information Processing Systems

The observation of WikiHow is represented in exactly the same way as in Zhang et al. [2023].

Table 1: Patterns of WebShop pages

Pattern      Description
search       The page to search for an item
itemlisting  The page listing the search results
item         The information page of a specific item
others       The item description page, item feature page, and review page

The similarity lookup table is defined in Table 2.

Table 2: Lookup table of the page similarity of WebShop

             search  itemlisting  item  others
search       1       0            0     0
itemlisting  0       1            0     0
item         0       0            1     0.3
others       0       0            0.3   1

2.2 Lookup table of the instruction similarity function of WikiHow

Table 3: Patterns of WikiHow instructions

Pattern Name  Pattern Template
search        Search an article to learn . . .

Owing to budget limits, only a subset of 20 tasks is sampled from the full test set. The visualization is available in Figure 2. It can be seen that the performance of R However, there seems to be a saturation in performance, which may be attributed to the limited number of active exemplars and training tasks. The saturation of the average reward comes later than that of the success rate.

Double Q-Learning [van Hasselt, 2010] is commonly used to mitigate over-estimation in lookup-based Q-Learning.
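
As a rough illustration of how the page-similarity lookup of Table 2 might be used, here is a minimal Python sketch. The values are copied from Table 2 and the pattern names from Table 1. The function name `page_similarity` and the tuple-based layout are assumptions made for this example, not the authors' implementation.

```python
# Page patterns from Table 1, in the row/column order of Table 2.
PATTERNS = ("search", "itemlisting", "item", "others")

# Similarity values copied from Table 2 (rows and columns follow PATTERNS).
PAGE_SIMILARITY = (
    (1.0, 0.0, 0.0, 0.0),  # search
    (0.0, 1.0, 0.0, 0.0),  # itemlisting
    (0.0, 0.0, 1.0, 0.3),  # item
    (0.0, 0.0, 0.3, 1.0),  # others
)

# Map each pattern name to its row/column index.
_INDEX = {name: i for i, name in enumerate(PATTERNS)}


def page_similarity(pattern_a: str, pattern_b: str) -> float:
    """Hypothetical helper: return the Table 2 similarity of two page patterns.

    Raises KeyError if either pattern is not one of PATTERNS.
    """
    return PAGE_SIMILARITY[_INDEX[pattern_a]][_INDEX[pattern_b]]


if __name__ == "__main__":
    # An item page and an "others" page are partially similar.
    print(page_similarity("item", "others"))
    # A search page and an item page are treated as unrelated.
    print(page_similarity("search", "item"))
```

Because the table is symmetric, the order of the two arguments does not matter. Only the item/others pair has a similarity strictly between 0 and 1.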