Harrison, Brent
The Goofus & Gallant Story Corpus for Practical Value Alignment
Nahian, Md Sultan Al, Tasrin, Tasmia, Frazier, Spencer, Riedl, Mark, Harrison, Brent
Values and principles are key elements of human society: they influence people to behave according to an accepted set of social rules and thereby help maintain social order. As AI systems become ubiquitous in human society, a major concern is that they could violate these norms or values and potentially cause harm. Thus, to prevent intentional or unintentional harm, AI systems are expected to take actions that align with these principles. Training systems to exhibit this type of behavior is difficult and often requires a specialized dataset. This work presents a multi-modal dataset illustrating normative and non-normative behavior in real-life situations described through natural language and artistic images. The dataset contains curated sets of images that were designed to teach young children about social principles, which we argue makes it an ideal resource for training socially normative agents.
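To make the structure of such a dataset concrete, below is a minimal sketch of how one multi-modal example (an image, a natural-language description, and a normative/non-normative label) might be represented and loaded. The field names, file layout, and the `GNGExample`/`load_examples` helpers are illustrative assumptions, not the released schema.

```python
# A minimal sketch of one Goofus & Gallant-style example.
# Field names and the tab-separated layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class GNGExample:
    image_path: str      # artistic illustration of the situation
    description: str     # natural-language description of the behavior
    is_normative: bool   # True for normative ("Gallant"), False for non-normative ("Goofus")

def load_examples(tsv_path: str) -> list[GNGExample]:
    """Read tab-separated rows of (image_path, description, label)."""
    examples = []
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            image_path, description, label = line.rstrip("\n").split("\t")
            examples.append(GNGExample(image_path, description, label == "1"))
    return examples
```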
Guiding Reinforcement Learning Using Uncertainty-Aware Large Language Models
Shoaeinaeini, Maryam, Harrison, Brent
Human guidance in reinforcement learning (RL) is often impractical for large-scale applications due to high costs and time constraints. Large Language Models (LLMs) offer a promising alternative to mitigate RL sample inefficiency and potentially replace human trainers. However, applying LLMs as RL trainers is challenging due to their overconfidence and less reliable solutions in sequential tasks. We address this limitation by introducing a calibrated guidance system that uses Monte Carlo Dropout to enhance LLM advice reliability by assessing prediction variances from multiple forward passes. Additionally, we develop a novel RL policy shaping method based on dynamic model average entropy to adjust the LLM's influence on RL policies according to guidance uncertainty. This approach ensures robust RL training by relying on reliable LLM guidance. To validate our contributions, we conduct extensive experiments in a Minigrid environment with three goals in varying environment sizes. The results showcase superior model performance compared to uncalibrated LLMs, unguided RL, and calibrated LLMs with different shaping policies. Moreover, we analyze various uncertainty estimation methods, demonstrating the effectiveness of average entropy in reflecting higher uncertainty in incorrect guidance. These findings highlight the persistent overconfidence in fine-tuned LLMs and underscore the importance of effective calibration in sequential decision-making problems.
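As a rough illustration of the two ideas in this abstract, the sketch below estimates an advice distribution with Monte Carlo Dropout and then mixes it with an RL policy using a weight that shrinks as the average entropy grows. The `AdviceHead` stand-in, the dropout rate, and the linear entropy-based mixing rule are assumptions for illustration, not the paper's exact formulation.

```python
# Monte Carlo Dropout calibration plus entropy-weighted policy shaping (toy sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdviceHead(nn.Module):
    """Stand-in for an LLM advice model that scores discrete actions."""
    def __init__(self, state_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Dropout(p=0.2),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def mc_dropout_advice(model, state, passes: int = 20):
    """Average action distribution and its entropy over stochastic forward passes."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(state), dim=-1) for _ in range(passes)])
    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy

def shape_policy(rl_probs, llm_probs, entropy, n_actions: int):
    """Mixture policy whose LLM weight decays as average-entropy uncertainty rises."""
    max_entropy = torch.log(torch.tensor(float(n_actions)))
    weight = (1.0 - entropy / max_entropy).clamp(0.0, 1.0)
    return weight * llm_probs + (1.0 - weight) * rl_probs

state = torch.randn(1, 8)
llm_probs, ent = mc_dropout_advice(AdviceHead(), state)
rl_probs = torch.full((1, 4), 0.25)
print(shape_policy(rl_probs, llm_probs, ent, n_actions=4))
```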
Controllable Neural Story Plot Generation via Reward Shaping
Tambwekar, Pradyumna, Dhuliawala, Murtaza, Martin, Lara J., Mehta, Animesh, Harrison, Brent, Riedl, Mark O.
Language-modeling-based approaches to story plot generation attempt to construct a plot by sampling from a language model (LM) to predict the next character, word, or sentence to add to the story. LM techniques lack the ability to receive guidance from the user to achieve a specific goal, resulting in stories that don't have a clear sense of progression and lack coherence. We present a reward-shaping technique that analyzes a story corpus and produces intermediate rewards that guide a pre-trained LM toward a given goal event. By themselves, large neural language models have been shown to work well on a variety of short-term tasks, such as understanding short children's stories [Radford et al., 2019]. However, while recurrent neural networks (RNNs) using LSTM or GRU cells can theoretically maintain long-term context in their hidden layers, in practice RNNs use only a relatively small part of the history of tokens [Khandelwal et al., 2018]. Consequently, stories or plots generated by RNNs tend to lose coherence as the generation continues.
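To make the reward-shaping idea concrete, here is a toy sketch in which events that a corpus suggests lie closer to a goal event earn larger intermediate rewards. The toy corpus, the distance estimate, and the reward form are illustrative assumptions, not the paper's actual formulation.

```python
# Toy goal-directed reward shaping for plot events.
from collections import defaultdict

def estimate_steps_to_goal(corpus_plots, goal_event):
    """Average number of events between each event and the goal across a corpus."""
    totals, counts = defaultdict(float), defaultdict(int)
    for plot in corpus_plots:
        if goal_event not in plot:
            continue
        goal_idx = plot.index(goal_event)
        for idx, event in enumerate(plot[:goal_idx]):
            totals[event] += goal_idx - idx
            counts[event] += 1
    return {e: totals[e] / counts[e] for e in totals}

def intermediate_reward(event, avg_steps, goal_event):
    if event == goal_event:
        return 1.0
    if event not in avg_steps:
        return 0.0
    return 1.0 / (1.0 + avg_steps[event])  # closer to the goal -> larger reward

corpus = [["meet", "argue", "reconcile", "marry"], ["meet", "travel", "marry"]]
steps = estimate_steps_to_goal(corpus, "marry")
print([intermediate_reward(e, steps, "marry") for e in ["meet", "argue", "marry"]])
```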
Machine Learning Approaches for Principle Prediction in Naturally Occurring Stories
Nahian, Md Sultan Al, Frazier, Spencer, Harrison, Brent, Riedl, Mark
Value alignment is the task of creating autonomous systems whose values align with those of humans. Past work has shown that stories are a potentially rich source of information on human values; however, past work has been limited to considering values in a binary sense. In this work, we explore the use of machine learning models for the task of normative principle prediction on naturally occurring story data. To do this, we extend a dataset that has previously been used to train a binary normative classifier with annotations of moral principles. We then use this dataset to train a variety of machine learning models, evaluate these models, and compare their results against those of humans who were asked to perform the same task. We show that while individual principles can be classified, the ambiguity of what "moral principles" represent poses a challenge for both human participants and autonomous systems faced with the same task.
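As a rough illustration of principle prediction framed as multi-label text classification, the sketch below fits a simple TF-IDF and logistic-regression pipeline on toy sentences. The example sentences, principle labels, and model choice are assumptions for illustration; they are not the dataset or the models evaluated in the paper.

```python
# Principle prediction as multi-label text classification (toy sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "She returned the lost wallet to its owner.",
    "He shoved past the line to get in first.",
    "They shared their lunch with a hungry classmate.",
    "She lied to her friend about where she had been.",
]
principles = [["honesty"], ["fairness"], ["generosity"], ["honesty"]]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(principles)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(sentences, y)
pred = model.predict(["He told the truth even though it was embarrassing."])
print(binarizer.inverse_transform(pred))
```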
StyleM: Stylized Metrics for Image Captioning Built with Contrastive N-grams
Li, Chengxi, Harrison, Brent
StyleCIDEr scores the similarity of two captions with respect to their styles. We evaluate these two metrics using three stylized captioning methods trained on the PERSONALITY-CAPTIONS and FlickrStyle10K datasets: UPDOWN, MULTI-UPDOWN, and SVinVL. We also perform a human study to explore how well each metric aligns with human judgments in similar situations.
Modelling Cournot Games as Multi-agent Multi-armed Bandits
Taywade, Kshitija, Harrison, Brent, Bagh, Adib
We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games, where firms acting as agents choose from a set of arms representing discrete production quantities. Agents interact with separate and independent bandit problems. In this formulation, each agent makes sequential choices among arms to maximize its own reward. Agents do not have any information about the environment; they can only see their own rewards after taking an action. However, the market demand is a stationary function of total industry output, and random entry or exit from the market is not allowed. Given these assumptions, we found that an $\epsilon$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches, as it does not require any additional knowledge of the system to operate. We also propose two novel approaches that take advantage of the ordered action space: $\epsilon$-greedy+HL and $\epsilon$-greedy+EL. These new approaches help firms focus on more profitable actions by eliminating less profitable choices, and hence are designed to optimize exploration. We use computer simulations to study the emergence of various equilibria in the outcomes and perform an empirical analysis of joint cumulative regret.
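A small sketch of repeated Cournot competition as independent $\epsilon$-greedy bandits follows. The linear inverse-demand parameters, the quantity grid, and the fixed epsilon are illustrative assumptions; the $\epsilon$-greedy+HL and $\epsilon$-greedy+EL variants are not reproduced here.

```python
# Repeated Cournot game as independent epsilon-greedy bandits (toy sketch).
import random

def price(total_quantity, a=100.0, b=1.0):
    """Assumed linear inverse demand: P = max(0, a - b * total output)."""
    return max(0.0, a - b * total_quantity)

class EpsilonGreedyFirm:
    def __init__(self, quantities, epsilon=0.1):
        self.quantities = quantities           # discrete arms (production levels)
        self.epsilon = epsilon
        self.counts = [0] * len(quantities)
        self.values = [0.0] * len(quantities)  # running mean reward per arm

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.quantities))
        return max(range(len(self.quantities)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

arms = list(range(0, 51, 5))
firms = [EpsilonGreedyFirm(arms) for _ in range(2)]
for _ in range(5000):
    choices = [f.choose() for f in firms]
    total = sum(f.quantities[c] for f, c in zip(firms, choices))
    p = price(total)
    for f, c in zip(firms, choices):
        f.update(c, p * f.quantities[c])   # profit = price * own quantity
print([f.quantities[max(range(len(arms)), key=lambda i: f.values[i])] for f in firms])
```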
Explore, Exploit or Listen: Combining Human Feedback and Policy Model to Speed up Deep Reinforcement Learning in 3D Worlds
Lin, Zhiyu, Harrison, Brent, Keech, Aaron, Riedl, Mark O.
We describe a method to use discrete human feedback to enhance the performance of deep learning agents in virtual three-dimensional environments by extending deep reinforcement learning to model the confidence and consistency of human feedback. This enables deep reinforcement learning algorithms to determine the most appropriate time to listen to the human feedback, exploit the current policy model, or explore the agent's environment. Managing the trade-off between these three strategies allows DRL agents to be robust to inconsistent or intermittent human feedback. Through experimentation using a synthetic oracle, we show that our technique improves the training speed and overall performance of deep reinforcement learning in navigating three-dimensional environments using Minecraft. We further show that our technique is robust to highly inaccurate human feedback and can also operate when no human feedback is given.
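Below is a minimal sketch of the explore/exploit/listen arbitration idea: when recorded feedback for a state is consistent enough, the agent listens; otherwise it falls back to epsilon-greedy exploration or exploitation of its policy. The agreement-rate confidence estimate and the thresholds are illustrative assumptions, not the arbitration rule used in the paper.

```python
# Explore / exploit / listen arbitration over discrete human feedback (toy sketch).
import random
from collections import defaultdict

class FeedbackModel:
    def __init__(self):
        self.votes = defaultdict(lambda: defaultdict(int))  # state -> action -> +/- tally

    def record(self, state, action, positive: bool):
        self.votes[state][action] += 1 if positive else -1

    def best_action(self, state):
        actions = self.votes.get(state)
        if not actions:
            return None, 0.0
        action, score = max(actions.items(), key=lambda kv: kv[1])
        total = sum(abs(v) for v in actions.values())
        confidence = max(score, 0) / total if total else 0.0   # consistency of feedback
        return action, confidence

def select_action(state, q_values, feedback, n_actions, epsilon=0.1, listen_threshold=0.7):
    advised, confidence = feedback.best_action(state)
    if advised is not None and confidence >= listen_threshold:
        return advised                                          # listen to the human
    if random.random() < epsilon:
        return random.randrange(n_actions)                      # explore
    return max(range(n_actions), key=lambda a: q_values[state][a])  # exploit the policy

q = defaultdict(lambda: [0.0] * 4)
fb = FeedbackModel()
fb.record("room_1", 2, positive=True)
fb.record("room_1", 2, positive=True)
print(select_action("room_1", q, fb, n_actions=4))
```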
Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior
Nahian, Md Sultan Al, Frazier, Spencer, Harrison, Brent, Riedl, Mark
As more machine learning agents interact with humans, there is a growing prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm. Value alignment is a property of intelligent agents wherein they solely pursue non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning in which we train an agent with two reward signals: a standard task performance reward plus a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model previously shown to classify text as normative or non-normative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative. We test our value-alignment technique on three interactive text-based worlds; each world is designed specifically to challenge agents with a task as well as provide opportunities to deviate from the task to engage in normative and/or altruistic behavior.
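The sketch below illustrates the idea of pairing a task reward with a normative-prior reward. The keyword-based classifier stub and the fixed weighted sum are assumptions for illustration; the paper balances the two signals through variations on policy shaping rather than a fixed linear blend.

```python
# Combining a task reward with a normative-prior reward (toy sketch).
def normative_score(action_text: str) -> float:
    """Stand-in for the value-aligned prior: 1.0 ~ normative, 0.0 ~ non-normative."""
    return 0.0 if "steal" in action_text else 1.0

def shaped_reward(task_reward: float, action_text: str, beta: float = 0.5) -> float:
    """Assumed blend: task reward plus a weighted normative bonus."""
    return task_reward + beta * normative_score(action_text)

# Two candidate actions in a text-based world with equal task reward.
for action in ["steal the medicine", "buy the medicine"]:
    print(action, shaped_reward(task_reward=1.0, action_text=action))
```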
Influencing Reinforcement Learning through Natural Language Guidance
Tasrin, Tasmia, Nahian, Md Sultan Al, Perera, Habarakadage, Harrison, Brent
Interactive reinforcement learning agents use human feedback or instruction to help them learn in complex environments. Often, this feedback comes in the form of a discrete signal that is either positive or negative. While informative, this information can be difficult to generalize on its own. In this work, we explore how natural language advice can be used to provide a richer feedback signal to a reinforcement learning agent by extending policy shaping, a well-known interactive reinforcement learning technique. Policy shaping typically employs a human feedback policy to help an agent learn how to achieve its goal. In our case, we replace this human feedback policy with a policy generated from natural language advice. We aim to examine whether generated natural language reasoning can support a deep reinforcement learning agent in choosing its actions successfully in a given environment. To do this, we design our model with three networks: an experience-driven network, an advice generator, and an advice-driven network. While the experience-driven reinforcement learning agent chooses its actions based on the environmental reward, the advice-driven network selects actions from the advice generated for each new state, assisting the reinforcement learning agent through policy shaping.
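A small sketch of the shaping step follows: the combined policy multiplies the experience-driven policy by an advice-driven policy, in the spirit of policy shaping. The advice generator and advice encoder are stubbed out, and the keyword matching is an illustrative assumption.

```python
# Advice-driven policy shaping (toy sketch).
import numpy as np

ACTIONS = ["left", "right", "up", "down"]

def advice_policy(advice_text: str, temperature: float = 0.5) -> np.ndarray:
    """Turn generated advice into a distribution over actions (toy keyword match)."""
    scores = np.array([1.0 if a in advice_text.lower() else 0.0 for a in ACTIONS])
    weights = np.exp(scores / temperature)
    return weights / weights.sum()

def shape(experience_probs: np.ndarray, advice_probs: np.ndarray) -> np.ndarray:
    """Policy-shaping-style combination: elementwise product, renormalized."""
    combined = experience_probs * advice_probs
    return combined / combined.sum()

experience = np.array([0.4, 0.3, 0.2, 0.1])          # from the experience-driven network
advice = advice_policy("Move right to reach the key.")
print(shape(experience, advice))
```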
Decentralized Marriage Models
Taywade, Kshitija, Goldsmith, Judy, Harrison, Brent
Most matching algorithms are centralized in that a single agent determines how other agents are matched together. This is contrary to how humans form matches in the real world. In this work, we propose three decentralized approaches for finding matchings that are inspired by three techniques that humans use to find matches. The first has individuals wander a grid environment, interacting with and forming preferences over potential partners. The second uses affiliation networks in which agencies recommend potential partners. The third is based on small-world social networks, where we assume that individuals probabilistically introduce their friends to one another. We introduce a heuristic algorithm that can be used in each of these environments, and we also explore how this algorithm can scale to a large number of agents.
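A toy sketch of decentralized matching through random pairwise encounters (standing in for wandering a grid) is given below. The "pair up only if both strictly prefer the new partner" rule and the example preference lists are illustrative assumptions, not the paper's heuristic.

```python
# Decentralized matching via random pairwise encounters (toy sketch).
import random

def decentralized_match(men, women, prefs, rounds=10000):
    partner = {p: None for p in men + women}

    def prefers(person, candidate):
        """True if the candidate ranks higher than the current partner (or person is single)."""
        current = partner[person]
        return current is None or prefs[person].index(candidate) < prefs[person].index(current)

    for _ in range(rounds):
        m, w = random.choice(men), random.choice(women)   # a chance encounter
        if prefers(m, w) and prefers(w, m):
            for old in (partner[m], partner[w]):
                if old is not None:
                    partner[old] = None                    # former partners become single
            partner[m], partner[w] = w, m
    return {m: partner[m] for m in men}

men, women = ["m1", "m2"], ["w1", "w2"]
prefs = {
    "m1": ["w1", "w2"], "m2": ["w1", "w2"],
    "w1": ["m2", "m1"], "w2": ["m1", "m2"],
}
print(decentralized_match(men, women, prefs))
```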