SC-Phi2: A Fine-tuned Small Language Model for StarCraft II Macromanagement Tasks
Khan, Muhammad Junaid, Sukthankar, Gita
This paper introduces SC-Phi2, a fine-tuned small language model for StarCraft II macromanagement tasks. Small language models, like Phi-2, Gemma, and DistilBERT, are streamlined versions of large language models (LLMs) with fewer parameters that require less power and memory to run. To teach Microsoft's Phi-2 model about StarCraft, we create a new SC2 text dataset with information about StarCraft races, roles, and actions and use it to fine-tune Phi-2 with self-supervised learning. We pair this language model with a Vision Transformer (ViT) from the pre-trained BLIP-2 (Bootstrapping Language-Image Pre-training) model, fine-tuning it on the MSC replay dataset. This enables us to construct dynamic prompts that include visual game state information. Unlike the large models used in StarCraft LLM systems, such as GPT-3.5, Phi-2 is trained primarily on textbook data and contains little inherent knowledge of StarCraft II beyond what is provided by our training process. By using LoRA (Low-Rank Adaptation) and quantization, our model can be trained on a single GPU. We demonstrate that our model performs well at macromanagement tasks such as build order and global state prediction with a small number of parameters.
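A minimal sketch of how such a single-GPU LoRA-plus-quantization setup might look with Hugging Face Transformers and PEFT; the rank, alpha, and target modules below are illustrative assumptions, since the abstract does not specify them:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization so the base model fits on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                            # low-rank update dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small LoRA adapters are trainable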
Visual Episodic Memory-based Exploration
Vice, Jack, Ruiz-Sanchez, Natalie, Douglas, Pamela K., Sukthankar, Gita
In humans, intrinsic motivation is an important mechanism for open-ended cognitive development; in robots, it has been shown to be valuable for exploration. An important aspect of human cognitive development is episodic memory, which enables both the recollection of past events and the projection of a subjective future. This paper explores the use of visual episodic memory as a source of intrinsic motivation for robotic exploration problems. Using a convolutional recurrent neural network autoencoder, the agent learns an efficient representation of spatiotemporal features; accurate sequence prediction is only possible once these features have been learned. Structural similarity between ground-truth and autoencoder-generated images is used as an intrinsic motivation signal to guide exploration. Our proposed episodic memory model also implicitly accounts for the agent's actions, motivating the robot to seek new interactive experiences rather than just areas that are visually dissimilar. When guiding robotic exploration, our proposed method outperforms the Curiosity-driven Variational Autoencoder (CVAE) at finding dynamic anomalies.
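A minimal sketch of an SSIM-based intrinsic reward of the kind described above, assuming frames normalized to [0, 1]; the exact reward shaping used in the paper may differ:

import numpy as np
from skimage.metrics import structural_similarity

def intrinsic_reward(predicted_frame: np.ndarray, observed_frame: np.ndarray) -> float:
    """Frames are float arrays in [0, 1] with shape (H, W, C)."""
    ssim = structural_similarity(
        predicted_frame, observed_frame,
        channel_axis=-1,   # compare per channel, then average
        data_range=1.0,
    )
    # Low structural similarity means the episodic-memory autoencoder has not
    # yet learned this spatiotemporal pattern, so the exploration bonus is high.
    return 1.0 - ssim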
Smart Sampling: Self-Attention and Bootstrapping for Improved Ensembled Q-Learning
Khan, Muhammad Junaid, Ahmed, Syed Hammad, Sukthankar, Gita
We present a novel method for enhancing the sample efficiency of ensemble Q-learning. Our approach integrates multi-head self-attention into the ensembled Q networks while bootstrapping the state-action pairs ingested by the ensemble. This not only improves performance over the original REDQ (Chen et al. 2021) and its variant DroQ (Hiraoka et al. 2022), yielding better Q predictions, but also effectively reduces both the average normalized bias and the standard deviation of the normalized bias within the Q-function ensembles. Importantly, our method performs well even in scenarios with a low update-to-data (UTD) ratio. Notably, the implementation is straightforward, requiring minimal modifications to the base model.
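A hypothetical sketch of the two ingredients named above, self-attention inside each ensemble Q network and bootstrapped batches per ensemble member; the token split and layer widths are assumptions, not the paper's exact architecture:

import torch
import torch.nn as nn

class AttentiveQNet(nn.Module):
    """One ensemble member: features are split into tokens that attend to each other."""
    def __init__(self, obs_dim, act_dim, n_tokens=4, token_dim=64, n_heads=4):
        super().__init__()
        self.n_tokens, self.token_dim = n_tokens, token_dim
        self.encoder = nn.Linear(obs_dim + act_dim, n_tokens * token_dim)
        self.attn = nn.MultiheadAttention(token_dim, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(n_tokens * token_dim, 1))

    def forward(self, obs, act):
        h = self.encoder(torch.cat([obs, act], dim=-1))
        h = h.view(-1, self.n_tokens, self.token_dim)   # split features into tokens
        h, _ = self.attn(h, h, h)                       # multi-head self-attention
        return self.head(h.flatten(1))                  # scalar Q-value

def bootstrap_indices(batch_size: int, device) -> torch.Tensor:
    """Resample a batch with replacement so each member trains on its own view."""
    return torch.randint(0, batch_size, (batch_size,), device=device)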
The Potential of Vision-Language Models for Content Moderation of Children's Videos
Ahmed, Syed Hammad, Hu, Shengnan, Sukthankar, Gita
Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation, since there are many reasons beyond violence and obscenity why a video might be inappropriate. For example, scammers may create junk content that mimics popular educational videos but contains no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot settings. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work on the Malicious or Benign (MOB) benchmark for video content moderation. The paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos, as they are not well represented in the CLIP training data.
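A rough sketch of a frozen-CLIP-plus-projection-layer classifier in the spirit of the proposed model; the checkpoint name, head width, and two-class setup are illustrative assumptions:

import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.requires_grad_(False)   # keep the CLIP backbone frozen

projection_head = nn.Sequential(
    nn.Linear(clip.config.projection_dim, 256),   # 512 -> 256 (assumed width)
    nn.ReLU(),
    nn.Linear(256, 2),                            # benign vs. malicious
)

def classify_frames(frames):
    """frames: list of PIL images sampled from a video."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)  # frozen CLIP image embeddings
    return projection_head(feats).softmax(dim=-1)  # only the head is trained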
Improving the Generalizability of Collaborative Dialogue Analysis with Multi-Feature Embeddings
Enayet, Ayesha, Sukthankar, Gita
Conflict prediction in communication is integral to the design of virtual agents that support successful teamwork by providing timely assistance. The aim of our research is to analyze discourse to predict collaboration success. Unfortunately, resource scarcity is a problem that teamwork researchers commonly face, since it is hard to gather a large number of training examples. To alleviate this problem, this paper introduces a multi-feature embedding (MFeEmb) that improves the generalizability of conflict prediction models trained on dialogue sequences. MFeEmb leverages textual, structural, and semantic information from the dialogues by incorporating lexical, dialogue act, and sentiment features. The use of dialogue act and sentiment features reduces the performance loss caused by natural distribution shifts, which stem mainly from changes in vocabulary. This paper demonstrates the performance of MFeEmb on domain adaptation problems in which the model is trained on discourse from one task domain and applied to predict team performance in a different domain. The generalizability of MFeEmb is quantified using the similarity measure proposed by Bontonou et al. (2021). Our results show that MFeEmb serves as an excellent domain-agnostic representation for meta-pretraining a few-shot model on collaborative multiparty dialogues.
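An illustrative composition of such a multi-feature utterance embedding; the feature extractors and dialogue-act inventory below are stand-ins, not the paper's exact pipeline:

import numpy as np

DIALOGUE_ACTS = ["statement", "question", "answer", "agreement", "disagreement"]

def multi_feature_embedding(lexical_vec: np.ndarray,
                            dialogue_act: str,
                            sentiment_score: float) -> np.ndarray:
    """lexical_vec: any sentence embedding (e.g., averaged word vectors);
    sentiment_score: polarity in [-1, 1] from any sentiment analyzer."""
    act_onehot = np.zeros(len(DIALOGUE_ACTS))
    act_onehot[DIALOGUE_ACTS.index(dialogue_act)] = 1.0
    # Concatenate lexical, dialogue-act, and sentiment features into one vector.
    return np.concatenate([lexical_vec, act_onehot, [sentiment_score]])

A dialogue then becomes a sequence of such vectors, which a downstream sequence model consumes to predict conflict or team performance; the dialogue-act and sentiment components remain informative even when the vocabulary shifts across domains.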
Leveraging the Variance of Return Sequences for Exploration Policy
Xi, Zerong, Sukthankar, Gita
This paper introduces a method for constructing an upper bound for an exploration policy using either the weighted variance of return sequences or the weighted temporal difference (TD) error. We demonstrate that the variance of the return sequence for a specific state-action pair is an important information source that can be leveraged to guide exploration in reinforcement learning. The intuition is that fluctuation in the return sequence indicates greater uncertainty in near-future returns. This divergence arises from the cyclic nature of value-based reinforcement learning: the evolving value function begets policy improvements, which in turn modify the value function. Although variance and TD errors capture different aspects of this uncertainty, our analysis shows that both can be valuable for guiding exploration. We propose a two-stream network architecture to estimate the weighted variance/TD errors within DQN agents and show that our exploration method outperforms the baseline on a wide range of Atari games.
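A minimal sketch of exploration driven by the weighted variance of a return sequence, tracking an exponentially weighted mean and variance per state-action pair and acting on an uncertainty-inflated value; the decay and bonus coefficients are assumed, and the paper's two-stream network estimator is not reproduced here:

import math

class ReturnVarianceTracker:
    """Exponentially weighted statistics of the return sequence for one (s, a)."""
    def __init__(self, decay: float = 0.9):
        self.decay, self.mean, self.var = decay, 0.0, 0.0

    def update(self, ret: float) -> None:
        delta = ret - self.mean
        self.mean += (1.0 - self.decay) * delta
        # Standard incremental update for exponentially weighted variance.
        self.var = self.decay * (self.var + (1.0 - self.decay) * delta * delta)

    def bonus(self, c: float = 1.0) -> float:
        # Fluctuating returns => larger bonus => more exploration of this pair.
        return c * math.sqrt(self.var)

Action selection then becomes the argmax over Q(s, a) plus the tracked bonus, so state-action pairs with unstable return sequences are revisited more often.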
Reports on the 2016 IJCAI Workshop Series
Srivastava, Biplav (AAAI) | Sukthankar, Gita (University of Central Florida)
Political analysis and intelligence analysis; models of biomedical argumentation in research journals and popular media; annotation of rhetorical figures; and embedding morality when handling preferences and dealing with the potential and risks of big data were identified as challenging endeavors for the future.
Architectures for Activity Recognition and Context-Aware Computing
Geib, Christopher (Drexel University) | Agrawal, Vikas (Infosys Limited) | Sukthankar, Gita (University of Central Florida) | Shastri, Lokendra (Infosys Limited) | Bui, Hung (Nuance Communications)
The last 10 years have seen the development of novel architectures and technologies for domain-focused, task-specific systems that know many things, such as who (identities, profile, history) they are with (social context) and in what role (responsibility, security, privacy); when and where (event, time, place); why (goals, shared or personal); how they are doing it (methods, applications); and using what resources (device, services, access, and ownership). Smart spaces and devices will increasingly use such contextual knowledge to help users move seamlessly between devices and applications, without having to explicitly carry, transfer, and exchange activity context. Such systems will qualitatively shift our lives both at work and play and significantly change our interactions with both our physical and virtual worlds. This dream of seamlessly interacting with our virtual environment has a long history, as can be seen in Apple Inc.'s 1987 Knowledge Navigator concept video. However, the combination of dramatic progress in low-power mobile computing devices and sensors with advances in artificial intelligence and human-computer interaction (HCI) over the last decade has provided the kinds of platforms and algorithms that are enabling context-aware virtual personal assistants that plan activities and recognize intent. This has led to an increase in work designed to bring these ideas into real-world applications and address the final technical hurdles that will make such systems a reality.
The Ninth Annual AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE): A Report
Sukthankar, Gita (University of Central Florida) | Horswill, Ian (Northwestern University)
The Ninth Annual AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE) was held October 14–18, 2013, at Northeastern University in Boston, Massachusetts. The mission of the AIIDE conference is to provide a forum for researchers and game developers to discuss ways that AI can enhance games and other forms of interactive entertainment. In addition to presentations on adapting standard AI techniques such as search, planning, and machine learning for use within games, key topic areas include creating realistic autonomous characters, interactive narrative, procedural content generation, and integrating AI into game design and production tools.