opponent player
SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models
Wang, Yibo, Chen, Qing-Guo, Xu, Zhao, Luo, Weihua, Zhang, Kaifu, Zhang, Lijun
Self-play fine-tuning has demonstrated promising abilities in adapting large language models (LLMs) to downstream tasks with limited real-world data. The basic principle is to iteratively refine the model with real samples and synthetic ones generated from itself. However, the existing methods primarily focus on the relative gaps between the rewards for the two types of data, neglecting their absolute values. Through theoretical analysis, we identify that the gap-based methods suffer from unstable evolution, due to potentially degenerate objectives. To address this limitation, we introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE), which leverages noise contrastive estimation to capture the real-world data distribution. Specifically, SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. As a result, SPACE independently optimizes the absolute reward values for each type of data, ensuring a consistently meaningful objective and thereby avoiding the instability issue. Theoretically, we show that the optimal solution of the objective in SPACE aligns with the underlying distribution of real-world data, and SPACE guarantees a provably stable convergence to the optimal distribution. Empirically, we show that SPACE significantly improves the performance of LLMs across various tasks, and outperforms supervised fine-tuning that employs many more real-world samples. Compared to gap-based self-play fine-tuning methods, SPACE exhibits remarkable superiority and stable evolution.
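The difference between a gap-based objective and one that optimizes absolute reward values can be shown with two scalar loss functions. This is a minimal sketch. It assumes each real and synthetic sample already has a scalar reward, and it treats that reward as a classifier logit, as in standard noise contrastive estimation. `gap_loss` is a generic pairwise logistic loss of the kind gap-based methods use. `nce_style_loss` is a generic binary-classification (logistic) loss that labels real samples 1 and synthetic samples 0. Both function names are hypothetical, and neither is the exact SPACE objective, whose reward definition is not reproduced here.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) for both large positive and negative x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def gap_loss(r_real: float, r_synth: float) -> float:
    # Pairwise logistic loss: depends only on the gap r_real - r_synth,
    # so adding the same constant to both rewards leaves it unchanged.
    return -log_sigmoid(r_real - r_synth)


def nce_style_loss(r_real: float, r_synth: float) -> float:
    # Binary classification loss: real samples are labeled 1 and synthetic
    # samples 0, so each reward is pushed toward its own absolute target.
    return -log_sigmoid(r_real) - log_sigmoid(-r_synth)


# Shift both rewards by +5: the gap loss is identical, the NCE-style loss is not.
print(gap_loss(1.0, 0.0), gap_loss(6.0, 5.0))
print(nce_style_loss(1.0, 0.0), nce_style_loss(6.0, 5.0))
```

Because `gap_loss` is invariant to a shared shift of both rewards, it puts no constraint on the rewards' absolute level. The classification-style loss does constrain each reward on its own, which matches the abstract's point about absolute values.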
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
Yuan, Huizhuo, Chen, Zixiang, Ji, Kaixuan, Gu, Quanquan
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI), especially when compared with the remarkable progress made in fine-tuning Large Language Models (LLMs). While cutting-edge diffusion models such as Stable Diffusion (SD) and SDXL rely on supervised fine-tuning, their performance inevitably plateaus after seeing a certain volume of data. Recently, reinforcement learning (RL) has been employed to fine-tune diffusion models with human preference data, but it requires at least two images ("winner" and "loser" images) for each text prompt. In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion), where the diffusion model engages in competition with its earlier versions, facilitating an iterative self-improvement process. Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment. Our experiments on the Pick-a-Pic dataset reveal that SPIN-Diffusion outperforms the existing supervised fine-tuning method in terms of human preference alignment and visual appeal right from its first iteration. By the second iteration, it exceeds the performance of RLHF-based methods across all metrics, achieving these results with less data.
Denoising Opponents Position in Partial Observation Environment
Sayareh, Aref, Sardari, Aria, Khoddami, Vahid, Zare, Nader, da Fonseca, Vinicius Prado, Soares, Amilcar
The RoboCup competitions hold various leagues, and the Soccer Simulation 2D League is one of the major ones. A Soccer Simulation 2D (SS2D) match involves two teams, each consisting of 11 players and a coach, competing against each other. During the game, the players can communicate only with the Soccer Simulation Server. Several code bases have been released publicly to simplify team development, so researchers can focus on decision-making and on implementing machine learning methods. SS2D actions and behaviors are only partially accurate because of challenges such as noise and partial observation. Therefore, one strategy is to implement alternative denoising methods to tackle observation inaccuracy. Our idea is to use machine learning methods to predict the positions of opponents that have not been seen for a finite number of cycles, so that players can perform more accurate actions such as passes. We explain our position prediction approach, powered by Long Short-Term Memory (LSTM) models and Deep Neural Networks (DNN). The results show that the LSTM and DNN predict the opponents' positions more accurately than standard algorithms such as the last-seen method.
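The last-seen method, the baseline the abstract compares against, is simple enough to sketch directly. This minimal Python sketch assumes opponent observations arrive once per cycle as a dictionary that maps an opponent id to an `(x, y)` position, or to `None` when that opponent is not visible. The class and function names are hypothetical and do not correspond to any SS2D code base API. A learned predictor such as the LSTM or DNN described above would replace `predict` with a model that uses the observation history and the number of unseen cycles.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Position = Tuple[float, float]  # (x, y) field coordinates


@dataclass
class LastSeenTracker:
    """Hypothetical last-seen baseline: an unseen opponent is assumed to
    stay exactly where it was last observed."""

    last_seen: Dict[int, Position] = field(default_factory=dict)
    cycles_unseen: Dict[int, int] = field(default_factory=dict)

    def update(self, observations: Dict[int, Optional[Position]]) -> None:
        # observations maps opponent id -> (x, y) if seen this cycle, else None.
        for opp_id, pos in observations.items():
            if pos is not None:
                self.last_seen[opp_id] = pos
                self.cycles_unseen[opp_id] = 0
            elif opp_id in self.last_seen:
                # Seen before but not this cycle: the estimate gets one cycle older.
                self.cycles_unseen[opp_id] += 1

    def predict(self, opp_id: int) -> Optional[Position]:
        # Last observed position, or None if the opponent was never seen.
        return self.last_seen.get(opp_id)


def position_error(predicted: Position, actual: Position) -> float:
    # Euclidean distance, a simple metric for comparing position predictors.
    return math.dist(predicted, actual)
```

For example, if opponent 7 is seen at `(10.0, -5.0)` and then leaves the view cone, `predict(7)` keeps returning `(10.0, -5.0)` while `cycles_unseen[7]` grows. The error against the true position therefore tends to increase with every unseen cycle, and this is the weakness a learned predictor targets.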
Game Theory Meets AI and NLP
Before going further, you'll need to understand the concept of game theory. Game theory is a branch of applied mathematics that provides various tools for analyzing different situations. The parties in a game are usually referred to as players, and the decisions they make are interdependent. This is similar to playing chess, where each move one player makes is linked to the opponent's future strategy.
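A best-response calculation on a small two-player game shows how interdependent decisions work. This sketch assumes a hypothetical two-by-two coordination game, written as a payoff matrix. Each player's best action depends on the other player's choice. A pair of actions in which each is a best response to the other is a pure-strategy Nash equilibrium. The function names and payoffs are illustrative only.

```python
from itertools import product
from typing import List, Tuple

# payoffs[i][j] = (row player's payoff, column player's payoff)
# when the row player picks action i and the column player picks action j.
Payoffs = List[List[Tuple[float, float]]]


def row_best_responses(payoffs: Payoffs, col_action: int) -> List[int]:
    # Row actions that maximize the row player's payoff, given the column's action.
    values = [payoffs[i][col_action][0] for i in range(len(payoffs))]
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]


def col_best_responses(payoffs: Payoffs, row_action: int) -> List[int]:
    # Column actions that maximize the column player's payoff, given the row's action.
    values = [payoffs[row_action][j][1] for j in range(len(payoffs[row_action]))]
    best = max(values)
    return [j for j, v in enumerate(values) if v == best]


def pure_nash_equilibria(payoffs: Payoffs) -> List[Tuple[int, int]]:
    # Action pairs where each player's action is a best response to the other's.
    rows, cols = len(payoffs), len(payoffs[0])
    return [
        (i, j)
        for i, j in product(range(rows), range(cols))
        if i in row_best_responses(payoffs, j) and j in col_best_responses(payoffs, i)
    ]


# Hypothetical coordination game: both players gain by matching, and matching on 0 pays more.
coordination: Payoffs = [
    [(2, 2), (0, 0)],
    [(0, 0), (1, 1)],
]
```

In this game, the row player's best action switches from 0 to 1 depending on whether the column player picks 0 or 1. Neither player can choose well without reasoning about the other, which is the interdependence described above.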
Cyrus 2D Simulation Team Description Paper 2016
Zare, Nader, Keshavarzi, Ashkan, Beheshtian, Seyed Ehsan, Mowla, Hadi, Akbarpour, Aryan, Jafari, Hossein, Baraghi, Keyvan Arab, Zarifi, Mohammad Amin, Javidan, Reza
This description explains the algorithms that Cyrus team members have implemented or are currently implementing. It gives a brief explanation of the shoot, block, mark, and defensive decision algorithms, and describes the parts that have already been implemented. The base code that Cyrus used is agent 3.11.