Reinforcement Learning
Grounded Reinforcement Learning: Learning to Win the Game under Human Commands
Supplementary Materials

In this section, we describe the details of the MiniRTS environment and the human dataset. The data do not contain any personally identifiable information or offensive content.

Figure 1: MiniRTS [2] implements the rock-paper-scissors attack graph; each army type has some units it is effective against and vulnerable to. "swordman", "spearman" and "cavalry" are all effective against "archer".

Figure 2: Building units can produce different army units using resources.

Resource Units: Resource units are stationary and neutral.
A Learning and Sampling
A.1 Deep generative modelling

A complete trajectory is denoted by ζ = {s …

The log-likelihood function is: L(θ) = Σ …

Applying this simple identity, we also have: 0 = E …

On the other hand, it discourages action samples directly sampled from the prior.

To ensure the transition model's validity, it needs to be grounded in real-world dynamics when jointly learned with the policy. Otherwise, the agent would be purely hallucinating based on the demonstrations. This would not be a problem if the action space were quantized.

Intuitively, action samples at each step are updated with the energy of all subsequent actions and a single-step forward by back-propagation.

To train the policy, Eq. (8) can now be rewritten as δ … Eq. (5) is an empirical estimate of E …

We first prove the construction above is valid at optimality.
The Value-Equivalence Principle for Model-Based Reinforcement Learning
Supplementary Material

Moreover, we include an additional result which illustrates a situation in which approximate VE models can outperform the MLE model.

For each (i, j) pair, the above expression is suggestive of a dot product between two nm-dimensional vectors: a combination of $a_i$ and $c_j$, and a "flattened" version of $B$. Define the former combination of vectors as $d_{ij} = [a_{i1}c_{j1}, a_{i1}c_{j2}, \ldots, a_{in}c_{jm}]^\top \in \mathbb{R}^{nm \times 1}$, and stack them as rows: $D = [d_{11}, d_{12}, \ldots, d_{nm}]^\top \in \mathbb{R}^{k\ell \times nm}$. To flatten $B$, simply define $b = [B_{11}, B_{12}, \ldots, B_{nm}]^\top$.

Finally, notice that the construction of $d_{ij}$ can be thought of as vertically stacking $n$ copies of $c_j$, each scaled by a different entry in $a_i$. This means that scaled copies of both $a_i$ and $c_j$ can be found by selecting specific groups of indices in $d_{ij}$. It follows that if $a_1, \ldots, a_n$ are linearly independent then so are $d_{1j}, \ldots, d_{nj}$ for any $j$.
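The flattening construction can be checked numerically. The sketch below is only an illustration and not part of the original material: it assumes the expression in question is the bilinear form $a_i^\top B c_j$, uses NumPy's `np.kron` to build $d_{ij}$, and draws random example data. The helper name `flatten_pair`, the sizes, and the seed are hypothetical choices. The rank check at the end additionally assumes $c_j$ is nonzero.

```python
import numpy as np


def flatten_pair(a, c):
    """Build d with entries a[p] * c[q], ordered to match B flattened row by row.

    np.kron(a, c) stacks len(a) copies of c, each scaled by one entry of a.
    """
    return np.kron(a, c)


rng = np.random.default_rng(0)    # hypothetical seed for reproducibility
n, m = 3, 4                       # hypothetical dimensions
A = rng.standard_normal((n, n))   # rows play the role of a_1, ..., a_n
C = rng.standard_normal((2, m))   # rows play the role of c_1, c_2
B = rng.standard_normal((n, m))

# b = [B_11, B_12, ..., B_nm]: row-major flattening of B.
b = B.reshape(-1)

# Entry (i, j) of this matrix is the bilinear form a_i^T B c_j.
bilinear = A @ B @ C.T

# Stack all d_ij as rows of D, then take dot products with b.
D = np.stack([flatten_pair(a, c) for a in A for c in C])
dot_products = (D @ b).reshape(A.shape[0], C.shape[0])
identity_holds = np.allclose(bilinear, dot_products)

# For a fixed j, the rows d_1j, ..., d_nj should have the same rank as A.
D_j = np.stack([flatten_pair(a, C[0]) for a in A])
rank_A = np.linalg.matrix_rank(A)
rank_D_j = np.linalg.matrix_rank(D_j)

print(identity_holds, rank_A, rank_D_j)
```

Using the Kronecker product keeps the index ordering of $d_{ij}$ aligned with the row-by-row flattening of $B$, which is what makes the dot product reproduce the bilinear form.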