TrueSkill Through Time: Revisiting the History of Chess

Neural Information Processing Systems

We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Basedon these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players' lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player's ability to force a draw provides significantly better predictive power.


ReACTR: Realtime Algorithm Configuration through Tournament Rankings

AAAI Conferences

It is now readily accepted that automated algorithm configuration is a necessity for ensuring optimized performance of solvers on a particular problem domain. Even the best developers who have carefully designed their solver are not always able to manually find the best parameter settings for it. Yet, the opportunity for improving performance has been repeatedly demonstrated by configuration tools like ParamILS, SMAC, and GGA. However, all these techniques currently assume a static environment, where demonstrative instances are procured beforehand, potentially unlimited time is provided to adequately search the parameter space, and the solver would never need to be retrained. This is not always the case in practice. The ReACT system, proposed in 2014, demonstrated that a solver could be configured during runtime as new instances arrive in a steady stream. This paper further develops that approach and shows how a ranking scheme, like TrueSkill, can further improve the configurator's performance, making it able to quickly find good parameterizations without adding any overhead on the time needed to solve any new instance, and then continuously improve as new instances are evaluated. The enhancements to ReACT that we present enable us to even outperform existing static configurators like SMAC in a non-dynamic setting.


Simulating Ability: Representing Skills in Games

arXiv.org Artificial Intelligence

Throughout the history of games, representing the abilities of the various agents acting on behalf of the players has been a central concern. With increasingly sophisticated games emerging, these simulations have become more realistic, but the underlying mechanisms are still, to a large extent, of an ad hoc nature. This paper proposes using a logistic model from psychometrics as a unified mechanism for task resolution in simulation-oriented games.


EPP: interpretable score of model predictive power

arXiv.org Machine Learning

The most important part of model selection and hyperparameter tuning is the evaluation of model performance. The most popular measures, such as AUC, F1, ACC for binary classification, or RMSE, MAD for regression, or cross-entropy for multilabel classification share two common weaknesses. First is, that they are not on an interval scale. It means that the difference in performance for the two models has no direct interpretation. It makes no sense to compare such differences between datasets. Second is, that for k-fold cross-validation, the model performance is in most cases calculated as an average performance from particular folds, which neglects the information how stable is the performance for different folds. In this talk, we introduce a new EPP rating system for predictive models. We also demonstrate numerous advantages for this system, First, differences in EPP scores have probabilistic interpretation. Based on it we can assess the probability that one model will achieve better performance than another. Second, EPP scores can be directly compared between datasets. Third, they can be used for navigated hyperparameter tuning and model selection. Forth, we can create embeddings for datasets based on EPP scores.


Predicting Army Combat Outcomes in StarCraft

AAAI Conferences

Smart decision making at the tactical level is important for Artificial Intelligence (AI) agents to perform well in the domain of real-time strategy (RTS) games.  This paper presents a Bayesian model that can be used to predict the outcomes of isolated battles, as well as predict what units are needed to defeat a given army.  Model parameters are learned from simulated battles, in order to minimize the dependency on player skill.  We apply our model to the game of StarCraft,  with the end-goal of using the predictor as a module for making high-level combat decisions, and show that the model is capable of making accurate predictions.