Quantum Computing (QC) is a disruptive technology that may not be too far ahead in the horizon. Small proof-of-concept quantum computers have already been built  and major obstacles to large-scale quantum computing are being heavily researched  . Among its potential uses, QC will allow breaking classical cryptographic codes, simulate large quantum systems and faster search and optimization  . This last use case is of particular interest to Artificial Intelligence (AI) Strategy. In particular, variants of the Grover algorithm can be exploited to gain a quadratic speedup in search problems, and some recent Quantum Machine Learning (QML) developments have led to exponential gains in certain Machine Learning tasks  (though with important caveats which may invalidate their practical use ). These ideas have the potential to exert a transformative effect on research in AI (as noted in , for example). Furthermore the technical aspects of QC, which put some physical limits on the observation of the inner workings of a quantum machine and hinder the verification of quantum computations , may pose an additional challenge for AI Alignment concerns. In this short article we introduce a heuristic model of quantum computing that captures the most relevant characteristics of QC for technical AI Alignment research.
Deep neural networks have been successfully applied in learning the board games Go, chess and shogi without prior knowledge by making use of reinforcement learning. Although starting from zero knowledge has been shown to yield impressive results, it is associated with high computationally costs especially for complex games. With this paper, we present CrazyAra which is a neural network based engine solely trained in supervised manner for the chess variant crazyhouse. Crazyhouse is a game with a higher branching factor than chess and there is only limited data of lower quality available compared to AlphaGo. Therefore, we focus on improving efficiency in multiple aspects while relying on low computational resources. These improvements include modifications in the neural network design and training configuration, the introduction of a data normalization step and a more sample efficient Monte-Carlo tree search which has a lower chance to blunder. After training on 569,537 human games for 1.5 days we achieve a move prediction accuracy of 60.4%. During development, versions of CrazyAra played professional human players. Most notably, CrazyAra achieved a four to one win over 2017 crazyhouse world champion Justin Tan (aka LM Jann Lee) who is more than 400 Elo higher rated compared to the average player in our training set. Furthermore, we test the playing strength of CrazyAra on CPU against all participants of the second Crazyhouse Computer Championships 2017, winning against twelve of the thirteen participants. Finally, for CrazyAraFish we continue training our model on generated engine games. In ten long-time control matches playing Stockfish 10, CrazyAraFish wins three games and draws one out of ten matches.
Consistent and reproducible evaluation of Deep Reinforcement Learning (DRL) is not straightforward. In the Arcade Learning Environment (ALE), small changes in environment parameters such as stochasticity or the maximum allowed play time can lead to very different performance. In this work, we discuss the difficulties of comparing different agents trained on ALE. In order to take a step further towards reproducible and comparable DRL, we introduce SABER, a Standardized A tari BEnchmark for general R einforcement learning algorithms. Our methodology extends previous recommendations and contains a complete set of environment parameters as well as train and test procedures. We then use SABER to evaluate the current state of the art, Rainbow. Furthermore, we introduce a human world records baseline, and argue that previous claims of expert or superhuman performance of DRL might not be accurate. Finally, we propose Rainbow-IQN by extending Rainbow with Implicit Quantile Networks (IQN) leading to new state-of-the-art performance. Source code is available for reproducibility.
Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness efficiently.
In this paper, we introduce the Double Traveling Salesman Problem with Partial Last-In-First-Out Loading Constraints (DTSPPL), a pickup-and-delivery single-vehicle routing problem where all pickup operations must be performed before any delivery one because the pickup and delivery areas are geographically separated. The vehicle collects items in the pickup area and loads them into its container, a horizontal stack. After performing all pickup operations, the vehicle begins delivering the items in the delivery area. Loading and unloading operations must obey a partial Last-In-First-Out (LIFO) policy, i.e., a version of the LIFO policy that may be violated within a given reloading depth. The objective of the DTSPPL is to minimize the total cost, which involves the total distance traveled by the vehicle and the number of reloaded items due to violations of the standard LIFO policy. We formally describe the DTSPPL by means of two Integer Linear Programming (ILP) formulations, and propose a heuristic algorithm based on the Biased Random-Key Genetic Algorithm (BRKGA) to find high-quality solutions. The performance of the proposed solution approaches is assessed over a broad set of instances. Computational results have shown that both ILP formulations were able to solve only the smaller instances, whereas the BRKGA obtained better solutions for almost all instances, requiring shorter computational time.
The article considers a simulation modelling problem related to the chess game process occurring between two three-tier manipulators. The objective of the game construction lies in developing the procedure of effective control of the autonomous manipulator robots located in a common operating environment. The simulation model is a preliminary stage of building a natural complex that would provide cooperation of several manipulator robots within a common operating environment. The article addresses issues of training and research.
The Shapley value has become a popular method to attribute the prediction of a machine-learning model on an input to its base features. The Shapley value  is known to be the unique method that satisfies certain desirable properties, and this motivates its use. Unfortunately, despite this uniqueness result, there are a multiplicity of Shapley values used in explaining a model's prediction. This is because there are many ways to apply the Shapley value that differ in how they reference the model, the training data, and the explanation context. In this paper, we study an approach that applies the Shapley value to conditional expectations (CES) of sets of features (cf. ) that subsumes several prior approaches within a common framework. We provide the first algorithm for the general version of CES. We show that CES can result in counterintuitive attributions in theory and in practice (we study a diabetes prediction task); for instance, CES can assign non-zero attributions to features that are not referenced by the model. In contrast, we show that an approach called the Baseline Shapley (BS) does not exhibit counterintuitive attributions; we support this claim with a uniqueness (axiomatic) result. We show that BS is a special case of CES, and CES with an independent feature distribution coincides with a randomized version of BS. Thus, BS fits into the CES framework, but does not suffer from many of CES's deficiencies.
In abstract argumentation theory, many argumentation semantics have been proposed for evaluating argumentation frameworks. This paper is based on the following research question: Which semantics corresponds well to what humans consider a rational judgment on the acceptability of arguments? There are two systematic ways to approach this research question: A normative perspective is provided by the principle-based approach, in which semantics are evaluated based on their satisfaction of various normatively desirable principles. A descriptive perspective is provided by the empirical approach, in which cognitive studies are conducted to determine which semantics best predicts human judgments about arguments. In this paper, we combine both approaches to motivate a new argumentation semantics called SCF2. For this purpose, we introduce and motivate two new principles and show that no semantics from the literature satisfies both of them. We define SCF2 and prove that it satisfies both new principles. Furthermore, we discuss findings of a recent empirical cognitive study that provide additional support to SCF2.
Despite a multitude of empirical studies, little consensus exists on whether neural networks are able to generalise compositionally, a controversy that, in part, stems from a lack of agreement about what it means for a neural model to be compositional. As a response to this controversy, we present a set of tests that provide a bridge between, on the one hand, the vast amount of linguistic and philosophical theory about compositionality and, on the other, the successful neural models of language. We collect different interpretations of compositionality and translate them into five theoretically grounded tests that are formulated on a task-independent level. In particular, we provide tests to investigate (i) if models systematically recombine known parts and rules (ii) if models can extend their predictions beyond the length they have seen in the training data (iii) if models' composition operations are local or global (iv) if models' predictions are robust to synonym substitutions and (v) if models favour rules or exceptions during training. To demonstrate the usefulness of this evaluation paradigm, we instantiate these five tests on a highly compositional data set which we dub PCFG SET and apply the resulting tests to three popular sequence-to- sequence models: a recurrent, a convolution based and a transformer model. We provide an in depth analysis of the results, that uncover the strengths and weaknesses of these three architectures and point to potential areas of improvement.
Recommender Systems are nowadays successfully used by all major web sites (from e-commerce to social media) to filter content and make suggestions in a personalized way. Academic research largely focuses on the value of recommenders for consumers, e.g., in terms of reduced information overload. To what extent and in which ways recommender systems create business value is, however, much less clear, and the literature on the topic is scattered. In this research commentary, we review existing publications on field tests of recommender systems and report which business-related performance measures were used in such real-world deployments. We summarize common challenges of measuring the business value in practice and critically discuss the value of algorithmic improvements and offline experiments as commonly done in academic environments. Overall, our review indicates that various open questions remain both regarding the realistic quantification of the business effects of recommenders and the performance assessment of recommendation algorithms in academia.