 Reinforcement Learning


Minimax Sample Complexity for Turn-based Stochastic Game

arXiv.org Machine Learning

The empirical success of multi-agent reinforcement learning is encouraging, yet few theoretical guarantees have been established. In this work, we prove that the plug-in solver approach, probably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSGs). Specifically, we plan in an empirical TBSG by utilizing a 'simulator' that allows sampling from an arbitrary state-action pair. We show that the empirical Nash equilibrium strategy is an approximate Nash equilibrium strategy in the true TBSG and give both problem-dependent and problem-independent bounds. We develop absorbing TBSG and reward perturbation techniques to tackle the complex statistical dependence. The key idea is to artificially introduce a suboptimality gap in the TBSG, so that the Nash equilibrium strategy lies in a finite set.
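The plug-in idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm or analysis: the tabular representation, the function names `sample_empirical_model` and `solve_tbsg`, the `owner` array marking which player moves in each state, and the use of plain value iteration as the planner are all assumptions made for illustration.

```python
import numpy as np

def sample_empirical_model(P, n_samples, rng):
    """Build an empirical transition model by querying a simulator.

    P has shape (S, A, S) and plays the role of the simulator here:
    for every state-action pair we draw n_samples next states.
    """
    S, A, _ = P.shape
    P_hat = np.zeros_like(P, dtype=float)
    for s in range(S):
        for a in range(A):
            nxt = rng.choice(S, size=n_samples, p=P[s, a])
            P_hat[s, a] = np.bincount(nxt, minlength=S) / n_samples
    return P_hat

def solve_tbsg(P, R, owner, gamma=0.9, iters=500):
    """Value iteration for a turn-based zero-sum game (illustrative planner).

    owner[s] == 0: the maximizing player moves in state s.
    owner[s] == 1: the minimizing player moves in state s.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # shape (S, A)
        V = np.where(owner == 0, Q.max(axis=1), Q.min(axis=1))
    Q = R + gamma * (P @ V)
    policy = np.where(owner == 0, Q.argmax(axis=1), Q.argmin(axis=1))
    return V, policy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    P = rng.dirichlet(np.ones(S), size=(S, A))   # hypothetical true game
    R = rng.uniform(size=(S, A))
    owner = np.array([0, 1, 0, 1])
    P_hat = sample_empirical_model(P, n_samples=2000, rng=rng)
    V_hat, pi_hat = solve_tbsg(P_hat, R, owner)  # plan in the empirical game
    print(V_hat, pi_hat)
```

The strategy computed in the empirical game is then used in the true game; the abstract's claim concerns how many simulator calls make that strategy near-optimal, which this sketch does not attempt to verify.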


Human-Agent Cooperation in Bridge Bidding

arXiv.org Artificial Intelligence

We introduce a human-compatible reinforcement-learning approach to a cooperative game, making use of a third-party hand-coded human-compatible bot to generate initial training data and to perform initial evaluation. Our learning approach consists of imitation learning, search, and policy iteration. Our trained agents achieve a new state-of-the-art for bridge bidding in three settings: an agent playing in partnership with a copy of itself; an agent partnering a pre-existing bot; and an agent partnering a human player.


An open-ended learning architecture to face the REAL 2020 simulated robot competition

arXiv.org Artificial Intelligence

Open-ended learning is a core research field of machine learning and robotics aiming to build learning machines and robots able to autonomously acquire knowledge and skills and to reuse them to solve novel tasks. The multiple challenges posed by open-ended learning have been operationalized in the robotic competition REAL 2020. This requires a simulated camera-arm-gripper robot to (a) autonomously learn to interact with objects during an intrinsic phase where it can learn how to move objects and then (b) during an extrinsic phase, to re-use the acquired knowledge to accomplish externally given goals requiring the robot to move objects to specific locations unknown during the intrinsic phase. Here we present a 'baseline architecture' for solving the challenge, provided as a baseline model for REAL 2020. Few models have all the functionalities needed to solve the REAL 2020 benchmark and none has been tested with it yet. The architecture we propose is formed by three components: (1) Abstractor: abstracting sensory input to learn relevant control variables from images; (2) Explorer: generating experience to learn goals and actions; (3) Planner: formulating and executing action plans to accomplish the externally provided goals. The architecture represents the first model to solve the simpler REAL 2020 'Round 1', which allows the use of a simple parameterised push action. On Round 2, the architecture was used with a more general action (a sequence of joint positions), again achieving higher-than-chance performance. The baseline software is well documented and available for download and use at https://github.com/AIcrowd/REAL2020_starter_kit.


Real-time Active Vision for a Humanoid Soccer Robot Using Deep Reinforcement Learning

arXiv.org Artificial Intelligence

In this paper, we present an active vision method using a deep reinforcement learning approach for a humanoid soccer-playing robot. The proposed method adaptively optimises the viewpoint of the robot to acquire the most useful landmarks for self-localisation while keeping the ball within its view. Active vision is critical for humanoid decision-maker robots with a limited field of view. To deal with the active vision problem, several probabilistic entropy-based approaches have previously been proposed, which are highly dependent on the accuracy of the self-localisation model. In this research, by contrast, we formulate the problem as an episodic reinforcement learning problem and employ a Deep Q-learning method to solve it. The proposed network requires only the raw camera images to move the robot's head toward the best viewpoint. The model achieves a competitive success rate of 80% in reaching the best viewpoint. We implemented the proposed method on a humanoid robot simulated in the Webots simulator. Our evaluations and experimental results show that the proposed method outperforms the entropy-based methods in the RoboCup context in cases with high self-localisation errors.


Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit

arXiv.org Machine Learning

We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. The existing algorithms for the MNL contextual bandit have frequentist regret guarantees of $\tilde{\mathrm{O}}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. $\kappa$ could be arbitrarily large, e.g. exponentially dependent on the model parameters, causing the existing regret guarantees to be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret. In our bounds, the $\kappa$ factor only affects the poly-log term and not the leading term of the regret bounds.
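To make the feedback model concrete, here is a small Python sketch of MNL choice probabilities for one offered assortment. It assumes the common linear-utility form with an outside "no purchase" option of utility zero, so that item $i$ in assortment $S$ is chosen with probability $\exp(x_i^\top\theta) / (1 + \sum_{j \in S} \exp(x_j^\top\theta))$. That parameterisation, and the names `mnl_choice_probs` and `expected_revenue`, are assumptions for illustration and are not taken from the abstract.

```python
import numpy as np

def mnl_choice_probs(features, theta):
    """Choice probabilities for the items in one offered assortment.

    features: array of shape (K, d), one context vector per offered item.
    theta:    array of shape (d,), the (assumed linear) preference parameter.
    Returns (p_items, p_no_purchase).
    """
    utilities = features @ theta
    # Shift by the largest utility (the outside option has utility 0)
    # so the exponentials cannot overflow.
    m = max(float(utilities.max()), 0.0)
    w = np.exp(utilities - m)
    w0 = np.exp(-m)
    denom = w0 + w.sum()
    return w / denom, w0 / denom

def expected_revenue(features, theta, revenues):
    """Expected revenue of the assortment under the MNL model."""
    p_items, _ = mnl_choice_probs(features, theta)
    return float(p_items @ revenues)

if __name__ == "__main__":
    x = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # hypothetical contexts
    theta = np.array([0.3, -0.2])                        # hypothetical parameter
    print(mnl_choice_probs(x, theta))
    print(expected_revenue(x, theta, np.array([1.0, 2.0, 1.5])))
```

Regret in this setting compares the expected revenue of the offered assortments with that of the best assortment of size at most $K$ under the true parameter.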


Skill Transfer via Partially Amortized Hierarchical Planning

arXiv.org Machine Learning

To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time while the skills it conditions on are chosen using online planning. We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks and demonstrate improved sample efficiency in single tasks as well as in transfer from one task to another, as compared to competitive baselines. Videos are available at: https://sites.google.com/view/partial-amortization-hierarchy/home


Offline Learning from Demonstrations and Unlabeled Experience

arXiv.org Machine Learning

Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.


Protecting consumers from collusive prices due to AI

Science

The efficacy of a market system is rooted in competition. In striving to attract customers, firms are led to charge lower prices and deliver better products and services. Nothing more fundamentally undermines this process than collusion, when firms agree not to compete with one another and consequently consumers are harmed by higher prices. Collusion is generally condemned by economists and policy-makers and is unlawful in almost all countries. But the increasing delegation of price-setting to algorithms ([ 1 ][1]) has the potential for opening a back door through which firms could collude lawfully ([ 2 ][2]). Such algorithmic collusion can occur when artificial intelligence (AI) algorithms learn to adopt collusive pricing rules without human intervention, oversight, or even knowledge. This possibility poses a challenge for policy. To meet this challenge, we propose a direction for policy change and call for computer scientists, economists, and legal scholars to act in concert to operationalize the proposed change. Collusion among humans typically involves three stages (see the table). First, firms' employees with price-setting authority communicate with the intent of agreeing on a collusive rule of conduct. This rule encompasses a higher price and an arrangement to incentivize firms to comply with that higher price rather than undercut it in order to pick up more market share. For example, in 1995 the CEOs of Christie's and Sotheby's hatched their plans in a limo at Kennedy International Airport, and in 1994 the U.S. Federal Bureau of Investigation secretly taped the lysine cartel as they conspired in a Maui hotel room. At those meetings, they spoke about charging higher prices and how to enforce them. Second, successful communication results in the mutual adoption of a collusive rule of conduct, which commonly takes the form of a collusive pricing rule. 
A crucial component of this pricing rule is retaliatory pricing: Each firm raises its price and maintains that higher price under the threat of a “punishment,” such as a temporary price war, should it cheat and deviate from the higher price ([ 3 ][3]). It is this threat that sustains higher prices than would arise under competition. Third, firms set the higher prices that are the consequence of having adopted those collusive pricing rules.

[Table: The process that produces higher prices]

To determine whether firms are colluding, one could look for evidence at any of the three stages. However, evidence related to the last two stages (pricing rules and higher prices) is generally regarded as insufficient to achieve the requisite level of confidence in the judicial realm. Economists know how to calculate competitive prices given demand, costs, and other relevant market conditions. But many of these factors are difficult to observe and, when observable, are challenging to measure with precision. Consequently, courts do not use the competitive price level as a benchmark to identify collusion. Likewise, it is difficult to assess whether the firms' rules of conduct are collusive because such rules are latent, residing in employees' heads. In practice, we may never observe the retaliatory lower prices from a firm that cheated, even though that response is there in the minds of the employees and it is the anticipation of such a response that sustains higher prices. In other words, we might lack the events that produce the data that could identify the collusive pricing rules. Furthermore, even if one could observe what looks like a price war, it would be difficult to rule out innocent explanations (such as a decrease in the firms' costs or a fall in demand). Given the latency of collusive pricing rules and the difficulty of determining whether prices are collusive or competitive, antitrust law and its enforcement have focused on the first stage: communications. 
Firms are found to be in violation of the law when communications (perhaps supplemented by other evidence) are sufficient to establish that firms have a “meeting of minds,” a “concurrence of wills,” or a “conscious commitment” that they will not compete ([ 4 ][5]). In the United States, more specifically, there must be evidence that one firm invited a competitor to collude and that the competitor accepted that invitation. The risk of false positives (i.e., wrongly finding firms guilty of collusion) has led courts to avoid basing their judgments on evidence of collusive pricing rules or collusive prices and instead to rely on evidence of communications. Although the use of pricing algorithms has a long history—airline companies, for instance, have been using revenue management software for decades—concerns regarding algorithmic collusion have only recently arisen for two reasons. First, pricing algorithms had once been based on pricing rules set by programmers but now often rely on AI systems that learn autonomously through active experimentation. After the programmer has set a goal, such as profit maximization, algorithms are capable of autonomously learning rules of conduct that achieve the goal, possibly with no human intervention. The enhanced sophistication of learning algorithms makes it more likely that AI systems will discover profit-enhancing collusive pricing rules, just as they have succeeded in discovering winning strategies in complex board games such as chess and Go ([ 5 ][6]). Second, a feature of online markets is that competitors' prices are available to a firm in real time. Such information is essential to the operation of collusive pricing rules. In order for firms to settle on some common higher price, firms' prices must be observed frequently enough because sustaining those higher prices requires the prospect of punishing a firm that deviates from the collusive agreement. 
The more quickly the punishment is meted out, the less temptation to cheat. Thus, the emergence and persistence of higher prices through collusion is facilitated by rapid detection of competitors' prices, which is now often possible in online markets. For example, the prices of products listed on Amazon may change several times per day but can be monitored with practically no delay. In light of these developments, concerns regarding the possibility of algorithmic collusion have been raised by government authorities, including the U.S. Federal Trade Commission (FTC) ([ 6 ][7]) and the European Commission ([ 7 ][8]). These concerns are justified, as enough evidence has accumulated that autonomous algorithmic collusion is a real risk. The evidence is both experimental and empirical. On the experimental side, recent research has found the spontaneous emergence of collusion in computer-simulated markets. In these studies, commonly used reinforcement-learning algorithms learned to initiate and sustain collusion in the context of well-accepted economic models of an industry ([ 8 ][9], [ 9 ][10]) (see the figure). Collusion arose with no human intervention other than instructing the AI-enabled learning algorithm to maximize profit (i.e., algorithms were not programmed to collude). Although the extent to which prices were higher in such virtual markets varied, prices were almost always substantially above the competitive level. On the empirical side, a recent study ([ 10 ][11]) has provided possible evidence of algorithmic collusion in Germany's retail gasoline markets. The delegation of pricing to algorithms was found to be associated with a substantial 20 to 30% increase in the markup of stations' prices over cost. 
Although the evidence is indirect (because the authors of the study could not directly observe the timing of adoption of the pricing algorithms and thus had to infer it from other data), their findings are consistent with the results of computer-simulated market experiments. Algorithmic collusion is as bad as human collusion. Consumers are harmed by the higher prices, irrespective of how firms arrive at charging these prices. However, should algorithmic collusion emerge in a market and be discovered, society lacks an effective defense to stop it. This is because algorithmic collusion does not involve the communications that have been the route to proving unlawful collusion (as distinguished from instances in which firms' employees might communicate and then collude with the assistance of algorithms, as in a recent case involving poster sellers on Amazon Marketplace). And even if alternative evidentiary approaches were to arise, there is no liability unless courts are prepared to conclude that AI has a “mind” or a “will” or is “conscious,” for otherwise there can be no “meeting of minds” with algorithmic collusion. As a result, if algorithmic collusion occurs and is discovered by the authorities, currently it cannot be considered a violation of antitrust or competition law. Society would then have no recourse and consumers would be forced to continue to suffer the harm from algorithmic collusion's higher prices.

[Figure: Collusive pricing rules uncovered. After the two algorithms have found their way to collusive prices (“learning phase,” left side), an attempt to cheat so as to gain market share is simulated by exogenously forcing Firm 1's algorithm to cut its price (“punishment phase,” right side). From the “shock” period onward, the algorithm regains control of the pricing. Firm 1's deviation is punished by the other algorithm, so firms enter into a price war that lasts for several periods and then gradually ends as the algorithms return to pricing at a collusive level. For better graphical representation, the time scales on the right and left sides of the figure are different. Graphic: N. Cary/Science, from Calvano et al. ([ 8 ][9])]

There is an alternative path, which is to target the collusive pricing rules learned by the algorithms that result in higher prices ([ 11 ][12]). These latent rules of conduct may be uncovered when they have been adopted by algorithms. Whereas a court cannot get inside the head of an employee to determine why prices are what they are, firms' pricing algorithms can be audited and tested in controlled environments. One can then simulate all sorts of possible deviations from existing prices and observe the algorithms' reaction in the absence of any confounding factor. In principle, the latent pricing rules can thus be identified precisely. This approach was successfully used by researchers in ([ 8 ][9]) to verify that the pricing algorithms have indeed learned the collusive property of reward (keeping prices high unless a price cut occurs) and punishment (through retaliatory price wars should a price cut occur). To show this, the researchers momentarily overrode the pricing algorithm of one firm, forcing it to set a lower price. As soon as the algorithms regained control of the pricing, they engaged in a temporary price war, where lower prices were charged but then gradually returned to the collusive level. Having learned that undercutting the other firm's price brings forth a price war (with the associated lower profits), the algorithms evolved to maintain high prices (see the figure). It may seem paradoxical that collusion can be identified by the low retaliatory prices, which could be close to the competitive level, rather than by the high prices that are the ultimate concern for policy. 
But there are two important differences between retaliatory price wars and healthy competition. First, in the absence of the low-price perturbation, the price war remains hypothetical in that it is a threat that is not executed. Second, the price war shown in the figure is only temporary: Instead of permanently reverting to the competitive price level, the algorithms gradually return to the pre-shock prices. This is evidence that the price war is there to support high prices, not to produce low prices. Focusing on the collusive pricing rules is the key to identifying, preventing, and prosecuting algorithmic collusion (see the table). Policy cannot target the higher prices directly, nor can it target communications as they may not be present (unlike with human collusion). But the retaliatory pricing rules may now be observable, as firms' pricing algorithms can be audited and tested. We therefore propose that antitrust policy shift its focus from communications (with humans) to rules of conduct (with algorithms). Making the proposed change operational involves a broad research program that requires the combined efforts of economists, computer scientists, and legal scholars. One strand of this program is a three-step experimental procedure. The first step creates collusion in the lab for descriptively realistic models of markets. As the competitive price would be known by the experimenter, collusion is identified by high prices. Having identified an episode of collusion, the second step is to perform a post hoc auditing exercise to uncover the properties of the collusive pricing rules that produced those high prices. Some progress has been made on the identification of collusive rules of conduct adopted by algorithms, but much more work needs to be done. Economics provides several properties to watch out for. Of course, there is the retaliatory price war discussed above, which is what existing research has focused on (8, 9). 
Another property is price matching, whereby firms' prices move in sync: one firm changing its price and the other firm subsequently matching that change. Price matching has been documented for human collusion in various markets, but we do not yet know whether algorithms are capable of learning it. A third property is the asymmetry of price responses. When firms collude, they typically respond to a competitor's price cut more strongly—as part of a punishment—than to a price increase. No such asymmetry is to be expected when firms compete. The aforementioned properties are based on economic theory and studies of human collusion. Learning algorithms may devise rules of conduct that neither economists nor managers have imagined ( just as learning algorithms have done, for instance, in chess). To investigate this possibility, computer scientists might develop algorithms that explain their own behavior, thereby making the collusive properties more apparent. One way of doing so is to add a second module to the reinforcement-learning module that maximizes profits; this second module maps the state representation of the first one onto a verbal explanation of its strategy ([ 12 ][13]). Having uncovered collusive pricing rules, the third step is to experiment with constraining the learning algorithm to prevent it from evolving to collusion. Computer scientists are particularly valuable here, given that they are involved in similar tasks such as trying to constrain algorithms so that, for instance, they do not exhibit racial and gender bias ([ 13 ][14]). Once the capacities to audit pricing algorithms for collusive properties and to constrain learning algorithms so that they do not adopt collusive pricing rules have been developed, legal scholars are called upon to use that knowledge for purposes of prosecution and prevention. One route is to make certain pricing algorithms unlawful, perhaps under Section 5 of the FTC Act, which prohibits unfair methods of competition. 
In the area of securities law, the 2017 case U.S. v. Michael Coscia made illegal the use of certain programmed trading rules and thus provides a legal precedent for prohibiting algorithms. Another path is to make firms legally responsible for the pricing rules that their learning algorithms adopt ([ 14 ][15]). Firms may then be incentivized to prevent collusion by routinely monitoring the output of their learning algorithms. These are some of the avenues that can be pursued for preventing and shutting down algorithmic collusion. There are several obstacles down the road, including the difficulty of making a collusive property test operational, the lack of transparency and interpretability of algorithms, and courts' willingness and ability to incorporate technical material of this nature. In addition, there is the challenge of addressing algorithmic collusion without giving up the efficiency gains from pricing algorithms such as the quicker response to changing market conditions. As authorities prepare to take action ([ 15 ][16]), it is vital that computer scientists, economists, and legal scholars work together to protect consumers from the potential harm of higher prices.

References

1. A. Ezrachi, M. Stucke, Virtual Competition: The Promise and Perils of the Algorithm-Driven Economy (Harvard Univ. Press, 2016).
2. S. Mehra, Minn. Law Rev. 100, 1323 (2016).
3. J. Harrington, The Theory of Collusion and Competition Policy (MIT Press, 2017).
4. L. Kaplow, Competition Policy and Price Fixing (Princeton Univ. Press, 2013).
5. D. Silver et al., Science 362, 1140 (2018).
6. “The Competition and Consumer Protection Issues of Algorithms, Artificial Intelligence, and Predictive Analytics,” Hearing on Competition and Consumer Protection in the 21st Century, U.S. Federal Trade Commission, 13–14 November 2018; www.ftc.gov/news-events/events-calendar/ftc-hearing-7-competition-consumer-protection-21st-century.
7. “Algorithms and Collusion - Note from the European Union,” OECD Roundtable, June 2017; www.oecd.org/competition/algorithms-and-collusion.htm.
8. E. Calvano, G. Calzolari, V. Denicolo, S. Pastorello, Am. Econ. Rev. 110, 3267 (2020).
9. T. Klein, “Autonomous Algorithmic Collusion: Q-Learning Under Sequential Pricing,” Amsterdam Law School Research Paper 2018-15 (2019).
10. S. Assad, R. Clark, D. Ershov, L. Xu, “Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market,” CESifo Working Paper No. 8521 (2020).
11. J. Harrington, J. Compet. Law Econ. 14, 331 (2018).
12. Z. C. Lipton, ACM Queue 16, 30 (2018).
13. P. S. Thomas et al., Science 366, 999 (2019).
14. S. Chopra, L. White, A Legal Theory for Autonomous Artificial Agents (Univ. of Michigan Press, 2011).
15. European Commission, document Ares(2020)2877634.

Acknowledgments: The paper benefited from detailed and insightful comments by three anonymous reviewers. All authors contributed equally. The authors declare no competing interests.
[1]: #ref-1 [2]: #ref-2 [3]: #ref-3 [4]: pending:yes [5]: #ref-4 [6]: #ref-5 [7]: #ref-6 [8]: #ref-7 [9]: #ref-8 [10]: #ref-9 [11]: #ref-10 [12]: #ref-11 [13]: #ref-12 [14]: #ref-13 [15]: #ref-14 [16]: #ref-15 [17]: #xref-ref-1-1 "View reference 1 in text" [18]: #xref-ref-2-1 "View reference 2 in text" [19]: {openurl}?query=rft.jtitle%253DMinn.%2BLaw%2BRev.%26rft.volume%253D100%26rft.spage%253D1323%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [20]: #xref-ref-3-1 "View reference 3 in text" [21]: #xref-ref-4-1 "View reference 4 in text" [22]: #xref-ref-5-1 "View reference 5 in text" [23]: {openurl}?query=rft.jtitle%253DScience%26rft.stitle%253DScience%26rft.aulast%253DSilver%26rft.auinit1%253DD.%26rft.volume%253D362%26rft.issue%253D6419%26rft.spage%253D1140%26rft.epage%253D1144%26rft.atitle%253DA%2Bgeneral%2Breinforcement%2Blearning%2Balgorithm%2Bthat%2Bmasters%2Bchess%252C%2Bshogi%252C%2Band%2BGo%2Bthrough%2Bself-play%26rft_id%253Dinfo%253Adoi%252F10.1126%252Fscience.aar6404%26rft_id%253Dinfo%253Apmid%252F30523106%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [24]: /lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEzOiIzNjIvNjQxOS8xMTQwIjtzOjQ6ImF0b20iO3M6MjM6Ii9zY2kvMzcwLzY1MjAvMTA0MC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30= [25]: #xref-ref-6-1 "View reference 6 in text" [26]: http://www.ftc.gov/news-events/events-calendar/ftc-hearing-7-competition-consumer-protection-21st-century [27]: #xref-ref-7-1 "View reference 7 in text" [28]: http://www.oecd.org/competition/algorithms-and-collusion.htm [29]: #xref-ref-8-1 "View 
reference 8 in text" [30]: {openurl}?query=rft.jtitle%253DAm.%2BEcon.%2BRev.%26rft.volume%253D110%26rft.spage%253D3267%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [31]: #xref-ref-9-1 "View reference 9 in text" [32]: #xref-ref-10-1 "View reference 10 in text" [33]: #xref-ref-11-1 "View reference 11 in text" [34]: {openurl}?query=rft.jtitle%253DJ.%2BCompet.%2BLaw%2BEcon.%26rft.volume%253D14%26rft.spage%253D331%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [35]: #xref-ref-12-1 "View reference 12 in text" [36]: {openurl}?query=rft.jtitle%253DACM%2BQueue%26rft.volume%253D16%26rft.spage%253D30%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [37]: #xref-ref-13-1 "View reference 13 in text" [38]: {openurl}?query=rft.jtitle%253DScience%26rft.stitle%253DScience%26rft.aulast%253DThomas%26rft.auinit1%253DP.%2BS.%26rft.volume%253D366%26rft.issue%253D6468%26rft.spage%253D999%26rft.epage%253D1004%26rft.atitle%253DPreventing%2Bundesirable%2Bbehavior%2Bof%2Bintelligent%2Bmachines%26rft_id%253Dinfo%253Adoi%252F10.1126%252Fscience.aag3311%26rft_id%253Dinfo%253Apmid%252F31754000%26rft.genre%253Darticle%26rft_val_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Ajournal%26ctx_ver%253DZ39.88-2004%26url_ver%253DZ39.88-2004%26url_ctx_fmt%253Dinfo%253Aofi%252Ffmt%253Akev%253Amtx%253Actx [39]: 
/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIzNjYvNjQ2OC85OTkiO3M6NDoiYXRvbSI7czoyMzoiL3NjaS8zNzAvNjUyMC8xMDQwLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ== [40]: #xref-ref-14-1 "View reference 14 in text" [41]: #xref-ref-15-1 "View reference 15 in text"
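The auditing idea discussed in the article above (force one algorithm to cut its price, then watch how the rival reacts once control is handed back) can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the toy rule from `make_toy_rule`, the price levels, and the `looks_retaliatory` check are illustrative stand-ins, not the learned algorithms or the test used in the cited studies.

```python
from typing import Callable, List, Tuple

# A pricing rule maps (own last price, rival's last price) to the next price.
PricingRule = Callable[[float, float], float]

def make_toy_rule(collusive: float = 1.8, competitive: float = 1.0,
                  recovery: float = 0.5) -> PricingRule:
    """Hypothetical rule: keep the high price unless someone undercuts,
    then drop toward the undercut price and drift back up gradually."""
    def rule(own_last: float, rival_last: float) -> float:
        lowest = max(competitive, min(own_last, rival_last))
        if lowest < collusive - 1e-9:
            return lowest + recovery * (collusive - lowest)
        return collusive
    return rule

def audit_deviation(rule1: PricingRule, rule2: PricingRule, start_price: float,
                    forced_price: float, horizon: int = 20) -> List[Tuple[float, float]]:
    """Override firm 1's price in the first period, then let both rules run."""
    p1 = p2 = start_price
    path = []
    for t in range(horizon):
        n1, n2 = rule1(p1, p2), rule2(p2, p1)
        if t == 0:
            n1 = forced_price  # exogenous price cut imposed by the auditor
        p1, p2 = n1, n2
        path.append((p1, p2))
    return path

def looks_retaliatory(path: List[Tuple[float, float]], start_price: float,
                      tol: float = 0.01) -> bool:
    """True if the rival cut its price after the shock and prices later
    returned to the pre-shock level (a temporary price war)."""
    rival = [p2 for _, p2 in path]
    punished = min(rival[1:]) < start_price - tol
    returned = abs(rival[-1] - start_price) <= tol
    return punished and returned

if __name__ == "__main__":
    rule = make_toy_rule()
    path = audit_deviation(rule, rule, start_price=1.8, forced_price=1.2)
    print(path[:4], looks_retaliatory(path, 1.8))
```

A rule that simply posts a fixed competitive price would show no reaction to the forced cut, so the same check returns false for it; the asymmetry and price-matching properties mentioned in the article would need additional tests.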


DeepMind Just Gave Away This AI Environment Simulator For Free

#artificialintelligence

Last week, Alphabet subsidiary DeepMind open-sourced Lab2D, which its researchers describe as a scalable environment simulator for creating 2D environments for AI and ML research. The researchers claim that it facilitates researcher-led experimentation with environment design while also helping them understand the influence of environments in multi-agent reinforcement learning. While it was built with the specific needs of multi-agent deep reinforcement learning researchers in mind, it can be used beyond that particular subfield. In this article, we take a deeper look into what DeepMind Lab2D is all about and how it can help AI researchers. As the researchers explain in the paper, DeepMind Lab2D (or "DMLab2D" for short) is a platform for the creation of two-dimensional, layered, discrete "grid-world" environments, in which the pieces -- which can be compared to chess pieces on a chessboard -- move around. This system is particularly tailored for multi-agent reinforcement learning.


Reinforcement Learning for Robust Missile Autopilot Design

arXiv.org Artificial Intelligence

Designing missile autopilot controllers has been a complex task, given the extensive flight envelope and the nonlinear flight dynamics. A solution that can excel both in nominal performance and in robustness to uncertainties is still to be found. While Control Theory often leads to parameter scheduling procedures, Reinforcement Learning has presented interesting results in ever more complex tasks, going from videogames to robotic tasks with continuous action domains. However, it still lacks clear insights on how to find adequate reward functions and exploration strategies. To the best of our knowledge, this work is a pioneer in proposing Reinforcement Learning as a framework for flight control. In fact, it aims at training a model-free agent that can control the longitudinal flight of a missile, achieving optimal performance and robustness to uncertainties. To that end, under TRPO's methodology, the collected experience is augmented according to HER, stored in a replay buffer and sampled according to its significance. Not only does this work enhance the concept of prioritized experience replay into BPER, but it also reformulates HER, activating them both only when the training progress converges to suboptimal policies, in what is proposed as the SER methodology. In addition, the Reward Engineering process is carefully detailed. The results show that it is possible both to achieve the optimal performance and to improve the agent's robustness to uncertainties (with little loss of nominal performance) by further training it in non-nominal environments, therefore validating the proposed approach and encouraging future research in this field.