A learning perspective on the emergence of abstractions: the curious case of phonemes Machine Learning

In the present paper we use a range of modeling techniques to investigate whether an abstract phone could emerge from exposure to speech sounds. We test two opposing principles regarding the development of language knowledge in linguistically untrained language users: Memory-Based Learning (MBL) and Error-Correction Learning (ECL). A process of generalization underlies the abstractions linguists operate with, and we probed whether MBL and ECL could give rise to a type of language knowledge that resembles linguistic abstractions. Each model was presented with a significant amount of pre-processed speech produced by one speaker. We assessed the consistency or stability of what the models have learned and their ability to give rise to abstract categories. Both types of models fare differently with regard to these tests. We show that ECL learning models can learn abstractions and that at least part of the phone inventory can be reliably identified from the input.

A Reinforcement Learning Formulation of the Lyapunov Optimization: Application to Edge Computing Systems with Queue Stability Artificial Intelligence

In this paper, a deep reinforcement learning (DRL)-based approach to the Lyapunov optimization is considered to minimize the time-average penalty while maintaining queue stability. A proper construction of state and action spaces is provided to form a proper Markov decision process (MDP) for the Lyapunov optimization. A condition for the reward function of reinforcement learning (RL) for queue stability is derived. Based on the analysis and practical RL with reward discounting, a class of reward functions is proposed for the DRL-based approach to the Lyapunov optimization. The proposed DRL-based approach to the Lyapunov optimization does not require complicated optimization at each time step and operates with general non-convex and discontinuous penalty functions. Hence, it provides an alternative to the conventional drift-plus-penalty (DPP) algorithm for the Lyapunov optimization. The proposed DRL-based approach is applied to resource allocation in edge computing systems with queue stability, and numerical results demonstrate its successful operation.
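
For orientation, the conventional drift-plus-penalty (DPP) setup that the abstract contrasts against is usually written as below. This is the standard textbook form, not taken from the paper itself: the symbols $Q_i(t)$ (queue backlog), $a_i(t)$ (arrivals), $b_i(t)$ (service), $p(t)$ (penalty), and $V$ (trade-off weight) are assumed notation.

```latex
% Assumed notation; the abstract does not define these symbols.
\begin{align}
  Q_i(t+1) &= \max\{Q_i(t) - b_i(t),\, 0\} + a_i(t)
    && \text{(queue dynamics)} \\
  L(t) &= \tfrac{1}{2} \sum_i Q_i(t)^2
    && \text{(quadratic Lyapunov function)} \\
  \Delta(t) &= \mathbb{E}\left[ L(t+1) - L(t) \mid Q(t) \right]
    && \text{(conditional Lyapunov drift)} \\
  \text{DPP:}\quad & \text{at each } t, \text{ choose the action minimizing a bound on }
    \Delta(t) + V\, \mathbb{E}\left[ p(t) \mid Q(t) \right]
\end{align}
```

A larger $V$ puts more weight on the penalty and less on keeping the queue backlogs small.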

Active Hierarchical Imitation and Reinforcement Learning Artificial Intelligence

Humans can leverage hierarchical structures to split a task into sub-tasks and solve problems efficiently. Both imitation and reinforcement learning, or a combination of them with hierarchical structures, have been proven to be an efficient way for robots to learn complex tasks with sparse rewards. However, in previous work on hierarchical imitation and reinforcement learning, the tested environments are relatively simple 2D games, and the action spaces are discrete. Furthermore, many imitation learning works focus on improving policies learned from expert policies that are hard-coded or trained by reinforcement learning algorithms, rather than from human experts. In human-robot interaction scenarios, humans can be required to provide demonstrations to teach the robot, so it is crucial to improve learning efficiency to reduce expert effort, and to understand humans' perception of the learning/training process. In this project, we explored different imitation learning algorithms and designed active learning algorithms upon the hierarchical imitation and reinforcement learning framework we have developed. We performed an experiment where five participants were asked to guide a randomly initialized agent to a random goal in a maze. Our experimental results showed that using DAgger and a reward-based active learning method can achieve better performance while reducing the physical and mental effort required of humans during the training process.

Emergence of Different Modes of Tool Use in a Reaching and Dragging Task Artificial Intelligence

Tool use is an important milestone in the evolution of intelligence. In this paper, we investigate different modes of tool use that emerge in a reaching and dragging task. In this task, a jointed arm with a gripper must grab a tool (T, I, or L-shaped) and drag an object down to the target location (the bottom of the arena). The simulated environment had real physics such as gravity and friction. We trained a deep-reinforcement learning based controller (with raw visual and proprioceptive input) with minimal reward shaping information to tackle this task. We observed the emergence of a wide range of unexpected behaviors, not directly encoded in the motor primitives or reward functions. Examples include hitting the object to the target location, correcting error of initial contact, throwing the tool toward the object, as well as normal expected behavior such as wide sweep. Also, we further analyzed these behaviors based on the type of tool and the initial position of the target object. Our results show a rich repertoire of behaviors, beyond the basic built-in mechanisms of the deep reinforcement learning method we used.

DERAIL: Diagnostic Environments for Reward And Imitation Learning Artificial Intelligence

The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case-study into a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at .

A Multi-intersection Vehicular Cooperative Control based on End-Edge-Cloud Computing Artificial Intelligence

Cooperative Intelligent Transportation Systems (C-ITS) will change the modes of road safety and traffic management, especially at intersections without traffic lights, namely unsignalized intersections. Existing research focuses on vehicle control within a small area around an unsignalized intersection. In this paper, we expand the control domain to a large area with multiple intersections. In particular, we propose a Multi-intersection Vehicular Cooperative Control (MiVeCC) scheme to enable cooperation among vehicles in a large area with multiple unsignalized intersections. First, a vehicular end-edge-cloud computing framework is proposed to facilitate end-edge-cloud vertical cooperation and horizontal cooperation among vehicles. Then, the vehicular cooperative control problems in the cloud and edge layers are formulated as a Markov Decision Process (MDP) and solved by two-stage reinforcement learning. Furthermore, to deal with high-density traffic, vehicle selection methods are proposed to reduce the state space and accelerate algorithm convergence without performance degradation. A multi-intersection simulation platform is developed to evaluate the proposed scheme. Simulation results show that the proposed MiVeCC can improve travel efficiency at multiple intersections by up to 4.59 times without collision compared with existing methods.

Connecting Context-specific Adaptation in Humans to Meta-learning Artificial Intelligence

Cognitive control, the ability of a system to adapt to the demands of a task, is an integral part of cognition. A widely accepted fact about cognitive control is that it is context-sensitive: Adults and children alike infer information about a task's demands from contextual cues and use these inferences to learn from ambiguous cues. However, the precise way in which people use contextual cues to guide adaptation to a new task remains poorly understood. This work connects the context-sensitive nature of cognitive control to a method for meta-learning with context-conditioned adaptation. We begin by identifying an essential difference between human learning and current approaches to meta-learning: In contrast to humans, existing meta-learning algorithms do not make use of task-specific contextual cues but instead rely exclusively on online feedback in the form of task-specific labels or rewards. To remedy this, we introduce a framework for using contextual information about a task to guide the initialization of task-specific models before adaptation to online feedback. We show how context-conditioned meta-learning can capture human behavior in a cognitive task and how it can be scaled to improve the speed of learning in various settings, including few-shot classification and low-sample reinforcement learning. Our work demonstrates that guiding meta-learning with task information can capture complex, human-like behavior, thereby deepening our understanding of cognitive control.

Deep Reinforcement Learning for Crowdsourced Urban Delivery: System States Characterization, Heuristics-guided Action Choice, and Rule-Interposing Integration Artificial Intelligence

This paper investigates the problem of assigning shipping requests to ad hoc couriers in the context of crowdsourced urban delivery. The shipping requests are spatially distributed, each with a limited time window between the earliest time for pickup and the latest time for delivery. The ad hoc couriers, termed crowdsourcees, also have limited time availability and carrying capacity. We propose a new deep reinforcement learning (DRL)-based approach to tackling this assignment problem. A deep Q network (DQN) algorithm is trained with two salient features, experience replay and a target network, that enhance the efficiency, convergence, and stability of DRL training. More importantly, this paper makes three methodological contributions: 1) presenting a comprehensive and novel characterization of crowdshipping system states that encompasses spatial-temporal and capacity information of crowdsourcees and requests; 2) embedding heuristics that leverage the information offered by the state representation and are based on intuitive reasoning to guide specific actions to take, to preserve tractability and enhance efficiency of training; and 3) integrating rule-interposing to prevent repeated visiting of the same routes and node sequences during routing improvement, thereby further enhancing the training efficiency by accelerating learning. The effectiveness of the proposed approach is demonstrated through extensive numerical analysis. The results show the benefits brought by the heuristics-guided action choice and rule-interposing in DRL training, and the superiority of the proposed approach over existing heuristics in solution quality, time, and scalability. Besides the potential to improve the efficiency of crowdshipping operation planning, the proposed approach also provides a new avenue and generic framework for other problems in the vehicle routing context.
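
The two DQN features named in the abstract, experience replay and a target network, can be sketched generically as follows. This is a minimal illustration under assumptions, not the paper's implementation: the class and function names (`ReplayBuffer`, `td_target`, `sync_target`) and the dictionary-of-parameters representation are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bootstrapped target; next_q_values come from the *target* network."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def sync_target(online_params, target_params):
    """Copy the online network's parameters into the target network."""
    target_params.clear()
    target_params.update(online_params)
```

In a training loop, `sync_target` would typically be called only every fixed number of steps, so that the targets produced by `td_target` change slowly while the online network is updated.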

Protecting consumers from collusive prices due to AI


The efficacy of a market system is rooted in competition. In striving to attract customers, firms are led to charge lower prices and deliver better products and services. Nothing more fundamentally undermines this process than collusion, when firms agree not to compete with one another and consequently consumers are harmed by higher prices. Collusion is generally condemned by economists and policy-makers and is unlawful in almost all countries. But the increasing delegation of price-setting to algorithms (1) has the potential for opening a back door through which firms could collude lawfully (2). Such algorithmic collusion can occur when artificial intelligence (AI) algorithms learn to adopt collusive pricing rules without human intervention, oversight, or even knowledge. This possibility poses a challenge for policy. To meet this challenge, we propose a direction for policy change and call for computer scientists, economists, and legal scholars to act in concert to operationalize the proposed change. Collusion among humans typically involves three stages (see the table). First, firms' employees with price-setting authority communicate with the intent of agreeing on a collusive rule of conduct. This rule encompasses a higher price and an arrangement to incentivize firms to comply with that higher price rather than undercut it in order to pick up more market share. For example, in 1995 the CEOs of Christie's and Sotheby's hatched their plans in a limo at Kennedy International Airport, and in 1994 the U.S. Federal Bureau of Investigation secretly taped the lysine cartel as they conspired in a Maui hotel room. At those meetings, they spoke about charging higher prices and how to enforce them. Second, successful communication results in the mutual adoption of a collusive rule of conduct, which commonly takes the form of a collusive pricing rule.
A crucial component of this pricing rule is retaliatory pricing: Each firm raises its price and maintains that higher price under the threat of a “punishment,” such as a temporary price war, should it cheat and deviate from the higher price (3). It is this threat that sustains higher prices than would arise under competition. Third, firms set the higher prices that are the consequence of having adopted those collusive pricing rules.

[Figure: The process that produces higher prices]

To determine whether firms are colluding, one could look for evidence at any of the three stages. However, evidence related to the last two stages (pricing rules and higher prices) is generally regarded as insufficient to achieve the requisite level of confidence in the judicial realm. Economists know how to calculate competitive prices given demand, costs, and other relevant market conditions. But many of these factors are difficult to observe and, when observable, are challenging to measure with precision. Consequently, courts do not use the competitive price level as a benchmark to identify collusion. Likewise, it is difficult to assess whether the firms' rules of conduct are collusive because such rules are latent, residing in employees' heads. In practice, we may never observe the retaliatory lower prices from a firm that cheated, even though that response is there in the minds of the employees and it is the anticipation of such a response that sustains higher prices. In other words, we might lack the events that produce the data that could identify the collusive pricing rules. Furthermore, even if one could observe what looks like a price war, it would be difficult to rule out innocent explanations (such as a decrease in the firms' costs or a fall in demand). Given the latency of collusive pricing rules and the difficulty of determining whether prices are collusive or competitive, antitrust law and its enforcement have focused on the first stage: communications.
Firms are found to be in violation of the law when communications (perhaps supplemented by other evidence) are sufficient to establish that firms have a “meeting of minds,” a “concurrence of wills,” or a “conscious commitment” that they will not compete (4). In the United States, more specifically, there must be evidence that one firm invited a competitor to collude and that the competitor accepted that invitation. The risk of false positives (i.e., wrongly finding firms guilty of collusion) has led courts to avoid basing their judgments on evidence of collusive pricing rules or collusive prices and instead to rely on evidence of communications. Although the use of pricing algorithms has a long history (airline companies, for instance, have been using revenue management software for decades), concerns regarding algorithmic collusion have only recently arisen for two reasons. First, pricing algorithms had once been based on pricing rules set by programmers but now often rely on AI systems that learn autonomously through active experimentation. After the programmer has set a goal, such as profit maximization, algorithms are capable of autonomously learning rules of conduct that achieve the goal, possibly with no human intervention. The enhanced sophistication of learning algorithms makes it more likely that AI systems will discover profit-enhancing collusive pricing rules, just as they have succeeded in discovering winning strategies in complex board games such as chess and Go (5). Second, a feature of online markets is that competitors' prices are available to a firm in real time. Such information is essential to the operation of collusive pricing rules. In order for firms to settle on some common higher price, firms' prices must be observed frequently enough because sustaining those higher prices requires the prospect of punishing a firm that deviates from the collusive agreement.
The more quickly the punishment is meted out, the less temptation to cheat. Thus, the emergence and persistence of higher prices through collusion is facilitated by rapid detection of competitors' prices, which is now often possible in online markets. For example, the prices of products listed on Amazon may change several times per day but can be monitored with practically no delay. In light of these developments, concerns regarding the possibility of algorithmic collusion have been raised by government authorities, including the U.S. Federal Trade Commission (FTC) (6) and the European Commission (7). These concerns are justified, as enough evidence has accumulated that autonomous algorithmic collusion is a real risk. The evidence is both experimental and empirical. On the experimental side, recent research has found the spontaneous emergence of collusion in computer-simulated markets. In these studies, commonly used reinforcement-learning algorithms learned to initiate and sustain collusion in the context of well-accepted economic models of an industry (8, 9) (see the figure). Collusion arose with no human intervention other than instructing the AI-enabled learning algorithm to maximize profit (i.e., algorithms were not programmed to collude). Although the extent to which prices were higher in such virtual markets varied, prices were almost always substantially above the competitive level. On the empirical side, a recent study (10) has provided possible evidence of algorithmic collusion in Germany's retail gasoline markets. The delegation of pricing to algorithms was found to be associated with a substantial 20 to 30% increase in the markup of stations' prices over cost.
Although the evidence is indirect, because the authors of the study could not directly observe the timing of adoption of the pricing algorithms and thus had to infer it from other data, their findings are consistent with the results of computer-simulated market experiments. Algorithmic collusion is as bad as human collusion. Consumers are harmed by the higher prices, irrespective of how firms arrive at charging these prices. However, should algorithmic collusion emerge in a market and be discovered, society lacks an effective defense to stop it. This is because algorithmic collusion does not involve the communications that have been the route to proving unlawful collusion (as distinguished from instances in which firms' employees might communicate and then collude with the assistance of algorithms, as in a recent case involving poster sellers on Amazon Marketplace). And even if alternative evidentiary approaches were to arise, there is no liability unless courts are prepared to conclude that AI has a “mind” or a “will” or is “conscious,” for otherwise there can be no “meeting of minds” with algorithmic collusion. As a result, if algorithmic collusion occurs and is discovered by the authorities, currently it cannot be considered a violation of antitrust or competition law. Society would then have no recourse and consumers would be forced to continue to suffer the harm from algorithmic collusion's higher prices.

[Figure: Collusive pricing rules uncovered. After the two algorithms have found their way to collusive prices (“learning phase,” left side), an attempt to cheat so as to gain market share is simulated by exogenously forcing Firm 1's algorithm to cut its price (“punishment phase,” right side). From the “shock” period onward, the algorithm regains control of the pricing. Firm 1's deviation is punished by the other algorithm, so firms enter into a price war that lasts for several periods and then gradually ends as the algorithms return to pricing at a collusive level. For better graphical representation, the time scales on the right and left sides of the figure are different. Graphic: N. Cary/Science, from Calvano et al. (8)]

There is an alternative path, which is to target the collusive pricing rules learned by the algorithms that result in higher prices (11). These latent rules of conduct may be uncovered when they have been adopted by algorithms. Whereas a court cannot get inside the head of an employee to determine why prices are what they are, firms' pricing algorithms can be audited and tested in controlled environments. One can then simulate all sorts of possible deviations from existing prices and observe the algorithms' reaction in the absence of any confounding factor. In principle, the latent pricing rules can thus be identified precisely. This approach was successfully used by researchers in (8) to verify that the pricing algorithms have indeed learned the collusive property of reward (keeping prices high unless a price cut occurs) and punishment (through retaliatory price wars should a price cut occur). To show this, the researchers momentarily overrode the pricing algorithm of one firm, forcing it to set a lower price. As soon as the algorithms regained control of the pricing, they engaged in a temporary price war, where lower prices were charged but then gradually returned to the collusive level. Having learned that undercutting the other firm's price brings forth a price war (with the associated lower profits), the algorithms evolved to maintain high prices (see the figure). It may seem paradoxical that collusion can be identified by the low retaliatory prices, which could be close to the competitive level, rather than by the high prices that are the ultimate concern for policy.
But there are two important differences between retaliatory price wars and healthy competition. First, in the absence of the low-price perturbation, the price war remains hypothetical in that it is a threat that is not executed. Second, the price war shown in the figure is only temporary: Instead of permanently reverting to the competitive price level, the algorithms gradually return to the pre-shock prices. This is evidence that the price war is there to support high prices, not to produce low prices. Focusing on the collusive pricing rules is the key to identifying, preventing, and prosecuting algorithmic collusion (see the table). Policy cannot target the higher prices directly, nor can it target communications as they may not be present (unlike with human collusion). But the retaliatory pricing rules may now be observable, as firms' pricing algorithms can be audited and tested. We therefore propose that antitrust policy shift its focus from communications (with humans) to rules of conduct (with algorithms). Making the proposed change operational involves a broad research program that requires the combined efforts of economists, computer scientists, and legal scholars. One strand of this program is a three-step experimental procedure. The first step creates collusion in the lab for descriptively realistic models of markets. As the competitive price would be known by the experimenter, collusion is identified by high prices. Having identified an episode of collusion, the second step is to perform a post hoc auditing exercise to uncover the properties of the collusive pricing rules that produced those high prices. Some progress has been made on the identification of collusive rules of conduct adopted by algorithms, but much more work needs to be done. Economics provides several properties to watch out for. Of course, there is the retaliatory price war discussed above, which is what existing research has focused on (8, 9). 
Another property is price matching, whereby firms' prices move in sync: one firm changing its price and the other firm subsequently matching that change. Price matching has been documented for human collusion in various markets, but we do not yet know whether algorithms are capable of learning it. A third property is the asymmetry of price responses. When firms collude, they typically respond to a competitor's price cut more strongly (as part of a punishment) than to a price increase. No such asymmetry is to be expected when firms compete. The aforementioned properties are based on economic theory and studies of human collusion. Learning algorithms may devise rules of conduct that neither economists nor managers have imagined (just as learning algorithms have done, for instance, in chess). To investigate this possibility, computer scientists might develop algorithms that explain their own behavior, thereby making the collusive properties more apparent. One way of doing so is to add a second module to the reinforcement-learning module that maximizes profits; this second module maps the state representation of the first one onto a verbal explanation of its strategy (12). Having uncovered collusive pricing rules, the third step is to experiment with constraining the learning algorithm to prevent it from evolving to collusion. Computer scientists are particularly valuable here, given that they are involved in similar tasks such as trying to constrain algorithms so that, for instance, they do not exhibit racial and gender bias (13). Once the capacities to audit pricing algorithms for collusive properties and to constrain learning algorithms so that they do not adopt collusive pricing rules have been developed, legal scholars are called upon to use that knowledge for purposes of prosecution and prevention. One route is to make certain pricing algorithms unlawful, perhaps under Section 5 of the FTC Act, which prohibits unfair methods of competition.
In the area of securities law, the 2017 case U.S. v. Michael Coscia made illegal the use of certain programmed trading rules and thus provides a legal precedent for prohibiting algorithms. Another path is to make firms legally responsible for the pricing rules that their learning algorithms adopt (14). Firms may then be incentivized to prevent collusion by routinely monitoring the output of their learning algorithms. These are some of the avenues that can be pursued for preventing and shutting down algorithmic collusion. There are several obstacles down the road, including the difficulty of making a collusive property test operational, the lack of transparency and interpretability of algorithms, and courts' willingness and ability to incorporate technical material of this nature. In addition, there is the challenge of addressing algorithmic collusion without giving up the efficiency gains from pricing algorithms such as the quicker response to changing market conditions. As authorities prepare to take action (15), it is vital that computer scientists, economists, and legal scholars work together to protect consumers from the potential harm of higher prices.

References
1. A. Ezrachi, M. Stucke, Virtual Competition: The Promise and Perils of the Algorithm-Driven Economy (Harvard Univ. Press, 2016).
2. S. Mehra, Minn. Law Rev. 100, 1323 (2016).
3. J. Harrington, The Theory of Collusion and Competition Policy (MIT Press, 2017).
4. L. Kaplow, Competition Policy and Price Fixing (Princeton Univ. Press, 2013).
5. D. Silver et al., Science 362, 1140 (2018).
6. “The Competition and Consumer Protection Issues of Algorithms, Artificial Intelligence, and Predictive Analytics,” Hearing on Competition and Consumer Protection in the 21st Century, U.S. Federal Trade Commission, 13–14 November 2018.
7. “Algorithms and Collusion - Note from the European Union,” OECD Roundtable, June 2017.
8. E. Calvano, G. Calzolari, V. Denicolo, S. Pastorello, Am. Econ. Rev. 110, 3267 (2020).
9. T. Klein, “Autonomous Algorithmic Collusion: Q-Learning Under Sequential Pricing,” Amsterdam Law School Research Paper 2018-15 (2019).
10. S. Assad, R. Clark, D. Ershov, L. Xu, “Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market,” CESifo Working Paper No. 8521 (2020).
11. J. Harrington, J. Compet. Law Econ. 14, 331 (2018).
12. Z. C. Lipton, ACM Queue 16, 30 (2018).
13. P. S. Thomas et al., Science 366, 999 (2019).
14. S. Chopra, L. White, A Legal Theory for Autonomous Artificial Agents (Univ. of Michigan Press, 2011).
15. European Commission, document Ares(2020)2877634.

Acknowledgments: The paper benefited from detailed and insightful comments by three anonymous reviewers. All authors contributed equally. The authors declare no competing interests.
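
The auditing test described in the article above (momentarily forcing one firm's algorithm to cut its price, then observing whether a temporary price war follows and prices return to the pre-shock level) can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of Calvano et al.: the pricing-rule interface, the hand-written rules `punish_then_forgive` and `always_low`, and the price levels `HIGH`, `LOW`, and `STEP` are all hypothetical stand-ins for learned algorithms.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a pricing rule maps (own last price, rival's last
# price) to the firm's next price.
PricingRule = Callable[[float, float], float]

HIGH, LOW, STEP = 10.0, 6.0, 1.0  # illustrative price levels, not from the article


def punish_then_forgive(own_last: float, rival_last: float) -> float:
    """Hand-written stand-in for a learned collusive pricing rule."""
    if min(own_last, rival_last) >= HIGH:
        return HIGH                       # nobody deviated: keep the high price
    if own_last != rival_last:
        return LOW                        # a deviation was observed: price war
    return min(HIGH, own_last + STEP)     # prices matched: climb back together


def always_low(own_last: float, rival_last: float) -> float:
    """Stand-in for a rule that ignores the rival and always prices low."""
    return LOW


def impulse_response(rule_1: PricingRule, rule_2: PricingRule, start: float,
                     forced: float, periods: int = 12) -> List[Tuple[float, float]]:
    """Force firm 1 to `forced` for one period, then let both rules run."""
    path = [(forced, rule_2(start, start))]  # shock period
    for _ in range(periods):
        p1, p2 = path[-1]
        path.append((rule_1(p1, p2), rule_2(p2, p1)))
    return path


def audit(path: List[Tuple[float, float]], start: float,
          tol: float = 1e-9) -> Dict[str, bool]:
    """Check for retaliation by the rival and a return to pre-shock prices."""
    rival_prices = [p2 for _, p2 in path]
    punished = min(rival_prices) < start - tol
    reverted = all(abs(p - start) <= tol for p in path[-1])
    return {"punished": punished, "reverted": reverted}
```

Under these assumptions, a rule is flagged only when both checks hold: the rival cuts its price after the forced deviation, and prices then return to the pre-shock level. A rule that never retaliates, such as `always_low`, passes the second check trivially but fails the first.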

Reinforcement Learning for Robust Missile Autopilot Design Artificial Intelligence

Designing missiles' autopilot controllers has been a complex task, given the extensive flight envelope and the nonlinear flight dynamics. A solution that can excel both in nominal performance and in robustness to uncertainties is still to be found. While Control Theory often resorts to parameter scheduling procedures, Reinforcement Learning has presented interesting results in ever more complex tasks, ranging from videogames to robotic tasks with continuous action domains. However, it still lacks clear insights on how to find adequate reward functions and exploration strategies. To the best of our knowledge, this work is a pioneer in proposing Reinforcement Learning as a framework for flight control. In fact, it aims at training a model-free agent that can control the longitudinal flight of a missile, achieving optimal performance and robustness to uncertainties. To that end, under TRPO's methodology, the collected experience is augmented according to HER, stored in a replay buffer and sampled according to its significance. Not only does this work enhance the concept of prioritized experience replay into BPER, but it also reformulates HER, activating them both only when the training progress converges to suboptimal policies, in what is proposed as the SER methodology. Besides, the Reward Engineering process is carefully detailed. The results show that it is possible both to achieve the optimal performance and to improve the agent's robustness to uncertainties (with little damage to nominal performance) by further training it in non-nominal environments, therefore validating the proposed approach and encouraging future research in this field.