Goto

Collaborating Authors

 Wu, I-Chen


Learning Sim-to-Real Dense Object Descriptors for Robotic Manipulation

arXiv.org Artificial Intelligence

It is crucial to address the following issues for ubiquitous robotics manipulation applications: (a) vision-based manipulation tasks require the robot to visually learn and understand the object with rich information like dense object descriptors; and (b) sim-to-real transfer in robotics aims to close the gap between simulated and real data. In this paper, we present Sim-to-Real Dense Object Nets (SRDONs), a dense object descriptor that not only understands the object via appropriate representation but also maps simulated and real data to a unified feature space with pixel consistency. We proposed an object-to-object matching method for image pairs from different scenes and different domains. This method helps reduce the effort of training data from real-world by taking advantage of public datasets, such as GraspNet. With sim-to-real object representation consistency, our SRDONs can serve as a building block for a variety of sim-to-real manipulation tasks. We demonstrate in experiments that pre-trained SRDONs significantly improve performances on unseen objects and unseen visual environments for various robotic tasks with zero real-world training.


A Local-Pattern Related Look-Up Table

arXiv.org Artificial Intelligence

This paper describes a Relevance-Zone pattern table (RZT) that can be used to replace a traditional transposition table. An RZT stores exact game values for patterns that are discovered during a Relevance-Zone-Based Search (RZS), which is the current state-of-the-art in solving L&D problems in Go. Positions that share the same pattern can reuse the same exact game value in the RZT. The pattern matching scheme for RZTs is implemented using a radix tree, taking into consideration patterns with different shapes. To improve the efficiency of table lookups, we designed a heuristic that prevents redundant lookups. The heuristic can safely skip previously queried patterns for a given position, reducing the overhead to 10% of the original cost. We also analyze the time complexity of the RZT both theoretically and empirically. Experiments show the overhead of traversing the radix tree in practice during lookup remain flat logarithmically in relation to the number of entries stored in the table. Experiments also show that the use of an RZT instead of a traditional transposition table significantly reduces the number of searched nodes on two data sets of 7x7 and 19x19 L&D Go problems.


A Novel Approach to Solving Goal-Achieving Problems for Board Games

arXiv.org Artificial Intelligence

Goal-achieving problems are puzzles that set up a specific situation with a clear objective. An example that is well-studied is the category of life-and-death (L&D) problems for Go, which helps players hone their skill of identifying region safety. Many previous methods like lambda search try null moves first, then derive so-called relevance zones (RZs), outside of which the opponent does not need to search. This paper first proposes a novel RZ-based approach, called the RZ-Based Search (RZS), to solving L&D problems for Go. RZS tries moves before determining whether they are null moves post-hoc. This means we do not need to rely on null move heuristics, resulting in a more elegant algorithm, so that it can also be seamlessly incorporated into AlphaZero's super-human level play in our solver. To repurpose AlphaZero for solving, we also propose a new training method called Faster to Life (FTL), which modifies AlphaZero to entice it to win more quickly. We use RZS and FTL to solve L&D problems on Go, namely solving 68 among 106 problems from a professional L&D book while a previous program solves 11 only. Finally, we discuss that the approach is generic in the sense that RZS is applicable to solving many other goal-achieving problems for board games.


Optimistic Temporal Difference Learning for 2048

arXiv.org Artificial Intelligence

Temporal difference (TD) learning and its variants, such as multistage TD (MS-TD) learning and temporal coherence (TC) learning, have been successfully applied to 2048. These methods rely on the stochasticity of the environment of 2048 for exploration. In this paper, we propose to employ optimistic initialization (OI) to encourage exploration for 2048, and empirically show that the learning quality is significantly improved. This approach optimistically initializes the feature weights to very large values. Since weights tend to be reduced once the states are visited, agents tend to explore those states which are unvisited or visited few times. Our experiments show that both TD and TC learning with OI significantly improve the performance. As a result, the network size required to achieve the same performance is significantly reduced. With additional tunings such as expectimax search, multistage learning, and tile-downgrading technique, our design achieves the state-of-the-art performance, namely an average score of 625 377 and a rate of 72% reaching 32768 tiles. In addition, for sufficiently large tests, 65536 tiles are reached at a rate of 0.02%.


An Unsupervised Video Game Playstyle Metric via State Discretization

arXiv.org Artificial Intelligence

On playing video games, different players usually have their own playstyles. Recently, there have been great improvements for the video game AIs on the playing strength. However, past researches for analyzing the behaviors of players still used heuristic rules or the behavior features with the game-environment support, thus being exhausted for the developers to define the features of discriminating various playstyles. In this paper, we propose the first metric for video game playstyles directly from the game observations and actions, without any prior specification on the playstyle in the target game. Our proposed method is built upon a novel scheme of learning discrete representations that can map game observations into latent discrete states, such that playstyles can be exhibited from these discrete states. Namely, we measure the playstyle distance based on game observations aligned to the same states. We demonstrate high playstyle accuracy of our metric in experiments on some video game platforms, including TORCS, RGSK, and seven Atari games, and for different agents including rule-based AI bots, learning-based AI bots, and human players.


Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search

arXiv.org Artificial Intelligence

Monte Carlo tree search (MCTS) has achieved state-of-the-art results in many domains such as Go and Atari games when combining with deep neural networks (DNNs). When more simulations are executed, MCTS can achieve higher performance but also requires enormous amounts of CPU and GPU resources. However, not all states require a long searching time to identify the best action that the agent can find. For example, in 19x19 Go and NoGo, we found that for more than half of the states, the best action predicted by DNN remains unchanged even after searching 2 minutes. This implies that a significant amount of resources can be saved if we are able to stop the searching earlier when we are confident with the current searching result. In this paper, we propose to achieve this goal by predicting the uncertainty of the current searching status and use the result to decide whether we should stop searching. With our algorithm, called Dynamic Simulation MCTS (DS-MCTS), we can speed up a NoGo agent trained by AlphaZero 2.5 times faster while maintaining a similar winning rate. Also, under the same average simulation count, our method can achieve a 61% winning rate against the original program.


Sim-To-Real Transfer for Miniature Autonomous Car Racing

arXiv.org Artificial Intelligence

Sim-to-real, a term that describes where a model is trained in a simulator then transferred to the real world, is a technique that enables faster deep reinforcement learning (DRL) training. However, differences between the simulator and the real world often cause the model to perform poorly in the real world. Domain randomization is a way to bridge the sim-to-real gap by exposing the model to a wide range of scenarios so that it can generalize to real-world situations. However, following domain randomization to train an autonomous car racing model with DRL can lead to undesirable outcomes. Namely, a model trained with randomization tends to run slower; a higher completion rate on the testing track comes at the expense of longer lap times. This paper aims to boost the robustness of a trained race car model without compromising racing lap times. For a training track and a testing track having the same shape (and same optimal paths), but with different lighting, background, etc., we first train a model (teacher model) that overfits the training track, moving along a near optimal path. We then use this model to teach a student model the correct actions along with randomization. With our method, a model with 18.4\% completion rate on the testing track is able to help teach a student model with 52\% completion. Moreover, over an average of 50 trials, the student is able to finish a lap 0.23 seconds faster than the teacher. This 0.23 second gap is significant in tight races, with lap times of about 10 to 12 seconds.


Multiple Policy Value Monte Carlo Tree Search

arXiv.org Artificial Intelligence

Many of the strongest game playing programs use a combination of Monte Carlo tree search (MCTS) and deep neural networks (DNN), where the DNNs are used as policy or value evaluators. Given a limited budget, such as online playing or during the self-play phase of AlphaZero (AZ) training, a balance needs to be reached between accurate state estimation and more MCTS simulations, both of which are critical for a strong game playing agent. Typically, larger DNNs are better at generalization and accurate evaluation, while smaller DNNs are less costly, and therefore can lead to more MCTS simulations and bigger search trees with the same budget. This paper introduces a new method called the multiple policy value MCTS (MPV-MCTS), which combines multiple policy value neural networks (PV-NNs) of various sizes to retain advantages of each network, where two PV-NNs f_S and f_L are used in this paper. We show through experiments on the game NoGo that a combined f_S and f_L MPV-MCTS outperforms single PV-NN with policy value MCTS, called PV-MCTS. Additionally, MPV-MCTS also outperforms PV-MCTS for AZ training.


Towards Combining On-Off-Policy Methods for Real-World Applications

arXiv.org Machine Learning

In this paper, we point out a fundamental property of the objective in reinforcement learning, with which we can reformulate the policy gradient objective into a perceptron-like loss function, removing the need to distinguish between on and off policy training. Namely, we posit that it is sufficient to only update a policy $\pi$ for cases that satisfy the condition $A(\frac{\pi}{\mu}-1)\leq0$, where $A$ is the advantage, and $\mu$ is another policy. Furthermore, we show via theoretic derivation that a perceptron-like loss function matches the clipped surrogate objective for PPO. With our new formulation, the policies $\pi$ and $\mu$ can be arbitrarily apart in theory, effectively enabling off-policy training. To examine our derivations, we can combine the on-policy PPO clipped surrogate (which we show to be equivalent with one instance of the new reformation) with the off-policy IMPALA method. We first verify the combined method on the OpenAI Gym pendulum toy problem. Next, we use our method to train a quadrotor position controller in a simulator. Our trained policy is efficient and lightweight enough to perform in a low cost micro-controller at a minimum update rate of 500 Hz. For the quadrotor, we show two experiments to verify our method and demonstrate performance: 1) hovering at a fixed position, and 2) tracking along a specific trajectory. In preliminary trials, we are also able to apply the method to a real-world quadrotor.


Human vs. Computer Go: Review and Prospect

arXiv.org Artificial Intelligence

The Google DeepMind challenge match in March 2016 was a historic achievement for computer Go development. This article discusses the development of computational intelligence (CI) and its relative strength in comparison with human intelligence for the game of Go. We first summarize the milestones achieved for computer Go from 1998 to 2016. Then, the computer Go programs that have participated in previous IEEE CIS competitions as well as methods and techniques used in AlphaGo are briefly introduced. Commentaries from three high-level professional Go players on the five AlphaGo versus Lee Sedol games are also included. We conclude that AlphaGo beating Lee Sedol is a huge achievement in artificial intelligence (AI) based largely on CI methods. In the future, powerful computer Go programs such as AlphaGo are expected to be instrumental in promoting Go education and AI real-world applications.