Country
Reinforcement Learning for Agents with Many Sensors and Actuators Acting in Categorizable Environments
In this paper, we confront the problem of applying reinforcement learning to agents that perceive the environment through many sensors and that can perform parallel actions using many actuators as is the case in complex autonomous robots. We argue that reinforcement learning can only be successfully applied to this case if strong assumptions are made on the characteristics of the environment in which the learning is performed, so that the relevant sensor readings and motor commands can be readily identified. The introduction of such assumptions leads to strongly-biased learning systems that can eventually lose the generality of traditional reinforcement-learning algorithms. In this line, we observe that, in realistic situations, the reward received by the robot depends only on a reduced subset of all the executed actions and that only a reduced subset of the sensor inputs (possibly different in each situation and for each action) are relevant to predict the reward. We formalize this property in the so called 'categorizability assumption' and we present an algorithm that takes advantage of the categorizability of the environment, allowing a decrease in the learning time with respect to existing reinforcement-learning algorithms. Results of the application of the algorithm to a couple of simulated realistic-robotic problems (landmark-based navigation and the six-legged robot gait generation) are reported to validate our approach and to compare it to existing flat and generalization-based reinforcement-learning approaches.
Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis
Goldman, C. V., Zilberstein, S.
The difficulty in solving optimally such problems arises when the agents lack full observability of the global state of the system when they operate. The general problem has been shown to be NEXP-complete. In this paper, we identify classes of decentralized control problems whose complexity ranges between NEXP and P. In particular, we study problems characterized by independent transitions, independent observations, and goal-oriented objective functions. Two algorithms are shown to solve optimally useful classes of goal-oriented decentralized processes in polynomial time. This paper also studies information sharing among the decision-makers, which can improve their performance. We distinguish between three ways in which agents can exchange information: indirect communication, direct communication and sharing state features that are not controlled by the agents. Our analysis shows that for every class of problems we consider, introducing direct or indirect communication does not change the worst-case complexity. The results provide a better understanding of the complexity of decentralized control problems that arise in practice and facilitate the development of planning algorithms for these problems.
Explicit Learning Curves for Transduction and Application to Clustering and Compression Algorithms
Derbeko, P., El-Yaniv, R., Meir, R.
Inductive learning is based on inferring a general rule from a finite data set and using it to label new data. In transduction one attempts to solve the problem of using a labeled training set to label a set of unlabeled points, which are given to the learner prior to learning. Although transduction seems at the outset to be an easier task than induction, there have not been many provably useful algorithms for transduction. Moreover, the precise relation between induction and transduction has not yet been determined. The main theoretical developments related to transduction were presented by Vapnik more than twenty years ago. One of Vapnik's basic results is a rather tight error bound for transductive classification based on an exact computation of the hypergeometric tail. While tight, this bound is given implicitly via a computational routine. Our first contribution is a somewhat looser but explicit characterization of a slightly extended PAC-Bayesian version of Vapnik's transductive bound. This characterization is obtained using concentration inequalities for the tail of sums of random variables obtained by sampling without replacement. We then derive error bounds for compression schemes such as (transductive) support vector machines and for transduction algorithms based on clustering. The main observation used for deriving these new error bounds and algorithms is that the unlabeled test points, which in the transductive setting are known in advance, can be used in order to construct useful data dependent prior distributions over the hypothesis space.
Graduality in Argumentation
Cayrol, C., Lagasquie-Schiex, M. C.
Argumentation is based on the exchange and valuation of interacting arguments, followed by the selection of the most acceptable of them (for example, in order to take a decision, to make a choice). Starting from the framework proposed by Dung in 1995, our purpose is to introduce 'graduality' in the selection of the best arguments, i.e., to be able to partition the set of the arguments in more than the two usual subsets of 'selected' and 'non-selected' arguments in order to represent different levels of selection. Our basic idea is that an argument is all the more acceptable if it can be preferred to its attackers. First, we discuss general principles underlying a 'gradual' valuation of arguments based on their interactions. Following these principles, we define several valuation models for an abstract argumentation system. Then, we introduce 'graduality' in the concept of acceptability of arguments. We propose new acceptability classes and a refinement of existing classes taking advantage of an available 'gradual' valuation.
Towards Understanding and Harnessing the Potential of Clause Learning
Beame, P., Kautz, H., Sabharwal, A.
Efficient implementations of DPLL with the addition of clause learning are the fastest complete Boolean satisfiability solvers and can handle many significant real-world problems, such as verification, planning and design. Despite its importance, little is known of the ultimate strengths and limitations of the technique. This paper presents the first precise characterization of clause learning as a proof system (CL), and begins the task of understanding its power by relating it to the well-studied resolution proof system. In particular, we show that with a new learning scheme, CL can provide exponentially shorter proofs than many proper refinements of general resolution (RES) satisfying a natural property. These include regular and Davis-Putnam resolution, which are already known to be much stronger than ordinary DPLL. We also show that a slight variant of CL with unlimited restarts is as powerful as RES itself. Translating these analytical results to practice, however, presents a challenge because of the nondeterministic nature of clause learning algorithms. We propose a novel way of exploiting the underlying problem structure, in the form of a high level problem description such as a graph or PDDL specification, to guide clause learning algorithms toward faster solutions. We show that this leads to exponential speed-ups on grid and randomized pebbling problems, as well as substantial improvements on certain ordering formulas.
A Maximal Tractable Class of Soft Constraints
Cohen, D., Cooper, M., Jeavons, P., Krokhin, A.
Many researchers in artificial intelligence are beginning to explore the use of soft constraints to express a set of (possibly conflicting) problem requirements. A soft constraint is a function defined on a collection of variables which associates some measure of desirability with each possible combination of values for those variables. However, the crucial question of the computational complexity of finding the optimal solution to a collection of soft constraints has so far received very little attention. In this paper we identify a class of soft binary constraints for which the problem of finding the optimal solution is tractable. In other words, we show that for any given set of such constraints, there exists a polynomial time algorithm to determine the assignment having the best overall combined measure of desirability. This tractable class includes many commonly-occurring soft constraints, such as 'as near as possible' or 'as soon as possible after', as well as crisp constraints such as 'greater than'. Finally, we show that this tractable class is maximal, in the sense that adding any other form of soft binary constraint which is not in the class gives rise to a class of problems which is NP-hard.
Restricted Value Iteration: Theory and Algorithms
Value iteration is a popular algorithm for finding near optimal policies for POMDPs. It is inefficient due to the need to account for the entire belief space, which necessitates the solution of large numbers of linear programs. In this paper, we study value iteration restricted to belief subsets. We show that, together with properly chosen belief subsets, restricted value iteration yields near-optimal policies and we give a condition for determining whether a given belief subset would bring about savings in space and time. We also apply restricted value iteration to two interesting classes of POMDPs, namely informative POMDPs and near-discernible POMDPs.
PHA*: Finding the Shortest Path with A* in An Unknown Physical Environment
Ben-Yair, A., Felner, A., Kraus, S., Netanyahu, N., Stern, R.
We address the problem of finding the shortest path between two points in an unknown real physical environment, where a traveling agent must move around in the environment to explore unknown territory. We introduce the Physical-A* algorithm (PHA*) for solving this problem. PHA* expands all the mandatory nodes that A* would expand and returns the shortest path between the two points. However, due to the physical nature of the problem, the complexity of the algorithm is measured by the traveling effort of the moving agent and not by the number of generated nodes, as in standard A*. PHA* is presented as a two-level algorithm, such that its high level, A*, chooses the next node to be expanded and its low level directs the agent to that node in order to explore it. We present a number of variations for both the high-level and low-level procedures and evaluate their performance theoretically and experimentally. We show that the travel cost of our best variation is fairly close to the optimal travel cost, assuming that the mandatory nodes of A* are known in advance. We then generalize our algorithm to the multi-agent case, where a number of cooperative agents are designed to solve the problem. Specifically, we provide an experimental implementation for such a system. It should be noted that the problem addressed here is not a navigation problem, but rather a problem of finding the shortest path between two points for future usage.
Can We Learn to Beat the Best Stock
Borodin, A., El-Yaniv, R., Gogan, V.
A novel algorithm for actively trading stocks is presented. While traditional expert advice and "universal" algorithms (as well as standard technical trading heuristics) attempt to predict winners or trends, our approach relies on predictable statistical relations between all pairs of stocks in the market. Our empirical results on historical markets provide strong evidence that this type of technical trading can "beat the market" and moreover, can beat the best stock in the market. In doing so we utilize a new idea for smoothing critical parameters in the context of expert learning.
Price Prediction in a Trading Agent Competition
Lochner, K. M., Reeves, D. M., Vorobeychik, Y., Wellman, M. P.
The 2002 Trading Agent Competition (TAC) presented a challenging market game in the domain of travel shopping. One of the pivotal issues in this domain is uncertainty about hotel prices, which have a significant influence on the relative cost of alternative trip schedules. Thus, virtually all participants employ some method for predicting hotel prices. We survey approaches employed in the tournament, finding that agents apply an interesting diversity of techniques, taking into account differing sources of evidence bearing on prices. Based on data provided by entrants on their agents' actual predictions in the TAC-02 finals and semifinals, we analyze the relative efficacy of these approaches. The results show that taking into account game-specific information about flight prices is a major distinguishing factor. Machine learning methods effectively induce the relationship between flight and hotel prices from game data, and a purely analytical approach based on competitive equilibrium analysis achieves equal accuracy with no historical data. Employing a new measure of prediction quality, we relate absolute accuracy to bottom-line performance in the game.