
Collaborating Authors: Vrije Universiteit Brussel


Adapting to Concept Drift in Credit Card Transaction Data Streams Using Contextual Bandits and Decision Trees

AAAI Conferences

Credit card transactions predicted to be fraudulent by automated detection systems are typically handed over to human experts for verification. To limit costs, it is standard practice to select only the most suspicious transactions for investigation. We claim that a trade-off between exploration and exploitation is imperative to enable adaptation to changes in behavior (concept drift). Exploration consists of the selection and investigation of transactions with the purpose of improving predictive models, and exploitation consists of investigating transactions detected to be suspicious. Modeling the detection of fraudulent transactions as rewarding, we use an incremental Regression Tree learner to create clusters of transactions with similar expected rewards. This enables the use of a Contextual Multi-Armed Bandit (CMAB) algorithm to provide the exploration/exploitation trade-off. We introduce a novel variant of a CMAB algorithm that makes use of the structure of this tree, and use Semi-Supervised Learning to grow the tree using unlabeled data. The approach is evaluated on a real dataset and data generated by a simulator that adds concept drift by adapting the behavior of fraudsters to avoid detection. It outperforms frequently used offline models in terms of cumulative rewards, in particular in the presence of concept drift.
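
For readers who want the mechanics, the following is a minimal sketch of the exploration/exploitation idea described above, assuming the incremental regression tree has already produced a fixed set of leaves (clusters) and using a plain epsilon-greedy rule in place of the paper's tree-aware CMAB variant; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class LeafBandit:
    """Epsilon-greedy bandit over the leaves (clusters) of a regression tree.

    Each leaf keeps a running mean of observed rewards (1 = confirmed fraud,
    0 = legitimate). The paper grows the tree incrementally and uses a
    tree-aware CMAB variant; both are simplified away here.
    """

    def __init__(self, n_leaves, epsilon=0.1):
        self.counts = np.zeros(n_leaves)
        self.means = np.zeros(n_leaves)
        self.epsilon = epsilon

    def select_leaf(self):
        # Exploration: occasionally investigate a random cluster.
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.means)))
        # Exploitation: investigate the cluster with the highest expected reward.
        return int(np.argmax(self.means))

    def update(self, leaf, reward):
        self.counts[leaf] += 1
        self.means[leaf] += (reward - self.means[leaf]) / self.counts[leaf]

# Toy run: three clusters with hidden fraud rates; the bandit learns where to look.
true_fraud_rate = [0.05, 0.4, 0.1]
bandit = LeafBandit(n_leaves=3)
for _ in range(1000):
    leaf = bandit.select_leaf()
    reward = float(rng.random() < true_fraud_rate[leaf])  # 1 if fraud confirmed
    bandit.update(leaf, reward)
print(bandit.means.round(2))
```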


Learning With Options That Terminate Off-Policy

AAAI Conferences

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just as is done with policies in off-policy learning. To this end, we give a new algorithm, Q(beta), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(beta) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it lives up to its motivating claims.
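
A simplified, one-step flavour of the decoupling idea: the backup below uses a target termination beta that can differ from how the option actually terminated while acting. This is only a sketch of the principle; the paper's Q(beta) algorithm adds multi-step off-policy corrections on top of such backups, and the tabular setup here is an assumption for illustration.

```python
import numpy as np

def intra_option_backup(Q, s_next, option, reward, gamma, beta_target):
    """One-step backup whose *target* termination beta_target is decoupled
    from how the option actually terminated while acting.

    With probability beta_target the backup assumes the option ends (switch
    to the best option); otherwise it assumes the option continues.
    """
    continue_value = Q[s_next, option]   # option keeps running
    terminate_value = Q[s_next].max()    # option ends, switch to the best one
    u = (1.0 - beta_target) * continue_value + beta_target * terminate_value
    return reward + gamma * u

# Toy usage with a tabular Q over (state, option) pairs.
n_states, n_options = 5, 2
Q = np.zeros((n_states, n_options))
alpha, gamma = 0.5, 0.9
s, o, r, s_next = 0, 1, 1.0, 2
target = intra_option_backup(Q, s_next, o, r, gamma, beta_target=0.3)
Q[s, o] += alpha * (target - Q[s, o])
print(Q[s, o])
```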


Coordinating Human and Agent Behavior in Collective-Risk Scenarios

AAAI Conferences

Various social situations entail a collective risk. A well-known example is climate change, wherein the risk of a future environmental disaster clashes with the immediate economic interests of developed and developing countries. The collective-risk game operationalizes this kind of situation. The decision process of the participants is determined by how good they are at evaluating the probability of future risk, as well as by their ability to anticipate the actions of their opponents. Anticipatory behavior contrasts with the reactive theories often used to analyze social dilemmas. Our initial work already shows that anticipative agents are a better model of human behavior than reactive ones. All the agents we studied used a recurrent neural network; however, only the ones that used it to predict future outcomes (anticipative agents) were able to account for changes in the context of the games, a behavior also observed in experiments with humans. This extended abstract explains how we intend to investigate anticipation within the context of the collective-risk game and the relevance these results may have for the field of hybrid socio-technical systems.
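
As an illustration of the underlying game, the sketch below computes payoffs in a stylized collective-risk game: if the group's total contribution misses a target, everyone loses their remaining endowment with some probability. Group size, endowment, target, and risk values are assumptions for the example, not the parameters used in the study.

```python
import numpy as np

def collective_risk_payoffs(contributions, endowment=40.0, threshold=120.0,
                            risk=0.9, rng=None):
    """Payoffs in a stylized collective-risk game.

    contributions: total amount each player gave over all rounds.
    If the group misses the threshold, everyone loses what they kept with
    probability `risk`; otherwise players keep their remaining endowment.
    """
    if rng is None:
        rng = np.random.default_rng()
    contributions = np.asarray(contributions, dtype=float)
    kept = endowment - contributions
    if contributions.sum() >= threshold:
        return kept                      # target reached: keep the remainder
    if rng.random() < risk:
        return np.zeros_like(kept)       # collective disaster: everyone loses
    return kept                          # the disaster happened not to occur

print(collective_risk_payoffs([20, 20, 20, 20, 20, 20]))  # target met
print(collective_risk_payoffs([0, 0, 0, 0, 0, 0]))        # target missed
```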


Reactive Versus Anticipative Decision Making in a Novel Gift-Giving Game

AAAI Conferences

Evolutionary game theory focuses on the fitness differences between simple discrete or probabilistic strategies to explain the evolution of particular decision-making behavior within strategic situations. Although this approach has provided substantial insight into the presence of fairness or generosity in gift-giving games, it does not fully resolve the question of which cognitive mechanisms are required to produce the choices observed in experiments. One such mechanism that humans have acquired is the capacity to anticipate. Prior work showed that forward-looking behavior, using a recurrent neural network to model the cognitive mechanism, is essential to reproduce the actions of human participants in behavioral experiments. In this paper, we evaluate whether this conclusion also extends to gift-giving games, more concretely to a game that combines the dictator game with a partner selection process. The recurrent neural network model used here for dictators allows them to reason about a best response to past actions of the receivers (reactive model) or to decide which action will lead to a more successful outcome in the future (anticipatory model). For both models we show the decision dynamics during training, as well as the average behavior. We find that the anticipatory model is the only one capable of accounting for changes in the context of the game, a behavior also observed in experiments, extending previous conclusions to this more sophisticated game.


Dimensionality Reduced Reinforcement Learning for Assistive Robots

AAAI Conferences

State-of-the-art personal robots need to perform complex manipulation tasks to be viable in assistive scenarios. However, many of these robots, like the PR2, use manipulators with many degrees of freedom, and the problem is made worse in bimanual manipulation tasks. The complexity of these robots leads to high-dimensional state spaces, which are difficult to learn in. We reduce the state space by using demonstrations to discover a representative low-dimensional hyperplane in which to learn. This allows the agent to converge quickly to a good policy. We call this Dimensionality Reduced Reinforcement Learning (DRRL). However, when performing dimensionality reduction, not all dimensions can be fully represented. We extend this work by first learning in a single dimension, and then transferring that knowledge to a higher-dimensional hyperplane. By using our Iterative DRRL (IDRRL) framework with an existing learning algorithm, the agent converges quickly to a better policy by iterating to increasingly higher dimensions. IDRRL is robust to demonstration quality and can learn efficiently using few demonstrations. We show that adding IDRRL to the Q-Learning algorithm leads to faster learning on a set of mountain car tasks and the robot swimmers problem.
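
A rough sketch of the dimensionality-reduction step, assuming the low-dimensional hyperplane is obtained by PCA over demonstration states; the paper's exact construction and the transfer of values between dimensions are not reproduced here, and the data is a placeholder.

```python
import numpy as np

def demo_hyperplane(demo_states, k):
    """Principal directions of the demonstration states: a stand-in for the
    representative low-dimensional hyperplane discovered from demonstrations."""
    X = demo_states - demo_states.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T                      # shape: (full_dim, k)

def project(state, basis):
    """Map a full-dimensional state onto the k-dimensional hyperplane."""
    return state @ basis

# Iterative flavour: learn with k = 1, then expand to k = 2, 3, ... reusing
# what was learned (the value-transfer step is omitted in this sketch).
rng = np.random.default_rng(1)
demos = rng.normal(size=(200, 6))        # placeholder demonstration states, 6-D
for k in (1, 2, 3):
    basis = demo_hyperplane(demos, k)
    z = project(rng.normal(size=6), basis)
    print(k, z.shape)                    # learning would operate on z, not on s
```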


Guilt for Non-Humans

AAAI Conferences

Theorists conceive of shame and guilt as belonging to the family of self-conscious emotions (Lewis 1990; Fischer and Tangney 1995; Tangney and Dearing 2002), invoked through self-reflection and self-evaluation. Though both have evolved to promote cooperation, guilt and shame can be treated separately. Guilt is an inward, private phenomenon, though it can promote apology, and even spontaneous public confession. Shame, by contrast, is inherently public. We know too that guilt may be alleviated by private confession (namely to a priest or a psychotherapist) plus the renouncing of past failings in future. Because of their private character, such confessions and atonements, given their cost (prayers or fees), render defecting under temptation less probable. Public or open confession of guilt can be coordinated with apology for better effect, and the cost appertained to some common good (like charity), or as individual compensation to injured parties.


Conditions for the Evolution of Apology and Forgiveness in Populations of Autonomous Agents

AAAI Conferences

We report here on our previous research on the evolution of commitment behaviour in the one-off and iterated prisoner's dilemma and relate it to the issue of designing non-human autonomous online systems. We show that it was necessary to introduce an apology/forgiveness mechanism in the iterated case, since without this restorative mechanism strategies evolve that take revenge when the agreement fails. As observed before in online interaction systems, apology and forgiveness seem to provide important mechanisms to repair trust. As such, these results provide, next to insight into our own moral and ethical considerations, ideas about how (and also why) similar mechanisms can be designed into the repertoire of actions available to non-human autonomous agents.
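
To make the mechanism concrete, here is a toy sketch (not the paper's evolutionary model) contrasting a noisy iterated prisoner's dilemma with and without an apology/forgiveness step after an accidental defection; the payoff and apology values are illustrative assumptions.

```python
def iterated_payoffs(rounds, mistake_round, use_apology,
                     R=3.0, S=0.0, T=5.0, P=1.0, apology=2.0):
    """Noisy iterated prisoner's dilemma under an agreement (toy version).

    At `mistake_round` player A defects by accident. With apology/forgiveness,
    A compensates B and mutual cooperation resumes; without it, B takes
    revenge and both defect for the rest of the game.
    """
    a = b = 0.0
    revenge = False
    for t in range(rounds):
        if t == mistake_round:
            a += T; b += S                  # accidental defection by A
            if use_apology:
                a -= apology; b += apology  # A compensates B, trust is repaired
            else:
                revenge = True
        elif revenge:
            a += P; b += P                  # mutual defection after the breakdown
        else:
            a += R; b += R                  # mutual cooperation under the agreement
    return a, b

print(iterated_payoffs(10, 3, use_apology=True))   # both end up better off
print(iterated_payoffs(10, 3, use_apology=False))  # revenge hurts both players
```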


Expressing Arbitrary Reward Functions as Potential-Based Advice

AAAI Conferences

Effectively incorporating external advice is an important problem in reinforcement learning, especially as it moves into the real world. Potential-based reward shaping is a way to provide the agent with a specific form of additional reward, with the guarantee of policy invariance. In this work we give a novel way to incorporate an arbitrary reward function with the same guarantee, by implicitly translating it into the specific form of dynamic advice potentials, which are maintained as an auxiliary value function learnt at the same time. We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically.
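
A tabular sketch of the idea as described above: the potential is itself a value function, learned online from the advice signal, and the agent receives the dynamic potential difference as shaping. The SARSA-style update on the negated advice reward is our reading of how such a potential can capture the advice in expectation, not a verified reproduction of the paper's algorithm; states, actions, and rates are placeholders.

```python
import numpy as np

def shaped_reward(reward, phi, s, a, s_next, a_next, gamma):
    """Dynamic potential-based advice: the agent is given the (time-varying)
    potential difference over state-action pairs on top of its base reward."""
    return reward + gamma * phi[s_next, a_next] - phi[s, a]

def update_potential(phi, advice_reward, s, a, s_next, a_next, gamma, alpha):
    """Learn the potential as an auxiliary value function of the advice,
    here via a SARSA-style update on the negated advice reward, so that the
    induced shaping term tracks the advice in expectation."""
    target = -advice_reward + gamma * phi[s_next, a_next]
    phi[s, a] += alpha * (target - phi[s, a])
    return phi

# Toy usage with tabular tables over 4 states and 2 actions.
phi = np.zeros((4, 2))
phi = update_potential(phi, advice_reward=1.0, s=0, a=1, s_next=2, a_next=0,
                       gamma=0.95, alpha=0.1)
print(shaped_reward(0.0, phi, 0, 1, 2, 0, gamma=0.95))  # moves toward the advice
```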


Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence

AAAI Conferences

Multi-objective problems with correlated objectives are a class of problems that deserve specific attention. In contrast to typical multi-objective problems, they do not require the identification of trade-offs between the objectives, as (near-) optimal solutions for any objective are (near-) optimal for every objective. Intelligently combining the feedback from these objectives, instead of only looking at a single one, can improve optimization. This class of problems is very relevant in reinforcement learning, as any single-objective reinforcement learning problem can be framed as such a multi-objective problem using multiple reward shaping functions. After discussing this problem class, we propose a solution technique for such reinforcement learning problems, called adaptive objective selection. This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates. This confidence metric is then used to choose which objective's estimates to use for action selection. We show significant improvements in performance over other plausible techniques on two problem domains. Finally, we provide an intuitive analysis of the technique's decisions, yielding insights into the nature of the problems being solved.
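
A sketch of adaptive objective selection with a stand-in confidence measure: each objective keeps an ensemble of Q-estimates, and the objective whose ensemble disagrees least at the current state is trusted for action selection. The ensemble-disagreement metric is an assumption for illustration; the paper defines its own confidence measure.

```python
import numpy as np

def adaptive_objective_selection(q_ensembles, state):
    """Pick an action using the objective whose estimates we trust most.

    q_ensembles: array of shape (n_objectives, n_ensemble, n_states, n_actions)
    holding parallel Q estimates per objective. Confidence is taken here as
    low disagreement within an objective's ensemble at the current state.
    """
    n_objectives = q_ensembles.shape[0]
    disagreement = [q_ensembles[i, :, state, :].std(axis=0).mean()
                    for i in range(n_objectives)]
    best_obj = int(np.argmin(disagreement))           # most confident objective
    q_mean = q_ensembles[best_obj, :, state, :].mean(axis=0)
    return int(np.argmax(q_mean)), best_obj

# Toy usage: 2 objectives, ensembles of 5 estimates, 10 states, 3 actions.
rng = np.random.default_rng(2)
q = rng.normal(size=(2, 5, 10, 3))
action, objective_used = adaptive_objective_selection(q, state=4)
print(action, objective_used)
```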


Reinforcement Learning on Multiple Correlated Signals

AAAI Conferences

In multi-objective problems, conflicts may exist between objectives, so there is in general a need to identify (a set of) trade-off solutions. The set of optimal, i.e. non-dominated, incomparable solutions is called the Pareto-front. We identify multi-objective problems with correlated objectives (CMOP) as a specific subclass of multi-objective problems, defined to contain those MOPs whose Pareto-front is so limited that one can barely speak of trade-offs (Brys et al. 2014b). By consequence, the system designer does not care which of these very similar optimal solutions is found. Any single-objective MDP can be turned into such a problem (a CMOMDP) by adding copies of its reward signal modified with potential-based reward shaping functions (heuristic signals guiding exploration) (Brys et al. 2014a). We prove that this modification preserves the total order, and thus also optimality, of policies, mainly relying on the results by Ng, Harada, and Russell (1999). This insight, that any MDP can be framed as a CMOMDP, significantly increases the importance of this problem class, as well as of techniques developed for it, as these could be used to solve regular single-objective MDPs faster and better, provided several meaningful shapings can be devised.
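
To illustrate the framing, the sketch below turns a single reward into several correlated signals by applying different potential-based shapings, each of which preserves the optimal policy by the Ng, Harada, and Russell (1999) result; the potential functions themselves are arbitrary heuristics chosen for the example.

```python
import numpy as np

def shaped_signals(reward, s, s_next, potentials, gamma):
    """Frame one reward as several correlated signals, one per shaping.

    Signal i is r + gamma * phi_i(s') - phi_i(s), a potential-based shaping
    of the same base reward, so every signal shares the same optimal policies
    and the objectives are correlated in the CMOMDP sense."""
    return np.array([reward + gamma * phi[s_next] - phi[s]
                     for phi in potentials])

# Toy usage: two heuristic potentials over 5 states of some task.
phi_progress = np.array([0.0, 0.2, 0.4, 0.7, 1.0])  # e.g. distance-based heuristic
phi_other = np.array([0.0, 0.5, 0.5, 0.8, 1.0])     # a second, different heuristic
print(shaped_signals(reward=1.0, s=1, s_next=2,
                     potentials=[phi_progress, phi_other], gamma=0.99))
```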