Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies
Armin Kekić, Jan Schneider, Dieter Büchler, Bernhard Schölkopf, Michel Besserve
Why do reinforcement learning (RL) policies fail or succeed? This is a challenging question due to the complex, high-dimensional nature of agent-environment interactions. In this work, we take a causal perspective on explaining the behavior of RL policies by viewing the states, actions, and rewards as variables in a low-level causal model. We introduce random perturbations to policy actions during execution and observe their effects on the cumulative reward, learning a simplified high-level causal model that explains these relationships. To this end, we develop a nonlinear Causal Model Reduction framework that ensures approximate interventional consistency, meaning the simplified high-level model responds to interventions in a similar way as the original complex system. We prove that for a class of nonlinear causal models, there exists a unique solution that achieves exact interventional consistency, ensuring that the learned explanations reflect meaningful causal patterns. Experiments on both synthetic causal models and practical RL tasks, including pendulum control and robot table tennis, demonstrate that our approach can uncover important behavioral patterns, biases, and failure modes in trained RL policies.
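The perturbation step described in the abstract can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it rolls out a policy in a toy one-dimensional environment, adds Gaussian noise to each action, and records the perturbations together with the cumulative reward. All names (`rollout_with_perturbation`, `env_step`, the linear dynamics, and the noise scale) are hypothetical placeholders for this sketch.

```python
import numpy as np

def rollout_with_perturbation(policy, env_step, env_reset, noise_scale, horizon, rng):
    """Run one episode, perturbing each action with Gaussian noise.

    Returns the sequence of perturbations and the cumulative reward,
    i.e. one (intervention, outcome) sample for fitting a high-level
    causal model of perturbation -> return.
    """
    state = env_reset()
    perturbations, total_reward = [], 0.0
    for _ in range(horizon):
        action = policy(state)
        delta = rng.normal(scale=noise_scale, size=np.shape(action))
        perturbations.append(delta)
        state, reward = env_step(state, action + delta)
        total_reward += reward
    return np.array(perturbations), total_reward

# Toy stand-ins for the environment and a trained policy (assumptions).
def env_reset():
    return np.array([1.0])

def env_step(state, action):
    next_state = 0.9 * state + action          # simple linear dynamics
    reward = -float(next_state[0] ** 2)        # penalize distance from origin
    return next_state, reward

policy = lambda s: -0.5 * s                    # hand-crafted stabilizing policy

rng = np.random.default_rng(0)
deltas, ret = rollout_with_perturbation(
    policy, env_step, env_reset, noise_scale=0.1, horizon=20, rng=rng
)
```

Repeating such rollouts with many noise draws yields a dataset of (perturbation, return) pairs; the causal-model-reduction step would then fit a simplified high-level model to these interventional samples.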
Jul-22-2025