Statistical Inference in Reinforcement Learning: A Selective Survey
Thus, the observed data can be summarized into a sequence of "observation-action-reward" triplets $(O_t, A_t, R_t)_{t \ge 0}$. It is worth noting that the observation $O_t$ at each time step is not equivalent to the environment's state $S_t$. Indeed, the state can be viewed as a special observation satisfying the Markov property, and we will elaborate on the difference between the two later.

Policies: The goal of RL is to learn an optimal policy $\pi$ based on the observation-action-reward triplets so as to maximize the agent's cumulative reward. Mathematically, a policy is a conditional probability distribution function mapping the agent's observed data history to the action space; it specifies the probability of the agent taking each action at every time step. Below, we introduce three types of policies (see Figure 1(b) for a visualization of their relationships):

(1) History-dependent policy: This is the most general form of policy. At each time $t$, we define $H_t$ as the set containing the current observation $O_t$ and all prior historical information $(O_i, A_i, R_i)_{0 \le i < t}$; a history-dependent policy then maps $H_t$ to a probability distribution over the action space.
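To fix ideas, the following is a minimal Python sketch of a history-dependent policy viewed as a conditional distribution over actions given the history $H_t$. It is our illustration, not code from the survey: the `Triplet` container, the two-action space, and the toy decision rule (biasing toward one action when past rewards have been positive on average) are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Triplet:
    """One step of the observed data: observation O_i, action A_i, reward R_i."""
    observation: float
    action: int
    reward: float

def history_dependent_policy(history: List[Triplet], o_t: float) -> Dict[int, float]:
    """Map the history H_t (all prior triplets plus the current observation
    O_t) to a probability distribution over a toy action space {0, 1}.

    The rule here is purely illustrative: favor action 1 when the average of
    past rewards is positive. A Markov policy would instead condition only on
    the current state, ignoring the earlier triplets.
    """
    avg_reward = sum(t.reward for t in history) / len(history) if history else 0.0
    p1 = 0.8 if avg_reward > 0 else 0.2
    return {0: 1.0 - p1, 1: p1}

# Sampling an action from the policy's conditional distribution at time t.
history = [Triplet(0.5, 1, 1.0), Triplet(0.3, 0, -0.5)]
dist = history_dependent_policy(history, o_t=0.7)
action = random.choices(list(dist), weights=list(dist.values()))[0]
```

The point of the sketch is only that the policy's input grows with $t$: every prior triplet may influence the action distribution, which is what makes this the most general class of policies.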