
Neural Information Processing Systems 

In this section, we further clarify why a naive application of data augmentation with certain RL algorithms is theoretically unsound. The correct estimate of the policy gradient objective used in PPO is the one in equation (1) (or, equivalently, equation (8)), which does not use the augmented observations at all, since we are estimating the advantages A(s, a) for the actual observations. The distribution used to sample actions is π_old(a|s) rather than π_old(a|f(s)), because we can only interact with the environment via the true observations and not the augmented ones (the reward and transition functions are not defined for augmented observations). Hence, the correct importance sampling estimate uses the ratio π(a|s)/π_old(a|s); using π(a|f(s))/π_old(a|f(s)) instead would be incorrect for the reasons mentioned above. In contrast, DrAC does not change the policy gradient objective at all, which remains the one in equation (1); instead, it uses the augmented observations only in the additional regularization losses, as shown in equations (3), (4), and (5).
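The distinction above can be made concrete with a minimal NumPy sketch. The clipped PPO objective is computed entirely from log-probabilities and advantages on the true observations s, while the augmented observations f(s) enter only through the two regularizers: a policy divergence term and a value-consistency term. The function names and the discrete-action setup here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ppo_clip_objective(logp, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate objective on TRUE observations only.

    The importance ratio must be pi(a|s)/pi_old(a|s): the advantages
    A(s, a) and the sampling distribution pi_old(a|s) are both defined
    on the actual observations, never on the augmented ones.
    """
    ratio = np.exp(logp - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * adv, clipped * adv).mean()

def drac_regularizers(pi_s, pi_fs, v_s, v_fs):
    """Regularization terms in the style of DrAC's extra losses.

    pi_s, pi_fs: action distributions on true / augmented observations
                 (rows of probabilities over a discrete action space).
    v_s, v_fs:   value estimates on true / augmented observations.
    """
    # Policy term: divergence between pi(.|s) and pi(.|f(s)).
    g_pi = np.sum(pi_s * (np.log(pi_s) - np.log(pi_fs)), axis=-1).mean()
    # Value term: squared error between V(f(s)) and V(s).
    g_v = np.mean((v_fs - v_s) ** 2)
    return g_pi, g_v
```

The regularizers are then subtracted (with a weight) from the PPO objective; the key point the sketch illustrates is that `ppo_clip_objective` never sees `pi_fs` or `v_fs`, i.e. augmentation never touches the policy gradient estimate itself.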
