trpo
Details
The training is stalled if the size of the replay buffer is smaller than the minibatch size, i.e., if |B|< M. Algorithms 3 and 4 show the critic network update and the actor network and uncertainty parameter sampler update, respectively. Although we write the gradient-based update in the form of a mini-batch stochastic gradient update for simplicity, we employ an adaptive approach such as Adam [16]. The update of pk follows the exponential moving average with the momentum (1/Tlast), where Tlast is the number of steps spent in the last episode (Tlast is set to 1000 for the first episode). The reason behind this design choice is as follows. The short episode is a meaning that a bad uncertainty parameter ฯ is used in the last episode.
Evaluation of a Robust Control System in Real-World Cable-Driven Parallel Robots
Nurtdinov, Damir, Korshuk, Aliaksei, Kornaev, Alexei, Maloletov, Alexander
This study evaluates the performance of classical and modern control methods for real-world Cable-Driven Parallel Robots (CDPRs), focusing on underconstrained systems with limited time discretization. A comparative analysis is conducted between classical PID controllers and modern reinforcement learning algorithms, including Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO). The results demonstrate that TRPO outperforms other methods, achieving the lowest root mean square (RMS) errors across various trajectories and exhibiting robustness to larger time intervals between control updates. TRPO's ability to balance exploration and exploitation enables stable control in noisy, real-world environments, reducing reliance on high-frequency sensor feedback and computational demands. Cable-Driven Parallel Robots (CDPR) have unique parameters, which means they can move heavy loads within a fairly large space.
743c41a921516b04afde48bb48e28ce6-AuthorFeedback.pdf
HOOF is robust to settings within this range. We could not present results for Ant and Walker due to space constraints. Thus we are restricted to zero order optimisers. For natural gradients like TNPG, HOOF does not add any new hyperparameters beyond those used by grid search - i.e. Other methods like PBT introduce more hyperparameters than these.
Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space
Zhang, Xinyu, Deb, Aishik, Mueller, Klaus
Policy-gradient methods such as Proximal Policy Optimization (PPO) are typically updated along a single stochastic gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. Building on this insight, we visualize the parameter space spanned by policy checkpoints within an iteration and reveal that higher performing solutions often lie in nearby unexplored regions. To exploit this opportunity, we introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO, systematically probing the unexplored neighborhoods of surrogate on-policy gradient updates. Without increasing the number of gradient updates, ExploRLer achieves significant improvements over baselines in complex continuous control environments. Our results demonstrate that iteration-level exploration provides a practical and effective way to strengthen on-policy reinforcement learning and offer a fresh perspective on the limitations of the surrogate objective.