 vpg


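$\ldots \langle z_i, z_i \rangle] + \langle \mathbb{E}_a[\hat{V}_i z_i], \mathbb{E}_a[\hat{V}_i z_i] \rangle = \mathbb{E}_a[\hat{V}_i^2 \langle z_i, z_i \rangle] + \langle \mathbb{E}_{\sigma_{\pi_i}(a)}[\hat{V}_i \, \mathbb{E}_{\sigma_{\pi_i}(a)}[z_i]], \mathbb{E}_{\sigma_{\pi_i}(a)}[\hat{V}_i \, \mathbb{E}_{\sigma_{\pi_i}(a)}[z_i]] \rangle = \mathbb{E}_a[\ldots$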

Neural Information Processing Systems

$\mathrm{Cov}[g_i(s,a), g_j(s,a)]$. (9) The n optimal baselines are given by the values that minimise Equation 9; i.e. $b_i^\star(s, \sigma_{\pi_i}(a))$. Note that while $y_i$ depends on the full action, $x_i$ depends only on the actions influencing the targets in $[K\Sigma\psi(s,a)]_i$. In general, there are very few methods that can solve these types of systems, and those that can are limited to bounds of approximately d|Σ| 20. These explore the impact of the factor baseline across a set of dimensionalities and learning rates. This implies that the performance observed in the search bandit is likely to tell us about the performance in full MDPs.


Improving Value Estimation Critically Enhances Vanilla Policy Gradient

arXiv.org Artificial Intelligence

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
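
As a rough illustration of what "more value update steps per iteration" means, the sketch below fits a linear value function to a fixed batch of returns with either one or many gradient steps and compares the remaining error. This is a toy under stated assumptions, not the paper's implementation: the linear model, the synthetic data, and the names `fit_value`, `value_mse`, and `compare_update_counts` are all hypothetical.

```python
import numpy as np

def fit_value(features, returns, weights, num_steps, lr=0.1):
    """Run `num_steps` gradient steps on the mean squared value error."""
    w = weights.copy()
    for _ in range(num_steps):
        residual = features @ w - returns          # V(s) - return, per sample
        w -= lr * features.T @ residual / len(returns)
    return w

def value_mse(features, returns, w):
    """Mean squared error between predicted values and returns."""
    return float(np.mean((features @ w - returns) ** 2))

def compare_update_counts(few=1, many=50, seed=0):
    """Fit the same batch with few vs. many value steps (synthetic data)."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(256, 4))           # stand-in state features
    true_w = np.array([1.0, -2.0, 0.5, 3.0])       # assumed ground truth
    returns = features @ true_w + rng.normal(scale=0.1, size=256)
    w0 = np.zeros(4)
    w_few = fit_value(features, returns, w0, few)
    w_many = fit_value(features, returns, w0, many)
    return value_mse(features, returns, w_few), value_mse(features, returns, w_many)

if __name__ == "__main__":
    mse_few, mse_many = compare_update_counts()
    print(f"value MSE after 1 step: {mse_few:.4f}, after 50 steps: {mse_many:.4f}")
```

In a full vanilla policy gradient loop, `num_steps` would correspond to the number of value-network updates performed on each freshly collected batch before the next policy update.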


VPGTrans: Transfer Visual Prompt Generator across LLMs

arXiv.org Artificial Intelligence

While developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm. However, further tuning the VPG part of the MLLM still suffers from indispensable computational costs, i.e., requiring thousands of GPU hours and millions of training data. One alternative solution is to transfer an existing VPG from any existing MLLMs for the target MLLM. In this work, we for the first time investigate the VPG transferability across LLMs, and explore a solution to reduce the cost of VPG transfer. We first study the VPG transfer across different LLM sizes (e.g., small-to-large), and across different LLM types, through which we diagnose the key factors to maximize the transfer efficiency. Based on our observation, we design a two-stage transfer framework named VPGTrans, which is simple yet highly effective. Through extensive experiments, we demonstrate that VPGTrans helps significantly speed up the transfer learning process without compromising performance. Remarkably, it helps achieve the VPG transfer from BLIP-2 OPT$_\text{2.7B}$ to BLIP-2 OPT$_\text{6.7B}$ with over 10 times speed-up and 10.7% training data compared with connecting a VPG to OPT$_\text{6.7B}$ from scratch. Further, a series of intriguing findings and potential rationales behind them are provided and discussed. Finally, we showcase the practical value of our VPGTrans approach, by customizing two novel MLLMs, including VL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs.


Vanilla Policy Gradient(VPG)-RL

#artificialintelligence

Reinforcement learning (RL) is the branch of machine learning that is concerned with making sequences of decisions. It considers an agent situated in an environment: at each timestep, the agent takes an action, and it receives an observation and reward. An RL algorithm seeks to maximize the agent's total reward, given a previously unknown environment, through a trial-and-error learning process. The key idea of policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy. Policy gradient methods are a class of reinforcement learning techniques that optimize parametrized policies with respect to the expected return (long-term cumulative reward) by gradient ascent. They do not suffer from many of the problems that have been marring traditional reinforcement learning approaches, such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information, and the complexity arising from continuous states and actions.
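
The "push up, push down" idea can be sketched on the simplest possible problem, a multi-armed bandit with a softmax policy. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the Gaussian reward noise, the running-average baseline, and all hyperparameters are hypothetical choices made for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def train_bandit_policy(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Score-function (REINFORCE-style) updates on a softmax bandit policy."""
    rng = np.random.default_rng(seed)
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    logits = np.zeros(len(mean_rewards))   # policy parameters
    baseline = 0.0                         # running average of rewards
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(probs), p=probs)
        reward = mean_rewards[action] + rng.normal(scale=0.1)
        # Gradient of log pi(action) w.r.t. the logits: onehot(action) - probs.
        grad_log_prob = -probs
        grad_log_prob[action] += 1.0
        # Gradient ascent: actions with above-baseline reward become more likely.
        logits += lr * (reward - baseline) * grad_log_prob
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)

if __name__ == "__main__":
    print(train_bandit_policy([0.0, 0.5, 1.0]))
```

Subtracting a baseline does not change the expected gradient but typically reduces its variance; in full MDPs the reward term is replaced by a return or advantage estimate.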


Proximal Policy Gradient: PPO with Policy Gradient

arXiv.org Artificial Intelligence

In this paper, we propose a new algorithm PPG (Proximal Policy Gradient), which is close to both VPG (vanilla policy gradient) and PPO (proximal policy optimization). The PPG objective is a partial variation of the VPG objective and the gradient of the PPG objective is exactly the same as the gradient of the VPG objective. To increase the number of policy update iterations, we introduce the advantage-policy plane and design a new clipping strategy. We perform experiments in OpenAI Gym and Bullet robotics environments for ten random seeds. The performance of PPG is comparable to PPO, and the entropy of PPG decays more slowly than that of PPO. Thus we show that performance similar to PPO can be obtained by using the gradient formula from the original policy gradient theorem.


Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning

arXiv.org Machine Learning

In this work, we demonstrate that it is possible to discover and learn synergies between pushing and grasping from scratch through model-free deep reinforcement learning. Our method involves training two fully convolutional networks that map from visual observations to actions: one infers the utility of pushes for a dense pixel-wise sampling of end effector orientations and locations, while the other does the same for grasping. Both networks are trained jointly in a Q-learning framework and are entirely self-supervised by trial and error, where rewards are provided from successful grasps. In this way, our policy learns pushing motions that enable future grasps, while learning grasps that can leverage past pushes. During picking experiments in both simulation and real-world scenarios, we find that our system quickly learns complex behaviors amid challenging cases of clutter, and achieves better grasping success rates and picking efficiencies than baseline alternatives after only a few hours of training. We further demonstrate that our method is capable of generalizing to novel objects. Qualitative results (videos), code, pre-trained models, and simulation environments are available at http://vpg.cs.princeton.edu