Improved Off-policy Reinforcement Learning in Biological Sequence Design