Stable Online and Offline Reinforcement Learning for Antibody CDRH3 Design