Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only