Aligning Large Language Models by On-Policy Self-Judgment