When Large Language Models are Reliable for Judging Empathic Communication