Revisiting the Reliability of Psychological Scales on Large Language Models