Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests