When Fairness Isn't Statistical: The Limits of Machine Learning in Evaluating Legal Reasoning