Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness