Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability