Review for NeurIPS paper: A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees

Clarity: The main paper is mostly well written; the Appendix less so (numerous typos, at the least). The work nevertheless lacks clarity because several relevant details are deferred to the supplementary material, and some aspects are not mentioned at all (at least in the main paper). The Appendix even contains a section on categorical features that is not so much as hinted at in the main paper. Clarification is needed, e.g., at the following points:

- p.2, l.70-73 is too vague; the meaning is unclear -- please clarify.
- p.2, l.85f: clarify what "[...] i enters leaf node l" means (i.e., presumably that data point i is assigned to leaf node l). If \hat{y}_i denotes a predicted label, why is it real-valued rather than in [Y]? (Also, regarding the description on p.3, l.96f: why must y_i - \hat{y}_i \geq 1 hold here -- since \hat{y}_i \in \mathbb{R}, couldn't the constraint just as well read y_i - \hat{y}_i \geq \delta for some small \delta > 0?)
- p.3, l.92: perhaps clarify "tree sparsity" -- here this actually means sparsity of the decision hyperplanes, not sparsity of the tree itself.
- The 1-norm is used in the MIP (1) and is repeatedly called "linear" later in the text (e.g., p.4, l.136), but this is technically incorrect: the 1-norm is piecewise linear, not linear (see the note below the list).
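
Note on the last point, for concreteness (a standard reformulation sketched in generic notation, not the paper's variables): a 1-norm objective can be made genuinely linear via the usual epigraph trick with auxiliary variables t_j, e.g.

\min_{a,\, t} \; \sum_{j=1}^{p} t_j \quad \text{s.t.} \quad -t_j \le a_j \le t_j, \quad j = 1, \dots, p,

so that at any optimum t_j = |a_j| and \sum_{j} t_j = \|a\|_1. Describing the model as linear after such a reformulation would be accurate; calling the 1-norm itself "linear" is not.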