Supplementary Material for "Training Over-parameterized Models with Non-decomposable Objectives"

Algorithm 2: Reductions-based Algorithm for Constraining Coverage (2)


Recall that the algorithms discussed in Section 2 have two intuitive steps: (i) update the multipliers $\lambda$ based on the current classifier's performance and construct a gain matrix $G$; (ii) train a new classifier by optimizing a cost-sensitive loss $\ell$. These algorithms additionally incorporate the "two dataset" trick suggested by Cotter et al. [14] for better generalization, wherein the updates on $\lambda$ are performed using a held-out validation set $S^{\text{val}}$. In Algorithm 2, we seek to find a saddle point of the Lagrangian max-min problem for (2); a sketch of such a primal-dual loop is given at the end of this section. See Chen et al. [12] and Cotter et al. [16] for theoretical guarantees for the learned classifier, which usually require the algorithms to output a stochastic classifier that averages over the individual iterates $h^1, \ldots, h^T$.

Eban et al. [25] provide details of how the optimization of these metrics can be posed as constrained optimization problems, and in turn reduced to cost-sensitive learning problems. More generally, using the reduction techniques from Narasimhan et al. [71], any learning problem of the following form, where $\psi$ and $\phi_1, \ldots, \phi_K$ are functions of the confusion matrix $C[h]$, can be reduced to cost-sensitive learning problems, and thus tackled by the proposed approach:
$$\max_h\; \psi(C[h]) \quad \text{s.t.} \quad \phi_k(C[h]) \,\leq\, 0, \;\; \forall k \in [K].$$
Narasimhan et al. [71] provide details of the Lagrangian primal-dual optimization for different metrics.

We will find the following standard result to be useful in our proofs. Since the negative log loss is strictly proper, in the sense of Gneiting and Raftery [30] and Williamson et al. [95], we have that:

Lemma 6 (Gneiting and Raftery [30], Williamson et al. [95]). For any distribution $D$ over $\mathcal{X} \times [m]$, the expected loss $\mathbb{E}_{(x,y) \sim D}\left[-\log p_y(x)\right]$ is minimized over all $p: \mathcal{X} \rightarrow \Delta_m$ by the conditional class probabilities $p^*_i(x) = \mathbb{P}(y = i \mid x)$.
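To make the properness property behind Lemma 6 concrete, the following minimal numpy check (a sketch, using a hypothetical 3-class conditional distribution `eta` that is not from the paper) verifies numerically that the expected negative-log loss at a single example is never lower for any other prediction on the simplex than it is at the true class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_log_loss(eta, p):
    """E_{y ~ eta}[-log p_y] for a single example with class-probabilities eta."""
    return -np.sum(eta * np.log(p))

# Hypothetical true conditional class-probability vector for a 3-class example.
eta = np.array([0.6, 0.3, 0.1])

# Expected log loss when predicting the true probabilities...
loss_at_eta = expected_log_loss(eta, eta)

# ...is never beaten by any other prediction on the simplex (properness).
for _ in range(10_000):
    p = rng.dirichlet(np.ones(3))  # random candidate prediction
    assert expected_log_loss(eta, p) >= loss_at_eta - 1e-12

print(f"E[-log p_y] at p = eta: {loss_at_eta:.4f} (the minimum; equals the entropy of eta)")
```

The gap between the two losses is exactly the KL divergence from the prediction to `eta`, which is why the minimum is attained only at the true probabilities.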
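For concreteness, below is a minimal numpy sketch of the kind of Lagrangian primal-dual loop described above, specialized to a coverage-type constraint. It replaces the cost-sensitive training step with a plug-in classifier over fixed (synthetic) class-probability estimates, and performs the multiplier updates on a separate split in the spirit of the "two dataset" trick. The target $\tau = 0.95/m$, the step size, the data, and the use of the last iterate in place of a stochastic average over iterates are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, tau = 3, 0.95 / 3           # per-class coverage target (illustrative)

# Synthetic class-probability estimates standing in for a trained model's softmax
# outputs; one split drives the cost-sensitive step, the other the lambda updates.
def make_split(n):
    logits = rng.normal(size=(n, num_classes)) + np.array([2.0, 0.0, -2.0])  # skewed classes
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(num_classes, p=p) for p in probs])
    return probs, labels

probs_tr, y_tr = make_split(2000)
probs_val, y_val = make_split(2000)

lam = np.zeros(num_classes)               # Lagrange multipliers for the coverage constraints
step_size = 0.5

for t in range(200):
    # Cost-sensitive step: with gain matrix G[y, i] = 1[y == i] + lam[i], the
    # plug-in optimal prediction is argmax_i (p_i(x) + lam[i]).
    preds_val = np.argmax(probs_val + lam, axis=1)

    # Multiplier step on the held-out split: projected gradient descent on lambda
    # using the constraint slacks cov_i[h] - tau.
    coverage = np.bincount(preds_val, minlength=num_classes) / len(preds_val)
    lam = np.maximum(lam - step_size * (coverage - tau), 0.0)

# Evaluate the final plug-in classifier on the other split.
preds_tr = np.argmax(probs_tr + lam, axis=1)
print("coverage:", np.bincount(preds_tr, minlength=num_classes) / len(preds_tr))
print("accuracy:", np.mean(preds_tr == y_tr))
```

Raising $\lambda_i$ for under-covered classes makes the cost-sensitive step favor predicting class $i$, which is exactly how the reduction trades off accuracy against the coverage constraints.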