mitigation strategy
Over the Returned Counterfactuals
In this appendix, we discuss a technique to optimize over the counterfactuals found by counterfactual explanation methods, such as [6]. We restate Lemma 3.1 and provide a proof.

Lemma 3.1. Assuming the counterfactual algorithm $A_\theta(x)$ follows the form of the objective in Equation 1, $\frac{\partial}{\partial x_{cf}} G(x, A_\theta(x)) = 0$, and $m$ is the number of parameters in the model, we can write the derivative of the counterfactual algorithm $A$ with respect to the model parameters $\theta$ as the Jacobian

$$\frac{\partial}{\partial \theta} A_\theta(x) = -\left[\frac{\partial^2 G(x, A_\theta(x))}{\partial x_{cf}^2}\right]^{-1} \frac{\partial}{\partial \theta} \frac{\partial}{\partial x_{cf}} G(x, x_{cf}) \qquad (7)$$

This problem is identical to a well-studied class of bi-level optimization problems in deep learning. In these problems, we must compute the derivative of a function with respect to some parameter (here $\theta$) that includes an inner argmin, which itself depends on the parameter. We follow [44] to complete the proof.
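As an illustration of the implicit-differentiation idea behind Lemma 3.1, the following is a minimal numerical sketch, not the paper's implementation. It assumes a toy quadratic objective (squared distance to the input plus a squared-error term on a linear model's output) so that the inner argmin can be solved exactly. The objective, the constants `LAM` and `TARGET`, and all function names are hypothetical choices made for this example. The sketch computes the Jacobian of the counterfactual with respect to the model parameters as the negative inverse Hessian times the mixed partial derivative, and compares it against central finite differences.

```python
import numpy as np

# Hypothetical toy objective (an assumption for illustration only):
#   G(x, x_cf; theta) = 0.5 * ||x_cf - x||^2 + LAM * (theta @ x_cf - TARGET)^2
# where the "model" is linear, f(x) = theta @ x, so m equals the input dimension.
LAM = 1.0      # hypothetical trade-off weight
TARGET = 1.0   # hypothetical desired model output


def hessian_xcf(theta):
    """Second derivative of G with respect to x_cf (constant for this quadratic G)."""
    d = theta.size
    return np.eye(d) + 2.0 * LAM * np.outer(theta, theta)


def grad_xcf(theta, x, x_cf):
    """First derivative of G with respect to x_cf."""
    return (x_cf - x) + 2.0 * LAM * (theta @ x_cf - TARGET) * theta


def solve_counterfactual(theta, x):
    """Inner argmin: solve grad_xcf = 0 exactly, which is a linear system here."""
    return np.linalg.solve(hessian_xcf(theta), x + 2.0 * LAM * TARGET * theta)


def implicit_jacobian(theta, x):
    """d x_cf / d theta via the implicit function theorem: -H^{-1} @ M."""
    x_cf = solve_counterfactual(theta, x)
    resid = theta @ x_cf - TARGET
    # Mixed partial M[i, j] = d/d theta_j of (d G / d x_cf_i), holding x_cf fixed.
    mixed = 2.0 * LAM * (np.outer(theta, x_cf) + resid * np.eye(theta.size))
    return -np.linalg.solve(hessian_xcf(theta), mixed)


def finite_difference_jacobian(theta, x, eps=1e-6):
    """Central finite differences through the inner solver, used as a check."""
    jac = np.zeros((x.size, theta.size))
    for j in range(theta.size):
        step = np.zeros(theta.size)
        step[j] = eps
        plus = solve_counterfactual(theta + step, x)
        minus = solve_counterfactual(theta - step, x)
        jac[:, j] = (plus - minus) / (2.0 * eps)
    return jac


theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, -1.0])
jac_implicit = implicit_jacobian(theta, x)
jac_numeric = finite_difference_jacobian(theta, x)
print("max abs difference:", np.max(np.abs(jac_implicit - jac_numeric)))
```

In this toy setting the Hessian does not depend on $x_{cf}$, which keeps the check simple. For a non-quadratic objective the same formula would be evaluated at the counterfactual returned by the inner optimizer, and its accuracy would depend on how closely that point satisfies the stationarity condition.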
A principled approach for data bias mitigation
How do you know if your data is fair? And if it isn't, what can you do about it? Machine learning models are increasingly used to make high-stakes decisions, from predicting who gets a loan to estimating the likelihood that someone will reoffend. But these models are only as good as the data they learn from [Shahbazi 2023]. If the training data is biased, the model's decisions will likely be biased too [Hort 2024, Pagano 2023].
incorporate feedback into our final revision.

[R1]: "I don't exactly see if small batch vs large batch captures this phenomenon; if yes should say explicitly."
We thank the reviewers for the detailed and insightful reviews. As the reviews noted, our work 1) introduces "novel

Smith et al. [2017] make an explicit connection between small vs. large batch

"A small discussion on if the phenomenon has been observed for different datasets/tasks with different optimizers"

The phenomenon may not be true for other optimizers such as Adam, though.

"concept of "memorizable and generalizable", though intuitive, is sketchy and not formally explained ... authors

We acknowledge that the terms "memorizable" and "generalizable" are potentially confusing. We will revise our terminology to clarify this distinction. By "inherently noisy", we refer to the fact that high noise in the datapoints will necessitate larger sample complexity.