Suppose you have a prediction system h1 (for example, a photo tagger) whose output is consumed in the real world (for example, tagging the photos on your phone). Now you train a system h2 whose aggregate metrics suggest that it is better than h1. Consider an unlabeled dataset D of examples (a pool of all user photos). Prediction update refers to the process in which h2 is used to score the examples in D and update the predictions provided by h1. The problem is that even though h2 is better than h1 globally, we have not determined whether h2 is significantly worse for some users or for some specific pattern of examples.
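One way to make the concern concrete is to compare h1 and h2 per user rather than only in aggregate. The sketch below is illustrative only: it assumes a small labeled holdout set tagged with a user id is available (D itself is unlabeled, so this cannot be run on D directly), and the function name `per_group_regressions`, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def per_group_regressions(records, min_count=1):
    """Return groups where h2 is less accurate than h1.

    records: iterable of (group, label, pred_h1, pred_h2) tuples
             (a hypothetical layout assumed for this sketch).
    Returns a dict mapping group -> (accuracy_h1, accuracy_h2).
    """
    # Per group: [number of examples, h1 correct count, h2 correct count]
    stats = defaultdict(lambda: [0, 0, 0])
    for group, label, pred_h1, pred_h2 in records:
        s = stats[group]
        s[0] += 1
        s[1] += int(pred_h1 == label)
        s[2] += int(pred_h2 == label)

    regressions = {}
    for group, (n, correct_h1, correct_h2) in stats.items():
        # Flag the group only if it has enough examples and h2 got fewer right.
        if n >= min_count and correct_h2 < correct_h1:
            regressions[group] = (correct_h1 / n, correct_h2 / n)
    return regressions


# Toy holdout data: h2 is better overall (5/6 vs 3/6 correct),
# yet it is worse than h1 for the user "carol".
records = [
    ("alice", "cat", "cat", "cat"),
    ("alice", "dog", "cat", "dog"),
    ("bob", "cat", "dog", "cat"),
    ("bob", "dog", "cat", "dog"),
    ("carol", "cat", "cat", "dog"),
    ("carol", "dog", "dog", "dog"),
]

print(per_group_regressions(records))
```

In practice the per-group counts are often small, so a real check would likely pair this with a significance test or a minimum-count threshold (the `min_count` parameter here is a crude stand-in for that).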
Jul-25-2021, 05:00:14 GMT