Where Do We Go From Here? Guidelines For Offline Recommender Evaluation

Schnabel, Tobias

arXiv.org Artificial Intelligence 

Various studies in recent years have pointed out large issues in the offline evaluation of recommender systems [11, 12], making it difficult to assess whether true progress has been made. However, there has been little research into what set of practices should serve as a starting point during experimentation. In this paper, we examine four larger issues in recommender system research regarding uncertainty estimation, generalization, hyperparameter optimization and dataset pre-processing in more detail to arrive at a set of guidelines. We present TrainRec, a lightweight and flexible toolkit for offline training and evaluation of recommender systems that implements these guidelines. Unlike other frameworks, TrainRec is a toolkit that focuses on experimentation alone, offering flexible modules that can be used together or in isolation.

Despite growing work that tests recommender systems online, offline evaluation is still by far the most popular evaluation paradigm used in recent research publications [38]. What has been troubling is that an increasing amount of research has pointed out important issues with common protocols for offline evaluation of recommender systems [5, 12, 14, 24], even leading some researchers to publicly call it a community-wide crisis [11]. The cumulative effect of these issues became widely visible when Dacrema et al. [12] performed a series of reproducibility experiments showing that reported gains vanished in most cases when baselines were tuned properly. In a similar vein, Rendle et al. [32] showed that proper hyperparameter selection makes traditional matrix factorization-based approaches competitive with more recent methods. Overall, these discoveries mirror