linearity assumption
Review for NeurIPS paper: On Reward-Free Reinforcement Learning with Linear Function Approximation
I would just like to confirm my understanding of the algorithmic contributions of this work. As far as I understand, Jin et al. [2019] propose a learning algorithm for the standard RL case with linear function approximation in linear MDPs. Then Jin et al. [2020] propose a method for efficient exploration in the reward-free RL case. This is for general MDPs, but only in the tabular setting. In that work, exploration is achieved by constructing a reward function where the reward is 1 for states that are "significant", and 0 otherwise, and then solving the resulting task with an efficient learning algorithm.
Off-policy evaluation for slate recommendation
Swaminathan, Adith, Krishnamurthy, Akshay, Agarwal, Alekh, Dudik, Miro, Langford, John, Jose, Damien, Zitouni, Imed
This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.
Linearity assumption in Linear Regression
This is actually a good question. For a categorical variable, can the model say that some levels are significant and some levels are not? Typically, after a regression we look at the ANOVA (Analysis of Variance) table. There we have one row per independent variable. In other words, in my example we will see a single row corresponding to the variable COLOR (as opposed to, say, two rows for I_green and I_blue).
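To make the COLOR example concrete, here is a minimal sketch of how a three-level categorical variable turns into the two indicator columns I_green and I_blue. The baseline level "red", the sample data, and the helper name `dummy_code` are assumptions made for illustration; only COLOR, I_green, and I_blue come from the answer above.

```python
def dummy_code(values, baseline):
    """Return one 0/1 indicator column per non-baseline level.

    `baseline` is the reference level that gets no column of its own
    ("red" below is an assumed choice, not taken from the answer).
    """
    levels = sorted(set(values) - {baseline})
    return {
        f"I_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical sample of the COLOR variable.
colors = ["red", "green", "blue", "green", "red"]
indicators = dummy_code(colors, baseline="red")

# COLOR contributes one degree of freedom per indicator column.
color_df = len(indicators)

for name, column in indicators.items():
    print(name, column)
print("degrees of freedom for COLOR:", color_df)
```

The regression coefficient table would list I_blue and I_green separately, while the ANOVA table would typically report COLOR as a single row that tests both indicators jointly.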
Conquering the rating bound problem in neighborhood-based collaborative filtering: a function recovery approach
Huang, Junming, Cheng, Xue-Qi, Shen, Hua-Wei, Sun, Xiaoming, Zhou, Tao, Jin, Xiaolong
As an important tool for information filtering in the era of the socialized web, recommender systems have witnessed rapid development in the last decade. Benefiting from better interpretability, neighborhood-based collaborative filtering techniques, such as the item-based collaborative filtering adopted by Amazon, have achieved great success in many practical recommender systems. However, the neighborhood-based collaborative filtering method suffers from the rating bound problem, i.e., the rating that this method estimates for a target item is bounded by the observed ratings of all its neighboring items. Therefore, it cannot accurately estimate the unobserved rating on a target item if its ground truth rating is actually higher (lower) than the highest (lowest) rating over all items in its neighborhood. In this paper, we address this problem by formalizing rating estimation as a task of recovering a scalar rating function. With a linearity assumption, we infer all the ratings by optimizing the low-order norm, e.g., the $l_{1/2}$-norm, of the second derivative of the target scalar function, while keeping its observed ratings unchanged. Experimental results on three real datasets, namely Douban, Goodreads and MovieLens, demonstrate that the proposed approach can effectively overcome the rating bound problem. In particular, it improves the accuracy of rating estimation by 37% over conventional neighborhood-based methods.
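The rating bound problem described in this abstract can be illustrated with a small sketch. It assumes the conventional estimator is a similarity-weighted average with non-negative weights, which is a common textbook form and not necessarily the exact baseline used in the paper; the function name and the numbers are hypothetical.

```python
def neighborhood_estimate(neighbor_ratings, similarities):
    """Similarity-weighted average of the neighbors' observed ratings.

    Assumes non-negative similarities with a positive sum, so the result
    is a convex combination of the neighbor ratings.
    """
    if not neighbor_ratings or len(neighbor_ratings) != len(similarities):
        raise ValueError("need one similarity per neighbor rating")
    if any(s < 0 for s in similarities):
        raise ValueError("similarities must be non-negative")
    total = sum(similarities)
    if total == 0:
        raise ValueError("similarities must not all be zero")
    weighted = sum(r * s for r, s in zip(neighbor_ratings, similarities))
    return weighted / total


# Hypothetical neighborhood: no neighbor is rated above 4.
ratings = [3.0, 4.0, 4.0]
sims = [0.9, 0.5, 0.1]
estimate = neighborhood_estimate(ratings, sims)

# Whatever the weights, the estimate stays inside [min, max] of the neighbors,
# so a true rating of 5 can never be recovered from this neighborhood.
print(min(ratings) <= estimate <= max(ratings))
```

Because the estimate is a convex combination, it cannot leave the range of the observed neighbor ratings, which is the limitation the function recovery approach is meant to remove.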