Control Variates for Slate Off-Policy Evaluation

Apr-25-2026, 00:30:26 GMT–Neural Information Processing Systems

We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions, often termed slates. The problem is common in recommender systems and user-interface optimization, and it is particularly challenging because the action space is combinatorially large. Swaminathan et al. (2017) proposed the pseudoinverse (PI) estimator under the assumption that the conditional mean rewards are additive in actions. Using control variates, we consider a large class of unbiased estimators that includes as special cases the PI estimator and (asymptotically) its self-normalized variant. By optimizing over this class, we obtain new estimators with risk improvement guarantees over both the PI and the self-normalized PI estimators.
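The control-variate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a factored (per-slot product) logging policy, under which the slate-level PI weight is commonly written as G = 1 - K + sum_k pi_k / mu_k and has expectation 1 under the logging policy. The function names (`pi_weights`, `pi_estimate`, `cv_estimate`) and the single scalar coefficient `beta` are hypothetical choices made for illustration only.

```python
def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)


def pi_weights(target_probs, logging_probs):
    """Slate-level weights G = 1 - K + sum_k pi_k / mu_k.

    ASSUMPTION: the logging policy factorizes over the K slots, so each row
    holds the per-slot probabilities of the logged slate's actions.
    """
    weights = []
    for pi_row, mu_row in zip(target_probs, logging_probs):
        ratios = [p / m for p, m in zip(pi_row, mu_row)]
        weights.append(1.0 - len(ratios) + sum(ratios))
    return weights


def pi_estimate(weights, rewards):
    """Plain PI estimate: the sample mean of G * r."""
    return mean([g * r for g, r in zip(weights, rewards)])


def cv_estimate(weights, rewards, beta=None):
    """PI estimate corrected by the control variate (mean(G) - 1).

    A fixed beta keeps the estimate unbiased because E[G] = 1.
    If beta is None, a plug-in coefficient Cov(G*r, G) / Var(G) is estimated
    from the same sample (ASSUMPTION: illustrative choice; the plug-in
    generally introduces a small finite-sample bias).
    """
    gr = [g * r for g, r in zip(weights, rewards)]
    g_bar, gr_bar = mean(weights), mean(gr)
    if beta is None:
        var_g = mean([(g - g_bar) ** 2 for g in weights])
        cov = mean([(x - gr_bar) * (g - g_bar) for x, g in zip(gr, weights)])
        beta = cov / var_g if var_g > 0 else 0.0
    return gr_bar - beta * (g_bar - 1.0)


# Two logged slates with K = 2 slots each (made-up probabilities and rewards).
G = pi_weights([[0.5, 0.5], [0.25, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
rewards = [1.0, 2.0]
plain = pi_estimate(G, rewards)
corrected = cv_estimate(G, rewards)
```

In this sketch, `beta=0` recovers the plain PI estimate, and setting `beta` close to the true policy value behaves, to first order, like the self-normalized ratio mean(G * r) / mean(G). That mirrors the abstract's statement that both estimators fall within one class, over which the coefficient can then be optimized.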

artificial intelligence, estimator, machine learning

Genre:
- Research Report (0.68)

Industry:
- Media (0.69)
- Information Technology (0.46)

Technology:
- Information Technology
  - Data Science (0.94)
  - Artificial Intelligence
    - Machine Learning (1.00)
    - Representation & Reasoning > Personal Assistant Systems (0.88)
