Weighted importance sampling for off-policy learning with linear function approximation