Learning in complex action spaces without policy gradients