Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models