Learning to Reason from Feedback at Test-Time