Are we going MAD? Benchmarking Multi-Agent Debate between Language Models for Medical Q&A