A Benchmark for Long-Form Medical Question Answering