Large Language Model Benchmarks in Medical Tasks