Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations