Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models