Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
