StackEval: Benchmarking LLMs in Coding Assistance

Neural Information Processing Systems 

We also assess LLMs' proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.
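One bias the abstract highlights is self-preference: a judge model rating its own outputs more favorably than others'. As a rough illustration (not the paper's actual protocol; the data and function below are hypothetical), such a bias can be quantified by comparing, per judge, the acceptance rate on its own solutions against its acceptance rate on solutions authored by other models:

```python
from collections import defaultdict

# Toy (judge, author, accepted) verdicts -- purely illustrative data.
verdicts = [
    ("model_a", "model_a", True),
    ("model_a", "model_b", False),
    ("model_a", "model_b", True),
    ("model_b", "model_b", True),
    ("model_b", "model_a", True),
    ("model_b", "model_a", False),
]

def self_preference(verdicts):
    """Per judge: acceptance rate on own outputs minus rate on others'.

    A positive value means the judge accepts its own solutions more
    often than solutions written by other models.
    """
    own, other = defaultdict(list), defaultdict(list)
    for judge, author, accepted in verdicts:
        (own if judge == author else other)[judge].append(accepted)
    return {
        j: sum(own[j]) / len(own[j]) - sum(other[j]) / len(other[j])
        for j in own
        if other[j]  # only judges seen on both own and others' outputs
    }

print(self_preference(verdicts))
# → {'model_a': 0.5, 'model_b': 0.5}
```

Here each judge accepts 100% of its own solutions but only 50% of the other model's, yielding a +0.5 gap for both. In practice such an estimate would need many annotated examples and a human-labeled ground truth to separate genuine quality differences from bias.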
