HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
