Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents