Benchmarking Large Language Models on Controllable Generation under Diversified Instructions