Towards the Law of Capacity Gap in Distilling Language Models