Motion Generation from Fine-grained Textual Descriptions