Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning