Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering