Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning