Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment