Learning novel representations of variable sources from multi-modal $\textit{Gaia}$ data via autoencoders