Towards Fusing Point Cloud and Visual Representations for Imitation Learning