Unsupervised Learning of Spoken Language with Visual Context