Training a Vision Language Model as Smartphone Assistant