Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model