VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena