What to do if language models disagree? Black-box model ensembling for textual and visual question answering