Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?