Learning to Reason with Relational Video Representation for Question Answering