Towards Fine-Grained Video Question Answering