Towards Video Text Visual Question Answering: Benchmark and Baseline