Encoding and Controlling Global Semantics for Long-form Video Question Answering