STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering