Self-Chained Image-Language Model for Video Localization and Question Answering