eigv
Causal Understanding For Video Question Answering
Guda, Bhanu Prakash Reddy, Kulkarni, Tanmay, Sampath, Adithya, Sathyendra, Swarnashree Mysore
Video Question Answering is a challenging task, which requires the model to reason over multiple frames and understand the interaction between different objects to answer questions based on the context provided within the video, especially in datasets like NExT-QA (Xiao et al., 2021a) which emphasize on causal and temporal questions. Previous approaches leverage either sub-sampled information or causal intervention techniques along with complete video features to tackle the NExT-QA task. In this work we elicit the limitations of these approaches and propose solutions along four novel directions of improvements on theNExT-QA dataset. Our approaches attempts to compensate for the shortcomings in the previous works by systematically attacking each of these problems by smartly sampling frames, explicitly encoding actions and creating interventions that challenge the understanding of the model. Overall, for both single-frame (+6.3%) and complete-video (+1.1%) based approaches, we obtain the state-of-the-art results on NExT-QA dataset.
Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
Lin, Zhen, Trivedi, Shubhendu, Sun, Jimeng
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic either due to the closed-source nature of the latest LLMs or computational constraints. In this work, we investigate UQ in NLG for black-box LLMs. We first differentiate uncertainty vs confidence: the former refers to the "dispersion" of the potential predictions for a fixed input, and the latter refers to the confidence on a particular prediction/generation. We then propose and compare several confidence/uncertainty metrics, applying them to selective NLG where unreliable results could either be ignored or yielded for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple metric for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
Equivariant and Invariant Grounding for Video Question Answering
Li, Yicong, Wang, Xiang, Xiao, Junbin, Chua, Tat-Seng
Video Question Answering (VideoQA) is the task of answering the natural language questions about a video. Producing an answer requires understanding the interplay across visual scenes in video and linguistic semantics in question. However, most leading VideoQA models work as black boxes, which make the visual-linguistic alignment behind the answering process obscure. Such black-box nature calls for visual explainability that reveals ``What part of the video should the model look at to answer the question?''. Only a few works present the visual explanations in a post-hoc fashion, which emulates the target model's answering process via an additional method. Nonetheless, the emulation struggles to faithfully exhibit the visual-linguistic alignment during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene to yield answers, while rolling out the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, the equivariant grounding encourages the answering to be sensitive to the semantic changes in the causal scene and question; in contrast, the invariant grounding enforces the answering to be insensitive to the changes in the environment scene. By imposing them on the answering process, EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment. Extensive experiments on three benchmark datasets justify the superiority of EIGV in terms of accuracy and visual interpretability over the leading baselines.