Xiao, Qian
Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
Chen, Xiangnan, Fang, Yuancheng, Xiao, Qian, Li, Juncheng, Lin, Jun, Tang, Siliang, Yang, Yi, Zhuang, Yueting
Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs' ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task.
LuminLab: An AI-Powered Building Retrofit and Energy Modelling Platform
Credit, Kevin, Xiao, Qian, Lehane, Jack, Vazquez, Juan, Liu, Dan, De Figueiredo, Leo
This paper describes the technical and conceptual development of the LuminLab platform, an online tool that integrates a purpose-fit human-centric AI chatbot and predictive energy model into a streamlined front-end that can rapidly produce and discuss building retrofit plans in natural language. The platform provides users with the ability to engage with a range of possible retrofit pathways tailored to their individual budget and building needs on-demand. Given the complicated and costly nature of building retrofit projects, which rely on a variety of stakeholder groups with differing goals and incentives, we feel that AI-powered tools such as this have the potential to pragmatically de-silo knowledge, improve communication, and empower individual homeowners to undertake incremental retrofit projects that might not happen otherwise.
Global Structure Knowledge-Guided Relation Extraction Method for Visually-Rich Document
Chen, Xiangnan, Xiao, Qian, Li, Juncheng, Dong, Duo, Lin, Jun, Liu, Xiaozhong, Tang, Siliang
Visual Relation Extraction (VRE) is a powerful means of discovering relationships between entities within visually-rich documents. Existing methods often focus on manipulating entity features to find pairwise relations, yet neglect the more fundamental structural information that links disparate entity pairs together. The absence of global structure information may make the model struggle to learn long-range relations and easily predict conflicted results. To alleviate such limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework. GOSE initiates by generating preliminary relation predictions on entity pairs extracted from a scanned image of the document. Subsequently, global structural knowledge is captured from the preceding iterative predictions, which are then incorporated into the representations of the entities. This "generate-capture-incorporate" cycle is repeated multiple times, allowing entity representations and global structure knowledge to be mutually reinforced. Extensive experiments validate that GOSE not only outperforms existing methods in the standard fine-tuning setting but also reveals superior cross-lingual learning capabilities; indeed, even yields stronger data-efficient performance in the low-resource setting. The code for GOSE will be available at https://github.com/chenxn2020/GOSE.
Compact Twice Fusion Network for Edge Detection
Li, Yachuan, Li, Zongmin, P., Xavier Soria, Yang, Chaozhi, Xiao, Qian, Bai, Yun, Li, Hua, Wang, Xiangdong
The significance of multi-scale features has been gradually recognized by the edge detection community. However, the fusion of multi-scale features increases the complexity of the model, which is not friendly to practical application. In this work, we propose a Compact Twice Fusion Network (CTFN) to fully integrate multi-scale features while maintaining the compactness of the model. CTFN includes two lightweight multi-scale feature fusion modules: a Semantic Enhancement Module (SEM) that can utilize the semantic information contained in coarse-scale features to guide the learning of fine-scale features, and a Pseudo Pixel-level Weighting (PPW) module that aggregate the complementary merits of multi-scale features by assigning weights to all features. Notwithstanding all this, the interference of texture noise makes the correct classification of some pixels still a challenge. For these hard samples, we propose a novel loss function, coined Dynamic Focal Loss, which reshapes the standard cross-entropy loss and dynamically adjusts the weights to correct the distribution of hard samples. We evaluate our method on three datasets, i.e., BSDS500, NYUDv2, and BIPEDv2. Compared with state-of-the-art methods, CTFN achieves competitive accuracy with less parameters and computational cost. Apart from the backbone, CTFN requires only 0.1M additional parameters, which reduces its computation cost to just 60% of other state-of-the-art methods. The codes are available at https://github.com/Li-yachuan/CTFN-pytorch-master.