Luo, Yin
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Wei, Jingxuan, Xu, Nan, Chang, Guiyong, Luo, Yin, Yu, BiHui, Guo, Ruifeng
The goal of multimodal chart question answering is to automatically answer a natural language question about a chart to facilitate visual data analysis (Hoque et al., 2022), where the ability to understand and interact with visual data is essential (Masry et al., 2022). It has emerged as a crucial intersection of computer vision and natural language processing, addressing the growing demand for intelligent systems capable of interpreting complex visual data in charts (Masry et al., 2022). Beyond its general applications, multimodal chart question answering plays a pivotal role in sectors requiring precise and rapid analysis of visual data. In the financial domain, it is indispensable for tasks such as financial report analysis (Wang et al., 2023a), decision support (Kafle et al., 2020), invoice parsing (Gerling and Lessmann, 2023), and contract review (Jie et al., 2023). Similarly, in the medical field, it significantly contributes to the digitization of patient records (Xu et al., 2021), medical insurance review (Meskó, 2023), diagnostic assistance (Othmani and Zeghina, 2022), and quality control of medical records (Schilcher et al., 2024). Owing to the richness and ambiguity of natural language and the complexity of visual reasoning, the multimodal chart question answering task requires predicting answers at the intersection of information visualization, natural language processing, and human-computer interaction (Hoque et al., 2022). Early approaches applied natural language processing techniques that relied largely on heuristics or grammar-based parsing (Setlur et al., 2016; Srinivasan and Stasko, 2017; Hoque et al., 2017; Gao et al., 2015). Because of their insufficient handling of complex linguistic phenomena, over-reliance on grammatical rules, and limited depth of natural language understanding, deep learning models have since been introduced for understanding natural language queries about visualizations (Chaudhry et al., 2020; Singh and Shekhar, 2020; Reddy et al., 2019).
YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal Information Extraction
Xiao, Xinglin, Wang, Yijie, Xu, Nan, Wang, Yuqi, Yang, Hanxuan, Wang, Minzheng, Luo, Yin, Wang, Lei, Mao, Wenji, Zeng, Daniel
The difficulty of the information extraction task lies in handling task-specific label schemas and heterogeneous data structures. Recent work has proposed methods based on large language models to uniformly model different information extraction tasks. However, these existing methods fall short in their information extraction capabilities for languages other than English, such as Chinese. In this paper, we propose an end-to-end chat-enhanced instruction tuning framework for universal information extraction (YAYI-UIE) that supports both Chinese and English. Specifically, we utilize dialogue data and information extraction data jointly to enhance information extraction performance. Experimental results show that our proposed framework achieves state-of-the-art performance on Chinese datasets and comparable performance on English datasets under both supervised and zero-shot settings.
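A minimal sketch of the joint-training idea described above, under the assumption that both data sources are cast into one instruction schema; the field names, conversion helpers, and example texts here are hypothetical and do not reproduce the released YAYI-UIE format.

def ie_to_instruction(text, entity_types, spans):
    # Hypothetical conversion of a named-entity example into an instruction sample.
    return {
        "instruction": f"Extract entities of the following types: {', '.join(entity_types)}.",
        "input": text,
        "output": "; ".join(f"{label}: {span}" for label, span in spans),
    }

def chat_to_instruction(user_turn, assistant_turn):
    # Hypothetical conversion of a dialogue turn into the same schema.
    return {"instruction": user_turn, "input": "", "output": assistant_turn}

# A joint mixture of information extraction and chat samples for instruction tuning.
joint_corpus = [
    ie_to_instruction(
        "The workshop was held in Beijing in 2023.",
        ["location", "date"],
        [("location", "Beijing"), ("date", "2023")],
    ),
    chat_to_instruction(
        "What does universal information extraction mean?",
        "It means covering entity, relation, and event extraction with a single model and interface.",
    ),
]

for sample in joint_corpus:
    print(sample)

Under this assumption, the same instruction-following objective is applied to both kinds of samples, which is one plausible way dialogue data could enhance extraction performance.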
Rational Sensibility: LLM Enhanced Empathetic Response Generation Guided by Self-presentation Theory
Sun, Linzhuang, Xu, Nan, Wei, Jingxuan, Yu, Bihui, Bu, Liping, Luo, Yin
Having the ability to empathize is crucial for accurately representing human behavior during conversations. Although numerous studies aim to improve the cognitive capability of models by incorporating external knowledge, limited attention has been paid to the sensible and rational expression of the conversation itself, which are crucial components of cognitive empathy. Guided by self-presentation theory in sociology, we design a novel categorical approach that segregates historical dialogues into sensible and rational sentences and subsequently elucidates the context through a designed attention mechanism (a toy illustration follows below). However, the rational information within the conversation is limited, and the external knowledge used in previous methods suffers from semantic contradictions and a narrow field of vision. Considering the impressive performance of LLMs in the domain of intelligent agents, we employ LLaMA2-70b as a rational brain to analyze the profound logical information maintained in conversations, which assists the model in assessing the balance between sensibility and rationality to produce high-quality empathetic responses. Experimental evaluations demonstrate that our method outperforms comparable methods on both automatic and human evaluations.
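A toy sketch of the two-stream idea mentioned above, assuming a keyword-based splitter and a stand-in sentence encoder; the actual classifier, encoders, and attention design of the paper are not reproduced here.

import numpy as np

def toy_split(history):
    # Hypothetical rule: sentences containing emotion words go to the sensible stream.
    emotion_words = {"feel", "sad", "happy", "worried", "sorry"}
    sensible, rational = [], []
    for sent in history:
        (sensible if emotion_words & set(sent.lower().split()) else rational).append(sent)
    return sensible, rational

def toy_encode(sentences, dim=8, seed=0):
    # Stand-in encoder: averages deterministic random word vectors per sentence stream.
    rng = np.random.default_rng(seed)
    vocab, vecs = {}, []
    for sent in sentences:
        words = sent.lower().split()
        for w in words:
            if w not in vocab:
                vocab[w] = rng.normal(size=dim)
        vecs.append(np.mean([vocab[w] for w in words], axis=0))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def fuse(query, sensible_vec, rational_vec):
    # Attention over the two stream summaries; the weighted sum is the fused context.
    keys = np.stack([sensible_vec, rational_vec])
    scores = keys @ query / np.sqrt(len(query))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights, weights @ keys

history = ["I feel so sad about the exam.", "The exam covered three chapters."]
sensible, rational = toy_split(history)
query = toy_encode(history)
weights, context = fuse(query, toy_encode(sensible), toy_encode(rational))
print("attention over [sensible, rational]:", weights.round(3))

The attention weights indicate how much the fused context draws on the sensible versus the rational stream, which is the balance the abstract refers to.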
YAYI 2: Multilingual Open-Source Large Language Models
Luo, Yin, Kong, Qingchao, Xu, Nan, Cao, Jia, Hao, Bao, Qu, Baoyu, Chen, Bo, Zhu, Chao, Zhao, Chenyang, Zhang, Donglei, Feng, Fan, Zhao, Feifei, Sun, Hailong, Yang, Hanxuan, Pan, Haojun, Liu, Hongyu, Guo, Jianbin, Du, Jiangtao, Wang, Jingyi, Li, Junfeng, Sun, Lei, Liu, Liduo, Dong, Lifeng, Liu, Lili, Wang, Lin, Zhang, Liwen, Wang, Minzheng, Wang, Pin, Yu, Ping, Li, Qingxiao, Yan, Rui, Zou, Rui, Li, Ruiqun, Huang, Taiwen, Wang, Xiaodong, Wu, Xiaofei, Peng, Xin, Zhang, Xina, Fang, Xing, Xiao, Xinglin, Hao, Yanni, Dong, Yao, Wang, Yigang, Liu, Ying, Jiang, Yongyu, Wang, Yungan, Wang, Yuqi, Wang, Zhangsheng, Yu, Zhaoxin, Luo, Zhen, Mao, Wenji, Wang, Lei, Zeng, Dajun
As the latest advancement in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks and have even been regarded as a potential path to artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been released and achieve performance comparable to proprietary models. However, these models are primarily designed for English scenarios and exhibit poor performance in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus containing 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other similarly sized open-source models.
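A minimal sketch of what a pre-training data processing pipeline of the kind mentioned above could look like, with generic normalization, quality-filtering, and exact-deduplication stages; the stage names, thresholds, and rules here are assumptions and do not reproduce the report's actual pipeline.

import hashlib
import re

def normalize(doc):
    # Collapse whitespace and strip leading/trailing blanks.
    return re.sub(r"\s+", " ", doc).strip()

def quality_ok(doc, min_chars=20):
    # Hypothetical quality rule: keep documents above a minimum length.
    return len(doc) >= min_chars

def dedup(docs):
    # Exact deduplication via content hashes (real pipelines typically add fuzzy dedup).
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

raw_corpus = [
    "Large language models  are trained on web-scale text.",
    "Large language models are trained on web-scale text.",
    "too short",
]
cleaned = [normalize(d) for d in raw_corpus]
filtered = [d for d in cleaned if quality_ok(d)]
print(dedup(filtered))

Chaining such filters before tokenization is the general pattern for producing a curated multilingual pre-training corpus; the specific filters used for YAYI 2 are described in the technical report itself.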