Jiang, Meng
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
Zhu, Zifeng, Jia, Mengzhao, Zhang, Zhihan, Li, Lang, Jiang, Meng
Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps relative to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at https://github.com/Zivenzhu/Multi-chart-QA.
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
Wu, Zhenyu, Zeng, Qingkai, Zhang, Zhihan, Tan, Zhaoxuan, Shen, Chao, Jiang, Meng
Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest-scored one as the final answer to mathematical reasoning problems. However, this repeated independent process often leads to the same mistakes, so the selected solution may still be incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption, since fewer paths need to be generated. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4 while reducing token consumption by 77.8%.
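The verify-then-revise loop described above can be sketched as follows. This is a toy illustration, not the paper's implementation: `verify` stands in for the process-supervised verifier (here a trivial rule that flags steps containing "ERR"), and `revise` stands in for the LLM's revision step.

```python
# Toy sketch of an iterative verify-then-revise loop in the style of StepCo.
# `verify` and `revise` are hypothetical stand-ins for the paper's
# process-supervised verifier and LLM-based revision.

def verify(steps):
    """Return the index of the first flagged step, or -1 if all pass.
    Stand-in rule: a step is 'incorrect' if it contains 'ERR'."""
    for i, step in enumerate(steps):
        if "ERR" in step:
            return i
    return -1

def revise(step):
    """Stand-in revision: repair the flagged step."""
    return step.replace("ERR", "OK")

def stepco(steps, max_rounds=5):
    """Alternate verification and revision until the path passes
    or the round budget is exhausted."""
    for _ in range(max_rounds):
        bad = verify(steps)
        if bad == -1:
            break
        steps = steps[:bad] + [revise(steps[bad])] + steps[bad + 1:]
    return steps

path = ["2 + 3 = 5", "5 * ERR = ?", "answer: 20"]
fixed = stepco(path)
```

Because each round revises only the first flagged step, a path with several errors is repaired progressively rather than regenerated from scratch, which is where the token savings come from.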
MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media
Zhai, Wei, Bai, Nan, Zhao, Qing, Li, Jianqiang, Wang, Fan, Qi, Hongzhi, Jiang, Meng, Wang, Xiaoqin, Yang, Bing Xiang, Fu, Guanghui
As mental health challenges become more prevalent, social media has emerged as a key platform for individuals to express their emotions. Deep learning is a promising approach for analyzing mental health on social media. However, black-box models are often inflexible when switching between tasks, and their results typically lack explanations. With the rise of large language models (LLMs), their flexibility has introduced new approaches to the field. Moreover, owing to their generative nature, they can be prompted to explain their decision-making processes. However, their performance on complex psychological analysis still lags behind deep learning. In this paper, we introduce the first multi-task Chinese Social Media Interpretable Mental Health Instructions (C-IMHI) dataset, consisting of 9K samples, which has been quality-controlled and manually validated. We also propose the MentalGLM series, the first open-source LLMs designed for explainable mental health analysis on Chinese social media, trained on a corpus of 50K instructions. The proposed models were evaluated on three downstream tasks and achieved better or comparable performance relative to deep learning models, generalized LLMs, and task-fine-tuned LLMs. We validated a portion of the generated decision explanations with experts, with promising results. We also evaluated the proposed models on a clinical dataset, where they outperformed other LLMs, indicating their potential applicability in the clinical field. Our models show strong performance, validated across tasks and perspectives. The decision explanations enhance usability and facilitate better understanding and practical application of the models. Both the constructed dataset and the models are publicly available at: https://github.com/zwzzzQAQ/MentalGLM.
TOWER: Tree Organized Weighting for Evaluating Complex Instructions
Ziems, Noah, Zhang, Zhihan, Jiang, Meng
Evaluating the ability of large language models (LLMs) to follow complex human-written instructions is essential for their deployment in real-world applications. While benchmarks like Chatbot Arena use human judges to assess model performance, they are resource-intensive and time-consuming. Alternative methods using LLMs as judges, such as AlpacaEval, MT Bench, WildBench, and InFoBench, offer improvements but still do not capture the fact that certain aspects of a complex instruction are more important to follow than others. To address this gap, we propose a novel evaluation metric, TOWER, that incorporates human-judged importance into the assessment of complex instruction following. We show that human annotators agree with tree-based representations of these complex instructions nearly as much as they agree with other human annotators. We release tree-based annotations of the InFoBench dataset and the corresponding evaluation code to facilitate future research.
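The idea of importance-weighted scoring over a tree of instruction aspects can be sketched as below. The tree structure, weights, and aspect names here are illustrative assumptions, not TOWER's actual annotation scheme.

```python
# Minimal sketch of tree-organized, importance-weighted scoring of
# instruction-following judgments. Structure and weights are illustrative.

def tower_score(node, judgments):
    """Recursively compute a weighted score in [0, 1].
    A leaf is {'id': ..., 'weight': w}; an internal node is
    {'weight': w, 'children': [...]}. `judgments` maps each leaf id
    to 1.0 (aspect followed) or 0.0 (aspect violated)."""
    if "children" not in node:
        return judgments[node["id"]]
    total = sum(child["weight"] for child in node["children"])
    return sum(child["weight"] * tower_score(child, judgments)
               for child in node["children"]) / total

# Hypothetical instruction tree: content aspects weighted 3x over format.
tree = {
    "weight": 1.0,
    "children": [
        {"id": "format", "weight": 1.0},
        {"weight": 3.0, "children": [
            {"id": "covers_topic", "weight": 2.0},
            {"id": "cites_source", "weight": 1.0},
        ]},
    ],
}
score = tower_score(tree, {"format": 0.0, "covers_topic": 1.0, "cites_source": 1.0})
# A response that violates only the low-weight format aspect still scores 0.75,
# whereas an unweighted average over the three leaves would give 0.67.
```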
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
Liu, Gang, Sun, Michael, Matusik, Wojciech, Jiang, Meng, Chen, Jie
While large language models (LLMs) have integrated images, adapting them to graphs remains challenging, limiting their applications in materials and drug design. This difficulty stems from the need for coherent autoregressive generation across texts and graphs. To address this, we introduce Llamole, the first multimodal LLM capable of interleaved text and graph generation, enabling molecular inverse design with retrosynthetic planning. Llamole integrates a base LLM with the Graph Diffusion Transformer and Graph Neural Networks for multi-conditional molecular generation and reaction inference within texts, while the LLM, with enhanced molecular understanding, flexibly controls activation among the different graph modules. Additionally, Llamole integrates A* search with LLM-based cost functions for efficient retrosynthetic planning. We create benchmarking datasets and conduct extensive experiments to evaluate Llamole against in-context learning and supervised fine-tuning. Llamole significantly outperforms 14 adapted LLMs across 12 metrics for controllable molecular design and retrosynthetic planning.
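The retrosynthetic-planning component can be illustrated with a toy A* search. The reaction graph, costs, and heuristic table below are invented for illustration; in Llamole the heuristic comes from LLM-based cost functions rather than a fixed lookup.

```python
# Toy A* retrosynthetic search over a hypothetical single-precursor
# reaction graph: find a low-cost route from a target molecule back to
# a purchasable (in-stock) starting material.
import heapq

precursors = {              # molecule -> [(precursor, reaction cost), ...]
    "D": [("C", 1.0), ("B", 4.0)],
    "C": [("B", 1.0)],
    "B": [("A", 1.0)],
}
# Heuristic: estimated remaining cost to reach stock (LLM-derived in Llamole).
heuristic = {"D": 3.0, "C": 2.0, "B": 1.0, "A": 0.0}

def astar_route(target, stock):
    """Return (total cost, route) from `target` back to a stock molecule,
    or None if no route exists."""
    frontier = [(heuristic[target], 0.0, target, [target])]  # (f, g, mol, route)
    seen = set()
    while frontier:
        _, g, mol, route = heapq.heappop(frontier)
        if mol in stock:
            return g, route
        if mol in seen:
            continue
        seen.add(mol)
        for pre, cost in precursors.get(mol, []):
            f = g + cost + heuristic[pre]
            heapq.heappush(frontier, (f, g + cost, pre, route + [pre]))
    return None

cost, route = astar_route("D", stock={"A"})
```

The heuristic steers the search toward the cheap D→C→B→A route and away from the expensive direct D→B reaction, which is the efficiency argument for combining A* with learned cost estimates.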
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Jia, Mengzhao, Yu, Wenhao, Ma, Kaixin, Fang, Tianqing, Zhang, Zhihan, Ouyang, Siru, Zhang, Hongming, Jiang, Meng, Yu, Dong
Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, a MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities in text-rich, multi-image evaluations and competitive performance in general domain evaluations.
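The budget trade-off behind the adaptive encoding module can be sketched as a proportional allocation of a fixed visual-token budget across images. This only illustrates the allocation idea; the actual Leopard module operates on encoder features and resolutions, and the function below is a hypothetical simplification.

```python
# Illustrative sketch of splitting a fixed visual-sequence-length budget
# across multiple input images in proportion to their pixel counts.

def allocate_tokens(resolutions, budget):
    """Split `budget` visual tokens across images proportionally to
    pixel count, giving each image at least one token."""
    pixels = [w * h for w, h in resolutions]
    total = sum(pixels)
    alloc = [max(1, round(budget * p / total)) for p in pixels]
    # Trim any rounding overshoot from the largest allocation.
    while sum(alloc) > budget:
        alloc[alloc.index(max(alloc))] -= 1
    return alloc

# A high-resolution page gets more of the budget than a small thumbnail.
shares = allocate_tokens([(1024, 768), (512, 384), (2048, 1536)], budget=1024)
```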
LLMEasyQuant -- An Easy to Use Toolkit for LLM Quantization
Liu, Dong, Jiang, Meng, Pister, Kaiser
Quantization is the process of mapping a large set of input values to a smaller set of output values, often integers. It is a key technique in digital signal processing, where continuous signals are mapped to discrete digital values; it reduces the data's precision to make storage and computation more efficient while attempting to retain essential information. With the development of Large Language Models (LLMs), models have grown extremely large, so memory usage and inference speed are greatly limited by model size. Consequently, as one of the most popular techniques for model compression, quantization now has many variants used for LLM compression and inference acceleration. The goal of quantization in LLMs is to reduce model size while minimizing the impact on inference.
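The mapping described above can be made concrete with a minimal symmetric int8 scheme. This is a generic sketch of per-tensor quantization, not LLMEasyQuant's actual API.

```python
# Minimal symmetric int8 quantization of a weight vector:
# floats are mapped to integers in [-127, 127] via one per-tensor scale.

def quantize_int8(weights):
    """Return (int8 values, scale) such that w ~= q * scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25]
q, s = quantize_int8(weights)
approx = dequantize(q, s)
```

Each int8 value occupies one byte instead of four (fp32), a 4x storage reduction, at the cost of a small rounding error bounded by half the scale.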
GraphSnapShot: Graph Machine Learning Acceleration with Fast Storage and Retrieval
Liu, Dong, Waleffe, Roger, Jiang, Meng, Venkataraman, Shivaram
In our recent research, we developed a framework called GraphSnapShot, which has proven a useful tool for graph learning acceleration. The core idea of GraphSnapShot is to capture and update the state of local graph structures dynamically, much like taking snapshots of graphs. GraphSnapShot is designed to efficiently capture, store, and update dynamic snapshots of graph data, enabling us to track patterns in the structure of graph networks. This technique is useful for most graph learning tasks that rely on topology analysis or operate on constantly evolving networks, such as social media analysis, biological networks, or any system where the relationships between entities change over time. The key component of GraphSnapShot is the GraphSDSampler, which can efficiently capture, update, retrieve, and store graph topology snapshots while performing computation at the same time, making graph learning significantly faster. In experiments, GraphSnapShot improves computation speed significantly compared to the traditional NeighborSampler implemented in DGL, and it reduces GPU and memory usage during training with little loss of accuracy. Experimental results show that GraphSnapShot has the potential to be a powerful tool for accelerating large-graph training.
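The snapshot idea can be sketched as a sampler that caches sampled neighborhoods and refreshes them periodically instead of re-sampling the graph on every access. This is an illustrative toy, not the actual GraphSDSampler implementation, and the refresh policy shown is an assumption.

```python
# Toy sketch of snapshot-style neighbor caching: sampled neighborhoods
# are reused for several accesses before being re-sampled ("refreshed").
import random

class SnapshotSampler:
    def __init__(self, adj, fanout, refresh_every=10, seed=0):
        self.adj = adj                 # node -> list of neighbor nodes
        self.fanout = fanout           # neighbors sampled per node
        self.refresh_every = refresh_every
        self.rng = random.Random(seed)
        self.cache = {}                # node -> (snapshot, age in accesses)

    def sample(self, node):
        snap, age = self.cache.get(node, (None, self.refresh_every))
        if age >= self.refresh_every:  # missing or stale: retake the snapshot
            nbrs = self.adj[node]
            snap = self.rng.sample(nbrs, min(self.fanout, len(nbrs)))
            age = 0
        self.cache[node] = (snap, age + 1)
        return snap

adj = {0: [1, 2, 3, 4], 1: [0], 2: [0]}
sampler = SnapshotSampler(adj, fanout=2, refresh_every=3)
first = sampler.sample(0)   # fresh snapshot of node 0's neighborhood
```

Between refreshes, repeated accesses hit the cache, which is where the speed and memory savings relative to per-step neighbor sampling come from.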
MT2ST: Adaptive Multi-Task to Single-Task Learning
Liu, Dong, Jiang, Meng
Conventional training approaches often face challenges in balancing the breadth of multi-task learning (MTL) with the depth of single-task learning (STL). To address this issue, we introduce the Multi-Task to Single-Task (MT2ST) framework, a novel approach that combines the generalizability of MTL with the precision of STL, significantly enhancing the efficiency and accuracy of word embedding training while addressing prevalent issues such as overfitting. Our work includes two strategies: 'Diminish' and 'Switch'. The 'Diminish' strategy gradually reduces the influence of auxiliary tasks, while the 'Switch' strategy shifts from multi-tasking to single-tasking at a specific point in the training process. Our empirical studies demonstrate that MT2ST reduces training time by 67% compared to single-task learning approaches, and by 13% compared to traditional multi-task learning methods. These findings underscore MT2ST's potential as a powerful tool for accelerating word embedding training.
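The two strategies can be sketched as auxiliary-loss weight schedules. The linear decay and hard cutoff below are illustrative formulas; the paper's exact schedules may differ.

```python
# Sketch of the two MT2ST-style auxiliary-task weight schedules:
# 'Diminish' decays the auxiliary loss weight smoothly; 'Switch' drops
# it to zero at a chosen step, reverting to pure single-task training.

def diminish_weight(step, total_steps):
    """Linearly decay the auxiliary-task weight from 1.0 to 0.0."""
    return max(0.0, 1.0 - step / total_steps)

def switch_weight(step, switch_step):
    """Full multi-task weight until `switch_step`, then single-task only."""
    return 1.0 if step < switch_step else 0.0

def total_loss(main_loss, aux_loss, weight):
    """Combined objective: main task plus weighted auxiliary tasks."""
    return main_loss + weight * aux_loss

w_diminish = diminish_weight(50, total_steps=100)  # halfway: weight 0.5
w_switch = switch_weight(80, switch_step=50)       # past the switch: weight 0.0
```

Early in training both schedules behave like MTL (auxiliary tasks regularize the shared representation); late in training both behave like STL, focusing capacity on the main task.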
Large Language Models Can Self-Correct with Minimal Effort
Wu, Zhenyu, Zeng, Qingkai, Zhang, Zhihan, Tan, Zhaoxuan, Shen, Chao, Jiang, Meng
Intrinsic self-correction is a method that instructs large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, prior work concluded that LLMs could not yet self-correct their reasoning. We find that a simple yet effective verification method can unleash the inherent capabilities of LLMs: mask a key condition in the question, append the current response to construct a verification question, and predict the condition to verify the response. The condition can be an entity in an open-domain question or a numeric value in a math question, and it requires minimal effort (via prompting) to identify. We propose an iterative verify-then-correct framework, named ProCo, to progressively identify and correct (probably) false responses. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo as the backend LLM, yields +6.8 exact match on four open-domain question answering datasets, +14.1 accuracy on three arithmetic reasoning datasets, and +9.6 accuracy on a commonsense reasoning dataset, compared to Self-Correct.
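The mask-and-verify idea can be sketched on a toy math question. The prompt template and the stand-in solvers below are illustrative; the real method prompts an LLM to answer the verification question.

```python
# Toy sketch of ProCo-style verification: mask a key condition, append
# the candidate answer, and check whether solving the masked question
# recovers the condition.

def make_verification_question(question, key_value, answer):
    """Replace the key condition with X and append the candidate answer."""
    masked = question.replace(str(key_value), "X")
    return f"{masked} The answer is {answer}. What is X?"

def verify(question, key_value, answer, solve):
    """The answer passes if the solver recovers the masked condition."""
    vq = make_verification_question(question, key_value, answer)
    return solve(vq) == key_value

q = "Tom has 5 apples and buys 3 more. How many apples does he have?"
# Stand-in solver for the masked question: X + 3 = answer, so X = answer - 3.
passes = verify(q, key_value=5, answer=8, solve=lambda vq: 8 - 3)
fails = verify(q, key_value=5, answer=9, solve=lambda vq: 9 - 3)
```

A correct answer (8) is consistent with the masked condition, so verification recovers X = 5; an incorrect answer (9) yields X = 6, flagging the response for another round of correction.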