Collaborating Authors

 Zhao, Wanxu


Distill Visual Chart Reasoning Ability from LLMs to MLLMs

arXiv.org Artificial Intelligence

Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and conducting reasoning over it. Thus, a promising approach to enhance MLLMs is to construct relevant training data focusing on the two aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient, and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Using this approach, we construct ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs, to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks like MathVista.

Multimodal large language models (MLLMs) have made significant achievements, particularly in visual recognition tasks (OpenAI, 2024a; Anthropic, 2024). While they can handle simple visual inputs well, there has been a growing emphasis on complex chart understanding, driven by the widespread use of charts in real-world contexts (Masry et al., 2022; Huang et al., 2024). However, addressing reasoning-intensive questions involving charts remains challenging for these models. Existing benchmarks underscore the need for more advanced and generalized visual reasoning abilities, which are still underdeveloped in current MLLMs (Wang et al., 2024c; Lu et al., 2024). Our analysis of the error distribution in ChartQA (Figure 1) also highlights two main types of model failure: 62% of errors stem from misrecognition, while 36% arise from reasoning mistakes made after correct recognition. This shows that even advanced MLLMs struggle with basic recognition and often make superficial reasoning errors. In contrast, humans excel at these tasks by purposefully identifying query-relevant information from images and engaging in step-by-step reasoning (Wang et al., 2024c;a). In light of these findings, enabling models to solve problems in a human-like manner becomes essential for advancing visual reasoning performance. One promising strategy is to distill the rationales of reasoning from experts, such as humans or stronger models (Han et al., 2023; Meng et al., 2024; Masry et al., 2024a;b). However, creating high-quality training data for chart-related tasks is costly and time-consuming.
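The abstract describes code acting as a bridge between modalities: a text-only LLM writes chart-plotting code, the code is executed to render the chart image, and the same code lets the LLM write Q&A pairs grounded in the chart's underlying data. Below is a minimal sketch of such a pipeline, assuming a generic `llm` callable; the helper names and prompt wording are illustrative assumptions and do not reproduce the paper's exact CIT recipe.

```python
# Minimal sketch of a Code-as-Intermediary Translation (CIT) style pipeline.
# The `llm` callable, prompts, and helper names below are illustrative
# assumptions, not the authors' released implementation.
import subprocess
import tempfile
from pathlib import Path

def generate_chart_code(llm, topic: str) -> str:
    """Ask a text-only LLM to synthesize matplotlib plotting code for a chart.
    The code is the textual 'intermediary' representation of the chart."""
    prompt = (
        "Write self-contained matplotlib code that draws a reasoning-intensive "
        f"chart about: {topic}. Save the figure to 'chart.png'."
    )
    return llm(prompt)  # hypothetical LLM call returning a code string

def render_chart(code: str, out_dir: Path) -> Path:
    """Execute the plotting code to obtain the visual modality (the image)."""
    script = out_dir / "plot.py"
    script.write_text(code)
    subprocess.run(["python", str(script)], cwd=out_dir, check=True)
    return out_dir / "chart.png"

def generate_qa_pairs(llm, code: str, n: int = 5) -> list:
    """Because the LLM reads the code (not the image), it can write multi-step
    questions and verifiable answers grounded in the chart's underlying data."""
    prompt = (
        f"Given this chart-plotting code, write {n} multi-step reasoning "
        f"questions with answers, formatted as JSON:\n{code}"
    )
    return llm(prompt)  # expected to return a list of {"question", "answer"} dicts

def synthesize_example(llm, topic: str) -> dict:
    out_dir = Path(tempfile.mkdtemp())
    code = generate_chart_code(llm, topic)
    image_path = render_chart(code, out_dir)
    qa = generate_qa_pairs(llm, code)
    # The (image, Q&A) pairs form multimodal training data distilled from a text-only LLM.
    return {"image": image_path, "qa_pairs": qa}
```

The key design point is that the chart never has to be seen by the teacher model: the plotting code carries the same information in text form, so answer quality can be checked against the data embedded in the code rather than against a rendered image.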


SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

arXiv.org Artificial Intelligence

As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.
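The description above corresponds to a decoding-time intervention: at each step, the gap between the Sentinel model's and the Intruder model's next-token distributions signals which tokens are beneficial or harmful, and that gap is used to shift the target model's output distribution. The following sketch illustrates one plausible form of that update, assuming Hugging Face-style models exposing `.logits`; the additive mixing rule and the `alpha` weight are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of response-disparity-guided decoding in the spirit of
# SafeAligner. The mixing rule and hyperparameter `alpha` are assumptions
# for illustration; the paper's exact update may differ.
import torch
import torch.nn.functional as F

@torch.no_grad()
def safety_guided_next_token(target_model, sentinel_model, intruder_model,
                             input_ids: torch.Tensor, alpha: float = 1.0) -> int:
    """Pick the next token by shifting the target distribution toward tokens
    the safety-trained Sentinel prefers and away from tokens the risk-prone
    Intruder prefers."""
    target_logits = target_model(input_ids).logits[:, -1, :]
    sentinel_logits = sentinel_model(input_ids).logits[:, -1, :]
    intruder_logits = intruder_model(input_ids).logits[:, -1, :]

    # The Sentinel-Intruder disparity scores tokens as beneficial (positive)
    # or harmful (negative); it is added to the target model's logits.
    disparity = sentinel_logits - intruder_logits
    guided_logits = target_logits + alpha * disparity

    probs = F.softmax(guided_logits, dim=-1)
    return int(torch.argmax(probs, dim=-1).item())
```

Because the adjustment happens only on the output distribution, the target model's weights stay untouched, which is consistent with the claim that the defense preserves utility while steering generation away from harmful continuations.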