MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Chen, Zhongpu, Liu, Yinfeng, Shi, Long, Wang, Zhi-Jie, Chen, Xingyan, Zhao, Yu, Ren, Fuji

Jan-24-2025, arXiv.org Artificial Intelligence

Large language models (LLMs) are expected to produce structured Markdown responses so that their output is readable in web chatbots (e.g., ChatGPT). Although many metrics exist for evaluating LLMs, they fail to evaluate readability from the perspective of output content structure. To this end, we focus on an overlooked yet important metric, Markdown Awareness, which directly affects the readability and structure of the content these models generate. In this paper, we introduce MDEval, a comprehensive benchmark for assessing the Markdown Awareness of LLMs, built on a dataset of 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks with statistical methods. Our results show that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% against human judgments, outperforming existing methods by a large margin. Extensive experiments also show that, after fine-tuning on our proposed dataset, less performant open-source models achieve performance comparable to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open-sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.
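The abstract reports agreement with human judgments as a Spearman correlation. As a rough illustration of what that statistic measures, the sketch below computes a Spearman rank correlation between two score lists using only the Python standard library. The function names (`average_ranks`, `spearman`) and the sample scores are hypothetical and chosen for illustration; they are not taken from the MDEval code base or its data, and the paper's own evaluation pipeline may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: automatic scores and human ratings for five responses.
metric_scores = [0.2, 0.1, 0.6, 0.5, 0.9]
human_ratings = [1, 2, 3, 4, 5]
rho = spearman(metric_scores, human_ratings)
```

A value near 1 means the automatic scores order the responses almost the same way the human raters do; the statistic depends only on ranks, so the two score scales do not need to match.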

large language model, machine learning, natural language

Country:
- Oceania > Australia
  - New South Wales > Sydney (0.05)
- North America > United States
  - New York > New York County > New York City (0.04)
- Asia
  - Thailand > Bangkok
    - Bangkok (0.04)
  - China
    - Sichuan Province > Chengdu (0.05)
    - Chongqing Municipality > Chongqing (0.04)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)