Zhu, Chenghao
Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization
Zhu, Chenghao, Chen, Nuo, Gao, Yufei, Zhang, Yunyi, Tiwari, Prayag, Wang, Benyou
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies that keep pace with improvements in language comprehension and information processing. However, traditional benchmarks, which are often static, fail to capture the continually changing information landscape, leading to a disparity between the perceived and actual effectiveness of LLMs in ever-changing real-world scenarios. Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts, revealing significant temporal biases in LLMs. We propose an evaluation framework, for dynamically generating benchmarks from recent real-world predictions. Experiments demonstrate that LLMs struggle with temporal generalization, showing performance decline over time. These findings highlight the necessity for improved training and updating processes to enhance adaptability and reduce biases. Our code, dataset and benchmark are available at https://github.com/FreedomIntelligence/FreshBench.
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
Ge, Wentao, Chen, Shunian, Chen, Guiming, Chen, Junying, Chen, Zhihong, Yan, Shuo, Zhu, Chenghao, Lin, Ziyue, Xie, Wenya, Wang, Xidong, Gao, Anningzhe, Zhang, Zhiyi, Li, Jianquan, Wan, Xiang, Wang, Benyou
In the pursuit of Artificial General Intelligence (AGI), the integration of vision in language models has marked a significant milestone. The advent of vision-language models (MLLMs) like GPT-4V have expanded AI applications, aligning with the multi-modal capabilities of the human brain. However, evaluating the efficacy of MLLMs poses a substantial challenge due to the subjective nature of tasks that lack definitive answers. Existing automatic evaluation methodologies on multi-modal large language models rely on objective queries that have standard answers, inadequately addressing the nuances of creative and associative multi-modal tasks. To address this, we introduce MLLM-Bench, an innovative benchmark inspired by Vicuna, spanning a diverse array of scenarios, including Perception, Understanding, Applying, Analyzing, Evaluating, and Creation along with the ethical consideration. MLLM-Bench is designed to reflect user experience more accurately and provide a more holistic assessment of model performance. Comparative evaluations indicate a significant performance gap between existing open-source models and GPT-4V. We posit that MLLM-Bench will catalyze progress in the open-source community towards developing user-centric vision-language models that meet a broad spectrum of real-world applications. See online leaderboard in \url{https://mllm-bench.llmzoo.com}.