AITopics | Wang, Changbo

Collaborating Authors

Wang, Changbo


GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

Lei, Zhikai, Liang, Tianyi, Hu, Hanglei, Zhang, Jin, Zhou, Yunhua, Shao, Yunfan, Li, Linyang, Li, Chenchui, Wang, Changbo, Yan, Hang, Guo, Qipeng

arXiv.org Artificial Intelligence, Dec-13-2024

Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations for representative models released prior to Gaokao. Contrary to prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns. We find that these phenomena are well-grounded in the motivations behind OpenAI o1, and that o1's reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
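For readers unfamiliar with the Rasch model named in the abstract, the following is a minimal sketch of its standard one-parameter form, in which the probability of a correct answer depends only on the gap between a test taker's ability and a question's difficulty. The function names and the example values are illustrative assumptions; the abstract does not describe how the authors fit or apply the model.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Standard Rasch (one-parameter logistic) probability of a correct answer.

    Both arguments are on the same logit scale; names are illustrative.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_curve(ability: float, difficulties: list[float]) -> list[float]:
    """Expected success probability for one test taker on each question."""
    return [rasch_probability(ability, d) for d in difficulties]


if __name__ == "__main__":
    # Hypothetical difficulties, from easy (-2) to hard (+2).
    for d, p in zip([-2.0, 0.0, 2.0], expected_curve(0.0, [-2.0, 0.0, 2.0])):
        print(f"difficulty={d:+.1f}  P(correct)={p:.3f}")
```

Under this model the expected success rate falls steadily as difficulty rises, which is why near-constant performance across difficulty levels, as reported in the abstract, counts as a discrepancy.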

benchmark, large language model, machine learning


arXiv: 2412.10056

Country:

Asia > China (1.00)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.87)

Industry: Education > Assessment & Standards > Student Performance (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)


SalienTime: User-driven Selection of Salient Time Steps for Large-Scale Geospatial Data Visualization

Chen, Juntong, Huang, Haiwen, Ye, Huayuan, Peng, Zhong, Li, Chenhui, Wang, Changbo

arXiv.org Artificial Intelligence, Mar-5-2024

The voluminous nature of geospatial temporal data from physical monitors and simulation models poses challenges to efficient data access, often resulting in cumbersome temporal selection experiences in web-based data portals. Thus, selecting a subset of time steps for prioritized visualization and pre-loading is highly desirable. Addressing this issue, this paper establishes a multifaceted definition of salient time steps via extensive need-finding studies with domain experts to understand their workflows. Building on this, we propose a novel approach that leverages autoencoders and dynamic programming to facilitate user-driven temporal selections. Structural features, statistical variations, and distance penalties are incorporated to make more flexible selections. User-specified priorities, spatial regions, and aggregations are used to combine different perspectives. We design and implement a web-based interface to enable efficient and context-aware selection of time steps and evaluate its efficacy and usability through case studies, quantitative evaluations, and expert interviews.
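The abstract mentions dynamic programming for choosing time steps but does not state the objective being optimized. The sketch below is therefore not the paper's algorithm. It is a generic segmentation-style dynamic program under assumed conditions: each time step has a feature vector (for example, an autoencoder embedding), each selected step represents itself and the steps after it until the next selected step, and the cost is the summed Euclidean distance between each step and its representative. The function name `select_time_steps` and the cost definition are hypothetical.

```python
from math import dist, inf


def select_time_steps(features, k):
    """Pick k time-step indices minimizing an assumed representation cost.

    features: list of equal-length numeric tuples, one per time step.
    Returns the chosen indices in increasing order (index 0 is always chosen,
    because every step must have a representative at or before it).
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of time steps")

    # cost[i][j]: error of representing steps i..j-1 by step i.
    cost = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        total = 0.0
        for j in range(i, n):
            total += dist(features[i], features[j])
            cost[i][j + 1] = total

    # best[m][j]: minimum cost of covering steps 0..j-1 with m chosen steps.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    prev = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = best[m - 1][i] + cost[i][j]
                if candidate < best[m][j]:
                    best[m][j] = candidate
                    prev[m][j] = i

    # Backtrack from the full range to recover the chosen indices.
    chosen, j = [], n
    for m in range(k, 0, -1):
        j = prev[m][j]
        chosen.append(j)
    return chosen[::-1]


if __name__ == "__main__":
    # Hypothetical 1-D features with an abrupt change at index 3.
    demo = [(0.0,), (0.0,), (0.0,), (10.0,), (10.0,), (10.0,)]
    print(select_time_steps(demo, 2))
```

The structural features, statistical variations, distance penalties, and user priorities listed in the abstract would enter through the cost term; this sketch uses a single distance only to keep the recurrence visible.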

data mining, machine learning, natural language


doi: 10.1145/3613904.3642944

arXiv: 2403.03449

Country: North America > United States > Colorado (0.14)

Genre:

Research Report (1.00)
Personal > Interview (0.66)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Visualization (1.00)
Information Technology > Human Computer Interaction (1.00)
Information Technology > Data Science > Data Mining (1.00)
