Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

Xiong, Ruoxin, Wang, Yanyu, Gunhan, Suat, Zhu, Yimin, Berryman, Charles

Apr-15-2025–arXiv.org Artificial Intelligence

arXiv:2504.08779v1

ABSTRACT

The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from [...]

The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.

INTRODUCTION

The construction industry is undergoing a transformation driven by digital technologies, increased project complexity, heterogeneous regulations, and ongoing labor shortages (Abioye et al. 2021). These changes create a pressing need for intelligent tools that can augment human expertise and support decision-making in construction management (CM) (Regona et al. 2022).
Among these technologies, large language models (LLMs) such as GPT-4 and Claude have shown competitive performance in general reasoning, natural language understanding, and educational applications (Ooi et al. 2025).
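
The accuracy figures in the abstract (overall, single-step, and multi-step, each compared against a 70% pass threshold) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `accuracy_by_group`, the record keys `id`, `answer`, and `steps`, and the toy questions are all assumptions made up for illustration, and none of the data comes from CMExamSet.

```python
from collections import defaultdict

# Typical human pass threshold cited in the abstract (70%).
PASS_THRESHOLD = 0.70


def accuracy_by_group(questions, predictions):
    """Return accuracy per task type plus an 'overall' entry.

    questions:   list of dicts with hypothetical keys 'id', 'answer', 'steps'
    predictions: dict mapping question id -> predicted choice letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        hit = predictions.get(q["id"]) == q["answer"]
        # Count each question once in its own group and once overall.
        for group in (q["steps"], "overall"):
            total[group] += 1
            correct[group] += int(hit)
    return {group: correct[group] / total[group] for group in total}


# Toy data, invented for illustration only (not from CMExamSet).
questions = [
    {"id": 1, "answer": "A", "steps": "single"},
    {"id": 2, "answer": "C", "steps": "single"},
    {"id": 3, "answer": "B", "steps": "multi"},
    {"id": 4, "answer": "D", "steps": "multi"},
]
predictions = {1: "A", 2: "C", 3: "B", 4: "A"}

scores = accuracy_by_group(questions, predictions)
passed = scores["overall"] >= PASS_THRESHOLD
```

With this toy input, the model answers both single-step questions correctly and one of the two multi-step questions, so the overall score clears the threshold even though the multi-step score alone would not. The abstract does not say how its averages were aggregated (per question or per exam), so the per-question pooling used here is an assumption.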

accuracy, large language model, machine learning


Country:
- North America > United States > Louisiana > East Baton Rouge Parish > Baton Rouge (0.14)

Genre:
- Instructional Material (1.00)
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Construction & Engineering (1.00)
- Education > Educational Setting
  - Online (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
