On the Evaluation of Engineering Artificial General Intelligence

Neema, Sandeep, Jha, Susmit, Nagel, Adam, Lew, Ethan, Sureshkumar, Chandrasekar, Gordic, Aleksa, Shimmin, Chase, Nguygen, Hieu, Eremenko, Paul

May-19-2025–arXiv.org Artificial Intelligence

We discuss the challenges and propose a framework for evaluating engineering artificial general intelligence (eAGI) agents. We consider eAGI as a specialization of artificial general intelligence (AGI), deemed capable of addressing a broad range of problems in the engineering of physical systems and associated controllers. We exclude software engineering for a tractable scoping of eAGI and expect dedicated software engineering AI agents to address the software implementation challenges. Similar to human engineers, eAGI agents should possess a unique blend of background knowledge (recall and retrieve) of facts and methods, demonstrate familiarity with tools and processes, exhibit deep understanding of industrial components and well-known design families, and be able to engage in creative problem solving (analyze and synthesize), transferring ideas acquired in one context to another. Given this broad mandate, evaluating and qualifying the performance of eAGI agents is a challenge in itself and, arguably, a critical enabler to developing eAGI agents. In this paper, we address this challenge by proposing an extensible evaluation framework that specializes and grounds Bloom's taxonomy - a framework for evaluating human learning that has also been recently used for evaluating LLMs - in an engineering design context. Our proposed framework advances the state of the art in benchmarking and evaluation of AI agents in terms of the following: (a) developing a rich taxonomy of evaluation questions spanning from methodological knowledge to real-world design problems; (b) motivating a pluggable evaluation framework that can evaluate not only textual responses but also evaluate structured design artifacts such as CAD models and SysML models; and (c) outlining an automatable procedure to customize the evaluation benchmark to different engineering contexts.
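As a rough illustration of what a "pluggable" evaluation framework in the sense of point (b) could look like, the sketch below tags each question with a level of the revised Bloom's taxonomy and dispatches scoring to a per-artifact-type plugin. This is not the authors' implementation: the names `Question`, `register`, `evaluate`, the `artifact_type` tags, and the exact-match text scorer are all assumptions made for illustration, and the paper's engineering-specific grounding of the taxonomy is not reproduced.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class BloomLevel(Enum):
    # Levels of the revised Bloom's taxonomy (general form, not the
    # paper's engineering-specific specialization).
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass
class Question:
    # All fields are hypothetical; the paper does not define this schema.
    prompt: str
    level: BloomLevel
    artifact_type: str  # e.g. "text", "cad", "sysml"
    reference: str      # reference answer or serialized reference artifact


# A scorer takes (response, reference) and returns a score in [0, 1].
Scorer = Callable[[str, str], float]
SCORERS: Dict[str, Scorer] = {}


def register(artifact_type: str) -> Callable[[Scorer], Scorer]:
    """Decorator that plugs a scorer in for one artifact type."""
    def wrap(fn: Scorer) -> Scorer:
        SCORERS[artifact_type] = fn
        return fn
    return wrap


@register("text")
def score_text(response: str, reference: str) -> float:
    # Placeholder metric: case-insensitive exact match. A real textual
    # scorer would be considerably more forgiving.
    same = response.strip().lower() == reference.strip().lower()
    return 1.0 if same else 0.0


def evaluate(question: Question, response: str) -> float:
    """Score a response with whichever plugin handles its artifact type."""
    scorer = SCORERS.get(question.artifact_type)
    if scorer is None:
        raise KeyError(f"no scorer registered for {question.artifact_type!r}")
    return scorer(response, question.reference)
```

Under this assumed design, support for CAD or SysML artifacts would be added by registering another scorer, for example one that parses the model and compares it against the reference, without changing `evaluate` or the question bank.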

artificial intelligence, large language model, natural language

Genre:
- Research Report (0.82)

Industry:
- Transportation (1.00)
- Electrical Industrial Apparatus (1.00)
- Education (1.00)
- Energy > Energy Storage (0.93)
- Aerospace & Defense (0.68)
- Automobiles & Trucks (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Cognitive Science > Problem Solving (0.67)
  - Representation & Reasoning > Agents (0.54)
  - Natural Language > Large Language Model (0.50)
