On the Evaluation of Engineering Artificial General Intelligence

Neema, Sandeep; Jha, Susmit; Nagel, Adam; Lew, Ethan; Sureshkumar, Chandrasekar; Gordic, Aleksa; Shimmin, Chase; Nguygen, Hieu; Eremenko, Paul

arXiv.org Artificial Intelligence 

We discuss the challenges and propose a framework for evaluating engineering artificial general intelligence (eAGI) agents. We consider eAGI as a specialization of artificial general intelligence (AGI), deemed capable of addressing a broad range of problems in the engineering of physical systems and associated controllers. We exclude software engineering for a tractable scoping of eAGI and expect dedicated software engineering AI agents to address the software implementation challenges. Similar to human engineers, eAGI agents should possess a unique blend of background knowledge (recall and retrieve) of facts and methods, demonstrate familiarity with tools and processes, exhibit deep understanding of industrial components and well-known design families, and be able to engage in creative problem solving (analyze and synthesize), transferring ideas acquired in one context to another. Given this broad mandate, evaluating and qualifying the performance of eAGI agents is a challenge in itself and, arguably, a critical enabler to developing eAGI agents. In this paper, we address this challenge by proposing an extensible evaluation framework that specializes and grounds Bloom's taxonomy - a framework for evaluating human learning that has also been recently used for evaluating LLMs - in an engineering design context. Our proposed framework advances the state of the art in benchmarking and evaluation of AI agents in terms of the following: (a) developing a rich taxonomy of evaluation questions spanning from methodological knowledge to real-world design problems; (b) motivating a pluggable evaluation framework that can evaluate not only textual responses but also evaluate structured design artifacts such as CAD models and SysML models; and (c) outlining an automatable procedure to customize the evaluation benchmark to different engineering contexts.
