AITopics | Instructional Material

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Mudur, Nayantara, Cui, Hao, Venugopalan, Subhashini, Raccuglia, Paul, Brenner, Michael P., Norgaard, Peter

arXiv.org Artificial Intelligence, Apr-9-2025

Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at https://github.com/google/feabench

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.0626

Country: Europe (0.46)

Genre:

Research Report (1.00)
Workflow (0.92)
Instructional Material > Course Syllabus & Notes (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation

Deroy, Aniket, Maity, Subhankar

arXiv.org Artificial Intelligence, Apr-9-2025

Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best solution generated by the LLM. The evaluation is then refined through iterative review and response with another LLM until the evaluation metric values converge. Automating the evaluation in this way improves the estimation of question quality. Correlation scores show that the proposed method improves correlation with human judgments compared to the baseline method. Error analysis shows that metrics like relevance and appropriateness improve significantly relative to human judgments by using STRIVE.
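The "iterate until the metric values converge" step in the abstract can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: the function `refine_until_converged`, its `tolerance` and `max_rounds` parameters, and the stub reviewer are all hypothetical, and the stub stands in for the second LLM that the abstract describes.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def refine_until_converged(
    question: str,
    initial_scores: Scores,
    review_step: Callable[[str, Scores], Scores],
    tolerance: float = 0.0,
    max_rounds: int = 10,
) -> Tuple[Scores, int]:
    """Repeat a review step until no metric changes by more than `tolerance`.

    `review_step` is a hypothetical stand-in for the reviewing LLM: it takes
    the question and the current metric scores and returns revised scores
    for the same metric names.
    """
    scores = dict(initial_scores)
    for round_index in range(1, max_rounds + 1):
        new_scores = review_step(question, scores)
        # Largest absolute change across all metrics in this round.
        change = max(
            (abs(new_scores[name] - scores[name]) for name in scores),
            default=0.0,
        )
        scores = new_scores
        if change <= tolerance:
            return scores, round_index
    # Stop after max_rounds even if the scores are still moving.
    return scores, max_rounds


def stub_review(question: str, scores: Scores) -> Scores:
    """Deterministic placeholder reviewer: nudges every score up, capped at 4."""
    return {name: min(value + 1, 4) for name, value in scores.items()}


final_scores, rounds = refine_until_converged(
    "What is the capital of France?",
    {"relevance": 2, "appropriateness": 4},
    stub_review,
)
print(final_scores, rounds)
```

The cap on `max_rounds` is a design choice of this sketch: a real reviewer model may oscillate rather than settle, so a loop of this kind needs a stopping rule that does not depend on convergence alone.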

baseline approach, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2504.05693

Country:

North America > United States (0.48)
Asia > India > West Bengal (0.14)

Genre:

Research Report (0.50)
Instructional Material (0.34)

Industry: Education (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design

Zhang, Xueqiao, Zhang, Chao, Sun, Jianwen, Xiao, Jun, Yang, Yi, Luo, Yawei

arXiv.org Artificial Intelligence, Apr-9-2025

Large Language Models (LLMs) have significantly advanced smart education in the Artificial General Intelligence (AGI) era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: (1) Customized Generation: generating niche-targeted teaching content based on students' varying learning abilities and states, and (2) Intelligent Optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM-based multi-agent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill-Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students' knowledge levels and learning abilities. Additionally, we introduce the CIDDP, an LLM-based five-dimensional evaluation module encompassing Clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework. Our code is publicly available at https://github.com/Zc0812/Edu_Planner

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.0537

Country: Asia > China (0.68)

Genre:

Instructional Material (1.00)
Research Report > New Finding (0.46)

Industry:

Education > Educational Setting (0.93)
Education > Curriculum > Subject-Specific Education (0.68)
Education > Assessment & Standards > Student Performance (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The Role of Environment Access in Agnostic Reinforcement Learning

Krishnamurthy, Akshay, Li, Gene, Sekhari, Ayush

arXiv.org Machine Learning, Apr-7-2025

We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that: 1. Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access. 2. Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness. 3. On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all policies $\pi \in \Pi$. These values are approximated without any explicit value function class.
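The agnostic objective described in this abstract can be written out explicitly. The notation is an assumption of this note rather than something the abstract fixes: $V^{\pi}$ denotes the expected return of policy $\pi$ from the initial state distribution, and $\varepsilon$ and $\delta$ are the usual accuracy and confidence parameters. The learner must output a policy $\hat{\pi}$ such that

```latex
\[
  V^{\hat{\pi}} \;\ge\; \max_{\pi \in \Pi} V^{\pi} - \varepsilon
  \quad \text{with probability at least } 1 - \delta .
\]
```

The benchmark on the right-hand side is the best policy in $\Pi$, not the optimal policy of the underlying MDP, which is why the guarantee stays meaningful even when $\Pi$ contains no optimal policy.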

lat, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2504.05405

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.45)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)

Agentic Large Language Models, a survey

Plaat, Aske, van Duijn, Max, van Stein, Niki, Preuss, Mike, van der Putten, Peter, Batenburg, Kees Joost

arXiv.org Artificial Intelligence, Apr-3-2025

There is great interest in agentic LLMs, large language models that act as agents. We review the growing body of work in this area and provide a research agenda. Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi-agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics and financial market analysis. Meanwhile, self-reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs may provide a solution for the problem of LLMs running out of training data: inference-time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets. We note that there is risk associated with LLM assistants taking action in the real world, while agentic LLMs are also likely to benefit society.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2503.23037

Country:

Europe > Netherlands > South Holland > Leiden (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Middle East > Jordan (0.04)
(12 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Instructional Material (1.00)
Research Report > Experimental Study (0.92)

Industry:

Transportation > Air (1.00)
Leisure & Entertainment > Games > Computer Games (1.00)
Information Technology (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Interview with Joseph Marvin Imperial: aligning generative AI with technical standards

AIHub, Apr-2-2025, 08:03:28 GMT

In this interview series, we're meeting some of the AAAI/SIGAI Doctoral Consortium participants to find out more about their research. The Doctoral Consortium provides an opportunity for a group of PhD students to discuss and explore their research interests and career objectives in an interdisciplinary workshop together with a panel of established researchers. In the latest interview, we hear from Joseph Marvin Imperial, who is focussed on aligning generative AI with technical standards for regulatory and operational compliance. Standards are documents created by industry and/or academic experts that have been recognized to ensure the quality, accuracy, and interoperability of systems and processes (aka "the best way of doing things"). You'll see standards in almost all sectors and domains, including the sciences, healthcare, education, finance, journalism, law, and engineering.

doctoral consortium, generative ai, joseph marvin imperial, (12 more...)

AIHub

Country:

Europe > United Kingdom (0.15)
Asia > Philippines (0.05)

Genre: Instructional Material (0.35)

Industry:

Health & Medicine (0.51)
Media > News (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.62)

VizFlyt: Perception-centric Pedagogical Framework For Autonomous Aerial Robots

Srivastava, Kushagra, Kulkarni, Rutwik, Velmurugan, Manoj, Sanket, Nitin J.

arXiv.org Artificial Intelligence, Apr-1-2025

Autonomous aerial robots are becoming commonplace in our lives. Hands-on aerial robotics courses are pivotal in training the next-generation workforce to meet the growing market demands. Such an efficient and compelling course depends on a reliable testbed. We utilize pose from an external localization system to hallucinate real-time and photorealistic visual sensors using 3D Gaussian Splatting. This enables stress-free testing of autonomy algorithms on aerial robots without the risk of crashing into obstacles. We achieve over 100Hz of system update rate. Lastly, we build upon our past experiences of offering hands-on aerial robotics courses and propose a new open-source and open-hardware curriculum based on VizFlyt for the future. We test our framework on various course projects in real-world HITL experiments and present the results showing the efficacy of such a system and its large potential use cases. Code, datasets, hardware guides and demo videos are available at https://pear.wpi.edu/research/vizflyt.html

artificial intelligence, robot, student, (17 more...)

arXiv.org Artificial Intelligence

2503.22876

Country:

North America > United States > Ohio (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
North America > United States > Pennsylvania (0.04)
(5 more...)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Transportation > Air (1.00)
Information Technology (1.00)
Education (1.00)
Health & Medicine (0.66)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)

CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation

Zhao, Chengshuai, De Maria, Riccardo, Kumarage, Tharindu, Chaudhary, Kumar Satvik, Agrawal, Garima, Li, Yiwen, Park, Jongchan, Deng, Yuli, Chen, Ying-Chih, Liu, Huan

arXiv.org Artificial Intelligence, Mar-31-2025

Advancements in large language models (LLMs) have enabled the development of intelligent educational tools that support inquiry-based learning across technical domains. In cybersecurity education, where accuracy and safety are paramount, systems must go beyond surface-level relevance to provide information that is both trustworthy and domain-appropriate. To address this challenge, we introduce CyberBOT, a question-answering chatbot that leverages a retrieval-augmented generation (RAG) pipeline to incorporate contextual information from course-specific materials and validate responses using a domain-specific cybersecurity ontology. The ontology serves as a structured reasoning layer that constrains and verifies LLM-generated answers, reducing the risk of misleading or unsafe guidance. CyberBOT has been deployed in a large graduate-level course at Arizona State University (ASU), where more than one hundred students actively engage with the system through a dedicated web-based platform. Computational evaluations in lab environments highlight the potential capacity of CyberBOT, and a forthcoming field study will evaluate its pedagogical impact. By integrating structured domain reasoning with modern generative capabilities, CyberBOT illustrates a promising direction for developing reliable and curriculum-aligned AI applications in specialized educational contexts.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.00389

Country:

North America > United States > Arizona (0.25)
Europe > France (0.04)
South America > Uruguay > Maldonado > Maldonado (0.04)
North America > United States > Virginia (0.04)

Genre:

Research Report > Experimental Study (1.00)
Instructional Material > Course Syllabus & Notes (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Education > Educational Technology > Educational Software (0.93)
Government > Military > Cyberwarfare (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

Pathak, Aditya, Gandhi, Rachit, Uttam, Vaibhav, Devansh, Nakka, Yashwanth, Jindal, Aaryan Raj, Ghosh, Pratyush, Ramamoorthy, Arnav, Verma, Shreyash, Mittal, Aditya, Ased, Aashna, Khatri, Chirag, Challa, Jagat Sesh, Kumar, Dhruv

arXiv.org Artificial Intelligence, Mar-31-2025

Since the disruption in LLM technology brought about by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation remains a popular field of research, code evaluation using LLMs remains a problem with no conclusive solution. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose novel multi-agentic approaches using question-specific rubrics tailored to the problem statement, arguing that these perform better for logical assessment than the existing approaches that use question-agnostic rubrics. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. In addition to standard metrics (Spearman Correlation, Cohen's Kappa), we propose a new metric called Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings, providing better feedback aligned with instructional goals beyond mere syntactic correctness.
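The two standard metrics named in this abstract have textbook definitions, which the following sketch computes in plain Python for a pair of graders (for example, an expert and an LLM). The function names and the sample data are illustrative assumptions and do not come from the paper. Leniency is not implemented here because the abstract does not give its formula.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in rank_y) ** 0.5
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined when one sequence is constant
    return covariance / (spread_x * spread_y)


# Hypothetical grades: categorical verdicts for kappa, numeric marks for Spearman.
expert_verdicts = ["pass", "pass", "fail", "fail"]
llm_verdicts = ["pass", "fail", "fail", "fail"]
expert_marks = [3, 2, 5, 4, 1]
llm_marks = [3, 1, 5, 4, 2]

print(cohens_kappa(expert_verdicts, llm_verdicts))
print(spearman(expert_marks, llm_marks))
```

Kappa suits categorical or binned grades, while Spearman suits numeric marks where only the ordering of submissions matters, which is presumably why evaluations of this kind report both.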

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2503.23989

Country:

North America > United States > Virginia > Albemarle County > Charlottesville (0.05)
Asia > India (0.05)
North America > United States > New York > New York County > New York City (0.05)
(8 more...)

Genre:

Instructional Material (1.00)
Research Report > New Finding (0.93)
Research Report > Promising Solution (0.66)

Industry:

Education > Curriculum > Subject-Specific Education (1.00)
Education > Assessment & Standards (0.93)
Education > Educational Setting (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Fan, Siqi, Huang, Xiusheng, Yao, Yiqun, Fang, Xuezhi, Liu, Kang, Han, Peng, Shang, Shuo, Sun, Aixin, Wang, Yequan

arXiv.org Artificial Intelligence, Mar-30-2025

Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches. In experiments on models like Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2503.23514

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Austria > Vienna (0.14)
Asia > Thailand > Bangkok > Bangkok (0.05)
(8 more...)

Genre:

Instructional Material (1.00)
Research Report > New Finding (0.46)

Industry: Education > Educational Setting > Continuing Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
