AITopics

Embodied decision-making enables agents to translate high-level goals into executable actions through continuous interactions within the physical world, forming a cornerstone of general-purpose embodied intelligence. Large language models (LLMs), with their general decision-making capabilities, offer a promising path to realize this potential; however, LLMs trained solely on language lack exposure to physical environments, limiting their true embodied understanding. To bridge this gap, we propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals, offering a one-stop solution for LLM acquire genuine embodied decision-making skills. In this work, we present EmboMatrix, the first training ground of its kind, providing massive and diverse tasks with efficient simulation and precise rewards. EmboMatrix incorporates a series of novel techniques: a multi-agent data engine for large-scale task and scene generation, a distributed heterogeneous-hardware system for scalable simulation, and a multi-level reward architecture for precise supervision. Leveraging EmboMatrix, we cultivate EmboBrain, an LLM whose embodied decision-making abilities emerge from extensive embodied interactions. Experiments show that EmboBrain-7B surpasses the 671B DeepSeek-R1 baseline by 9.5\% on two challenging embodied decision-making benchmarks, demonstrating the power of interactive, environment-grounded learning for building truly intelligent embodied agents.

large language model, machine learning, natural language, (18 more...)

2510.12072

Country:

North America > United States (0.46)
Europe > Austria > Vienna (0.14)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Lee, Jin Hwa, Smith, Matthew, Adam, Maxwell, Hoogland, Jesse

Influence Dynamics and Stagewise Data Attribution

Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.

artificial intelligence, machine learning, natural language, (14 more...)

2510.12071

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization

Li, Sunzhu, Lin, Zhiyu, Yang, Shuling, Zhao, Jiale, Chen, Wei

Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Currently, training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRMs reasoning. It uses an evolutionary process to generate think-prefixes, which are instructions that evolve driven by a taxonomy of reasoning behaviors to guide models toward superior performance. Extensive experiments demonstrate ThinkPilot's broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7), and enhances instruction following. It also synergizes with existing training-based methods. Our analysis reveals that think-prefixes can reliably control LRMs' reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRMs reasoning with task demands. Data and code are available at https://github.com/teqkilla/ThinkPilot

arxiv preprint arxiv, large language model, machine learning, (19 more...)

2510.12063

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.93)
(2 more...)

Generate Logical Equivalence Questions

Wang, Xinyu, Yu, Haoming, Yang, Yicheng, Li, Zhiyuan

Academic dishonesty is met with zero tolerance in higher education, yet plagiarism has become increasingly prevalent in the era of online teaching and learning. Automatic Question Generation (AQG) presents a potential solution to mitigate copying by creating unique questions for each student. Additionally, AQG can provide a vast array of practice questions. Our AQG focuses on generating logical equivalence questions for Discrete Mathematics, a foundational course for first-year computer science students. A literature review reveals that existing AQGs for this type of question generate all propositions that meet user-defined constraints, resulting in inefficiencies and a lack of uniform question difficulty. To address this, we propose a new approach that defines logical equivalence questions using a formal language, translates this language into two sets of generation rules, and develops a linear-time algorithm for question generation. We evaluated our AQG through two experiments. The first involved a group of students completing questions generated by our system. Statistical analysis shows that the accuracy of these questions is comparable to that of textbook questions. The second experiment assessed the number of steps required to solve our generated questions, textbook questions, and those generated by multiple large language models. The results indicated that the difficulty of our questions was similar to that of textbook questions, confirming the quality of our AQG.

artificial intelligence, logic & formal reasoning, natural language, (17 more...)

2510.12001

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Industry:

Education > Educational Setting > Online (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives

Rowan, Conor

Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations as well as the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.

artificial intelligence, machine learning, neural network, (17 more...)

2510.11987

Country: North America > United States > Colorado (0.28)

Genre: Research Report (0.64)

Industry: Education > Curriculum > Subject-Specific Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Haller, Patrick, Ibrahim, Mark, Kirichenko, Polina, Sagun, Levent, Bell, Samuel J.

LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings -- often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness -- i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement's exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.

large language model, machine learning, natural language, (19 more...)

2510.11905

Country:

North America > United States (0.28)
North America > Mexico (0.28)
Asia > Middle East (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Artificial Intelligence for Optimal Learning: A Comparative Approach towards AI-Enhanced Learning Environments

Hariharan, Ananth

In the rapidly evolving educational landscape, the integration of technology has shifted from an enhancement to a cornerstone of educational strategy worldwide. This transition is propelled by advancements in digital technology, especially the emergence of artificial intelligence as a crucial tool in learning environments. This research project critically evaluates the impact of three distinct educational settings: traditional educational methods without technological integration, those enhanced by non-AI technology, and those utilising AI-driven technologies. This comparison aims to assess how each environment influences educational outcomes, engagement, pedagogical methods, and equity in access to learning resources, and how each contributes uniquely to the learning experience. The ultimate goal of this research is to synthesise the strengths of each model to create a more holistic educational approach. By integrating the personal interaction and tested pedagogical techniques of traditional classrooms, the enhanced accessibility and collaborative tools offered by non-AI technology, and the personalised, adaptive learning strategies enabled by AI-driven technologies, education systems can develop richer, more effective learning environments. This hybrid approach aims to leverage the best elements of each setting, thereby enhancing educational outcomes, engagement, and inclusiveness, while also addressing the distinct challenges and limitations inherent in each model. The intention is to create an educational framework deeply attentive to the diverse needs of students, ensuring equitable access to high-quality education for all.

artificial intelligence, machine learning, natural language, (18 more...)

2510.11755

Genre:

Research Report (1.00)
Instructional Material (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education > Educational Setting > Online (0.93)
Education > Educational Technology > Educational Software > Computer Based Training (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.94)

Romani, Michele, Zanoni, Devis, Farella, Elisabetta, Turchet, Luca

BrainForm: a Serious Game for BCI Training and Data Collection

BrainForm is a gamified Brain-Computer Interface (BCI) training system designed for scalable data collection using consumer hardware and a minimal setup. We investigated (1) how users develop BCI control skills across repeated sessions and (2) perceptual and performance effects of two visual stimulation textures. Game Experience Questionnaire (GEQ) scores for Flow, Positive Affect, Competence and Challenge were strongly positive, indicating sustained engagement. A within-subject study with multiple runs, two task complexities, and post-session questionnaires revealed no significant performance differences between textures but increased ocular irritation over time. Online metrics--Task Accuracy, Task Time, and Information Transfer Rate--improved across sessions, confirming learning effects for symbol spelling, even under pressure conditions. Our results highlight the potential of BrainForm as a scalable, user-friendly BCI research tool and offer guidance for sustained engagement and reduced training fatigue.

artificial intelligence, machine learning, participant, (16 more...)

2510.10169

Country: Europe > Italy (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.46)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.94)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Human Computer Interaction (0.94)

Zhang, Brian Hu, Anagnostides, Ioannis, Sandholm, Tuomas

Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games

A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established since Nemirovski's mirror-prox algorithm and Nesterov's excessive gap technique in the early 2000s, the most effective paradigm in practice is *counterfactual regret minimization*, which is based on *regret matching* and its modern variants. In particular, the state of the art across most benchmarks is *predictive* regret matching$^+$ (PRM$^+$), in conjunction with non-uniform averaging. Yet, such algorithms can exhibit slower $Ω(T^{-1/2})$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also being on par with PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between IREG-PRM$^+$ and optimistic gradient descent with *adaptive* learning rate. The basic flaw of PRM$^+$ is that the ($\ell_2$-)norm of the regret vector -- which can be thought of as the inverse of the learning rate -- can decrease. By contrast, we design IREG-PRM$^+$ so as to maintain the invariance that the norm of the regret vector is nondecreasing. This enables us to derive an RVU-type bound for IREG-PRM$^+$, the first such property that does not rely on introducing additional hyperparameters to enforce smoothness. Furthermore, we find that IREG-PRM$^+$ performs on par with an adaptive version of optimistic gradient descent that we introduce whose learning rate depends on the misprediction error, demystifying the effectiveness of the regret matching family *vis-a-vis* more standard optimization techniques.

algorithm, artificial intelligence, machine learning, (18 more...)

2510.04407

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.40)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
(2 more...)

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Chen, Yurun, Hu, Xavier, Liu, Yuhan, Wang, Ziqi, Liao, Zeyi, Chen, Lin, Wei, Feng, Qian, Yuxi, Zheng, Bo, Yin, Keting, Zhang, Shengyu

As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents' reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.

large language model, machine learning, natural language, (18 more...)

2510.00507

Genre: Research Report (1.00)

Industry:

Education (0.54)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)