Agents of Change: Self-Evolving LLM Agents for Strategic Planning
Belle, Nikolas, Barnes, Dakota, Amayuelas, Alfonso, Bercovich, Ivan, Wang, Xin Eric, Wang, William
We address the long-horizon gap in large language model (LLM) agents by enabling them to sustain coherent strategies in adversarial, stochastic environments. Settlers of Catan provides a challenging benchmark: success depends on balancing short- and long-term goals amid randomness, trading, expansion, and blocking. Prompt-centric LLM agents (e.g., ReAct, Reflexion) must re-interpret large, evolving game states each turn, quickly saturating context windows and losing strategic consistency. We propose HexMachina, a continual learning multi-agent system that separates environment discovery (inducing an adapter layer without documentation) from strategy improvement (evolving a compiled player through code refinement and simulation). This design preserves executable artifacts, allowing the LLM to focus on high-level strategy rather than per-turn reasoning. In controlled Catanatron experiments, HexMachina learns from scratch and evolves players that outperform the strongest human-crafted baseline (AlphaBeta), achieving a 54% win rate and surpassing prompt-driven and no-discovery baselines. Ablations confirm that isolating pure strategy learning improves performance. Overall, artifact-centric continual learning transforms LLMs from brittle stepwise deciders into stable strategy designers, advancing long-horizon autonomy.
Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
Su, Junhao, Wan, Yuanliang, Yang, Junwei, Shi, Hengyu, Han, Tianyang, Luo, Junfeng, Qiu, Yurui
Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to 'think more' instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.
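The programmatic checks mentioned above (structural validity and parameter correctness) can be illustrated with a minimal sketch. This is not the Tool-Reflection-Bench implementation: the JSON call format, the schema layout, the `get_weather` tool, and the `check_tool_call` helper are all hypothetical and assumed only for illustration.

```python
import json
from typing import Any

# Hypothetical tool schema (an assumption, not from the paper):
# parameter name -> (expected Python type, required?)
WEATHER_SCHEMA = {
    "name": "get_weather",
    "params": {"city": (str, True), "days": (int, False)},
}


def check_tool_call(raw_call: str, schema: dict[str, Any]) -> list[str]:
    """Return the problems found in a JSON-encoded tool call (empty list if none)."""
    # Structural validity: the call must parse and have the expected shape.
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return ["call must be an object with 'name' and 'arguments'"]

    problems = []
    if call["name"] != schema["name"]:
        problems.append(f"unknown tool: {call['name']}")
    args = call["arguments"]
    if not isinstance(args, dict):
        return problems + ["'arguments' must be an object"]

    # Parameter correctness: required parameters present, types as declared.
    for param, (expected_type, required) in schema["params"].items():
        if param not in args:
            if required:
                problems.append(f"missing required parameter: {param}")
        elif not isinstance(args[param], expected_type):
            problems.append(f"wrong type for parameter: {param}")
    for param in args:
        if param not in schema["params"]:
            problems.append(f"unexpected parameter: {param}")
    return problems


if __name__ == "__main__":
    bad_call = '{"name": "get_weather", "arguments": {"days": "3"}}'
    print(check_tool_call(bad_call, WEATHER_SCHEMA))
```

A list of problems like this could serve as the evidence an agent cites when diagnosing a failed call before proposing a corrected one. Executability and result consistency would additionally require running the tool, which this sketch leaves out.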
Echoes of Biases: How Stigmatizing Language Affects AI Performance
Liu, Yizhi, Wang, Weiguang, Gao, Guodong Gordon, Agarwal, Ritu
Electronic health records (EHRs) serve as an essential data source for the envisioned artificial intelligence (AI)-driven transformation in healthcare. However, clinician biases reflected in EHR notes can lead to AI models inheriting and amplifying these biases, perpetuating health disparities. This study investigates the impact of stigmatizing language (SL) in EHR notes on mortality prediction using a Transformer-based deep learning model and explainable AI (XAI) techniques. Our findings demonstrate that SL written by clinicians adversely affects AI performance, particularly so for black patients, highlighting SL as a source of racial disparity in AI model development. To explore an operationally efficient way to mitigate SL's impact, we investigate patterns in the generation of SL through a clinicians' collaborative network, identifying central clinicians as having a stronger impact on racial disparity in the AI model. We find that removing SL written by central clinicians is a more efficient bias reduction strategy than eliminating all SL in the entire corpus of data. This study provides actionable insights for responsible AI development and contributes to understanding clinician behavior and EHR note writing in healthcare.
Analysis of ChatGPT on Source Code
Sadik, Ahmed R., Ceravola, Antonello, Joublin, Frank, Patra, Jibesh
This paper explores the use of Large Language Models (LLMs) and in particular ChatGPT in programming, source code analysis, and code generation. LLMs and ChatGPT are built using machine learning and artificial intelligence techniques, and they offer several benefits to developers and programmers. While these models can save time and provide highly accurate results, they are not yet advanced enough to replace human programmers entirely. The paper investigates the potential applications of LLMs and ChatGPT in various areas, such as code creation, code documentation, bug detection, refactoring, and more. The paper also suggests that the usage of LLMs and ChatGPT is expected to increase in the future as they offer unparalleled benefits to the programming community.
Schema Encoding for Transferable Dialogue State Tracking
Jeon, Hyunmin, Lee, Gary Geunbae
Dialogue state tracking (DST) is an essential sub-task for task-oriented dialogue systems. Recent work has focused on deep neural models for DST. However, these models require a large training dataset. Furthermore, applying them to another domain requires a new dataset, because neural models are generally trained to imitate the given dataset. In this paper, we propose Schema Encoding for Transferable Dialogue State Tracking (SET-DST), a neural DST method for effective transfer to new domains. Transferable DST could assist the development of dialogue systems even when little data is available for the target domain. We use a schema encoder not just to imitate the dataset but to comprehend its schema. We aim to transfer the model to new domains by encoding new schemas and using them for DST in multi-domain settings. As a result, SET-DST improved the joint accuracy by 1.46 points on MultiWOZ 2.1.
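For readers unfamiliar with the metric, joint accuracy in DST is commonly computed as the fraction of dialogue turns whose predicted state matches the gold state on every slot. The sketch below assumes that common definition; the abstract does not spell out its evaluation code, and the `joint_accuracy` helper and the slot names are hypothetical.

```python
def joint_accuracy(predicted: list[dict[str, str]], gold: list[dict[str, str]]) -> float:
    """Fraction of turns whose predicted dialogue state equals the gold state exactly.

    Each state is a mapping from slot name to value. A turn counts as correct
    only if every slot matches, so a single wrong slot fails the whole turn.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    exact_matches = sum(p == g for p, g in zip(predicted, gold))
    return exact_matches / len(gold)


if __name__ == "__main__":
    # Illustrative slot names only; not taken from the paper.
    gold_states = [
        {"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "4"},
    ]
    predicted_states = [
        {"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot
    ]
    print(joint_accuracy(predicted_states, gold_states))
```

Because one incorrect slot invalidates the whole turn, the metric is strict, which is why gains of a point or so are typically reported as meaningful.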
New analytical tool locates shooters using smartphone video
Researchers at Carnegie Mellon University have developed a system that can accurately locate a shooter based on video recordings from as few as three smartphones. When demonstrated using three video recordings from the 2017 mass shooting in Las Vegas that left 58 people dead and hundreds wounded, the system correctly estimated the shooter's actual location--the north wing of the Mandalay Bay hotel. The estimate was based on three gunshots fired within the first minute of what would be a prolonged massacre. Alexander Hauptmann, research professor in CMU's Language Technologies Institute, said the system, called Video Event Reconstruction and Analysis (VERA), won't necessarily replace the commercial microphone arrays for locating shooters that public safety officials already use, although it may be a useful supplement for public safety when commercial arrays aren't available. One key motivation for assembling VERA was to create a tool that could be used by human rights workers and journalists who investigate war crimes, terrorist acts and human rights violations, Hauptmann said.
The robot that will save you a parking space on Black Friday
Days of circling the parking lot looking for a space may be a thing of the past. A new service, called MyPark, aims to make finding a spot much easier by letting users reserve it beforehand. When a spot is reserved, MyPark 'automatically knows you're there' and deploys a smartphone-controlled robot to make sure no one takes your place. Download the MyPark app and make an account.
Can Machine Learning Turn Big Data into No Big Deal?
With technology moving so fast, new ways to automate, and connected machines, how can managers and engineers simplify the complexity of that ecosystem? Is machine learning (ML) or artificial intelligence (AI) the key? This article will define some buzzwords, explain what they mean, and consider whether they might help simplify these complex technologies so that you can get back to production. New technologies, such as Big Data and the Industrial Internet of Things, are gaining more traction. While security is a concern, some companies push ahead because the benefits are too great.
The Find-the-Remote Event
The Find-the-Remote event was considered the most challenging of the events in the 1997 AAAI Mobile Robot Competition and Exhibition. It required a broad range of both hardware and software capabilities. I discuss the rules and rationale for the event as well as the results. It involved fetching a known set of objects from unknown, but constrained, locations in a known environment. In real life, such functions might be useful for in-home care of the elderly or the physically disabled. This event was extremely difficult because it forced teams to implement both manipulation (the grasping and moving of objects) and visual object recognition. Furthermore, it explicitly required teams to implement them for a wide range of objects. It therefore eliminated a broad range of special-purpose sensing and manipulation strategies that would be specific to one or another class of objects. It also required that objects be lifted from a variety of surfaces (real furniture) at a variety of heights.