
Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents

Kim, Myung Ho

arXiv.org Artificial Intelligence

Large language models have advanced natural language understanding and generation, but their use as autonomous agents introduces architectural challenges for multi-step tasks. Existing frameworks often mix cognition, memory, and control in a single prompt, reducing coherence and predictability. The Structured Cognitive Loop (SCL) is proposed as an alternative architecture that separates these functions. In SCL, the language model handles cognition, memory is stored externally, and execution is guided by a lightweight controller within a goal-directed loop. This design allows intermediate results to be recorded and verified before actions are taken, improving traceability and evaluation. SCL is evaluated against prompt-based baselines such as ReAct and LangChain agents across three tasks: travel planning, conditional email drafting, and constraint-guided image generation. Under matched settings, SCL achieves an average task success rate of 86.3 percent, compared with 70.5 to 76.8 percent for baselines. It also shows higher goal fidelity, fewer redundant calls, and reduced unsupported assertions. These results indicate that separating cognition, memory, and control can enhance reliability and interpretability without relying on larger models or heavier prompts. The findings should be regarded as preliminary evidence, with broader tests across model families and task domains planned for future work.
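The abstract names the three separated functions but gives no implementation detail, so the following is only a minimal sketch of the loop shape it describes: the controller owns the iteration, memory is an external record of verified intermediate results, and the model only proposes steps. All names (`cognition`, `verify`, `run_loop`) and the toy step list are illustrative, not the paper's API.

```python
def cognition(goal, memory):
    """Stand-in for the language model: proposes the next step toward the goal."""
    done = [entry["step"] for entry in memory]
    remaining = [s for s in ["plan", "draft", "review"] if s not in done]
    return remaining[0] if remaining else None  # None signals the goal is met

def verify(step, result):
    """Controller-side check run before a result is recorded or acted on."""
    return result is not None

def run_loop(goal, max_iters=10):
    memory = []  # external memory: an auditable trace of verified intermediate results
    for _ in range(max_iters):
        step = cognition(goal, memory)
        if step is None:
            break  # goal reached; controller ends the loop
        result = f"{step} output for {goal}"  # execute the proposed step
        if verify(step, result):              # verify before committing to memory
            memory.append({"step": step, "result": result})
    return memory

trace = run_loop("travel plan")
```

Because every intermediate result passes through `verify` and lands in an inspectable trace, the traceability claim in the abstract falls out of the structure rather than the prompt.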


TripTide: A Benchmark for Adaptive Travel Planning under Disruptions

Karmakar, Priyanshu, Chaudhuri, Soumyabrata, Mallick, Shubhojit, Gupta, Manish, Jana, Abhik, Ghosh, Shreya

arXiv.org Artificial Intelligence

Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
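The abstract does not give formulas for its metrics, but two of them are easy to operationalize in a plausible way. The sketch below shows one such reading, assuming an itinerary is an ordered list of stops: intent preservation as the fraction of must-visit stops that survive a revision, and sequential consistency as the fraction of ordered stop pairs that keep their relative order. These are illustrative proxies, not TripTide's actual definitions.

```python
from itertools import combinations

def intent_preservation(revised, must_visit):
    """Fraction of must-visit stops that survive the revision."""
    return sum(stop in revised for stop in must_visit) / len(must_visit)

def sequential_consistency(original, revised):
    """Fraction of ordered stop pairs from the original plan that keep
    their relative order in the revised plan (stops dropped entirely
    are excluded from the comparison)."""
    pos = {stop: i for i, stop in enumerate(revised)}
    pairs = [(a, b) for a, b in combinations(original, 2)
             if a in pos and b in pos]
    if not pairs:
        return 0.0
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs)

original = ["museum", "beach", "flight", "dinner"]
revised = ["museum", "aquarium", "flight", "dinner"]  # beach closed by weather
```

Here the weather disruption swaps one stop but leaves order intact, so sequential consistency stays at 1.0 while intent preservation depends on which stops the traveler marked as essential.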


Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning

Racharak, Teeradaj, Ragkhitwetsagul, Chaiyong, Sontesadisai, Chommakorn, Sunetnanta, Thanwadee

arXiv.org Artificial Intelligence

In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce MMT4NL, a framework for evaluating the trustworthiness of in-context learning by combining adversarial perturbations with software testing techniques. It covers diverse linguistic capabilities for testing the ICL behavior of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just like testing software. Finally, we demonstrate applications of MMT4NL on sentiment analysis and question-answering tasks. Our experiments reveal various linguistic bugs in state-of-the-art LLMs.
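The core idea, metamorphic testing, can be shown in a few lines: apply a meaning-preserving perturbation to an input and assert that the model's label does not change, so no ground-truth label is needed. The sketch below uses a keyword heuristic as a stand-in for the LLM and two toy perturbations; the perturbation set, the relation, and all names are illustrative, not MMT4NL's actual test suite.

```python
def model(text):
    """Stand-in for an LLM sentiment classifier (keyword heuristic)."""
    return "positive" if "love" in text.lower() or "great" in text.lower() else "negative"

# Meaning-preserving perturbations: the metamorphic relation says the
# predicted label must be invariant under each of them.
PERTURBATIONS = [
    lambda t: t + " Anyway.",              # append a semantically empty tail
    lambda t: t.replace("movie", "film"),  # synonym swap
]

def metamorphic_test(seed_inputs):
    """Return (original, variant) pairs where the label invariance breaks."""
    failures = []
    for text in seed_inputs:
        expected = model(text)
        for perturb in PERTURBATIONS:
            variant = perturb(text)
            if model(variant) != expected:
                failures.append((text, variant))
    return failures
```

Each returned pair is a concrete, reproducible "bug report" against the prompt or model, which is what lets the framework treat an LLM like software under test.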


Towards Transparent AI: A Survey on Explainable Large Language Models

Palikhe, Avash, Yu, Zhenyu, Wang, Zichong, Zhang, Wenbin

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have played a pivotal role in advancing Artificial Intelligence (AI). However, despite their achievements, LLMs often struggle to explain their decision-making processes, making them a 'black box' and presenting a substantial challenge to explainability. This lack of transparency poses a significant obstacle to the adoption of LLMs in high-stakes applications, where interpretability is particularly essential. To overcome these limitations, researchers have developed various explainable artificial intelligence (XAI) methods that provide human-interpretable explanations for LLMs. However, a systematic understanding of these methods remains limited. To address this gap, this survey provides a comprehensive review of explainability techniques, categorizing XAI methods based on the underlying transformer architectures of LLMs: encoder-only, decoder-only, and encoder-decoder models. These techniques are then examined in terms of how their explanatory quality is evaluated, and the survey further explores how the resulting explanations are leveraged in practical applications. Finally, it discusses available resources, ongoing research challenges, and future directions, aiming to guide continued efforts toward developing transparent and responsible LLMs.


Doctor Who 'Lux' review: Hope can change the world

Engadget

It's an interesting time to be a long-running science fantasy media property in the streaming TV age. Star Trek is in the grip of an existential crisis as it (wrongly) fears it's too old to be relevant. Star Wars became a battlefield in the culture war and, to duck all future bad faith criticism, gave us The Rise of Skywalker. And then there's Doctor Who, which is somehow managing to plough a 62-year furrow and still fill it with original ideas. This week the Doctor and Belinda go up against a sentient cartoon holding the patrons of a 1950s cinema hostage.


Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents

Hu, Hanxu, Vamvas, Jannis, Sennrich, Rico

arXiv.org Artificial Intelligence

LLMs have paved the way for truly simple document-level machine translation, but challenges such as omission errors remain. In this paper, we study a simple method for handling document-level machine translation by leveraging previous contexts in a multi-turn conversational manner. Specifically, by decomposing documents into segments and iteratively translating them while maintaining previous turns, this method ensures coherent translations without additional training, and can fully re-use the KV cache of previous turns, thus minimizing computational overhead. We further propose a 'source-primed' method that first provides the whole source document before multi-turn translation. We empirically show that this multi-turn method outperforms both translating entire documents in a single turn and translating each segment independently, according to multiple automatic metrics across representative LLMs, establishing a strong baseline for document-level translation using LLMs.
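The conversational structure the abstract describes can be sketched as plain message bookkeeping: prime the chat with the full source document, then translate one segment per turn while keeping all earlier turns in the history (which is what makes KV-cache reuse possible). The function below is a minimal illustration with an injected `translate_turn` callable standing in for the LLM; the prompt wording is an assumption, not the paper's.

```python
def source_primed_messages(segments, translate_turn):
    """Build a source-primed multi-turn translation conversation.

    `translate_turn` stands in for one LLM chat call: it receives the full
    message history and returns the translation of the latest segment.
    """
    # Turn 0: prime the model with the whole source document up front.
    messages = [{"role": "user",
                 "content": "Source document:\n" + "\n".join(segments)},
                {"role": "assistant", "content": "Ready."}]
    translations = []
    for seg in segments:
        messages.append({"role": "user", "content": f"Translate: {seg}"})
        out = translate_turn(messages)  # prior turns are a stable prefix: KV cache reusable
        messages.append({"role": "assistant", "content": out})
        translations.append(out)
    return translations, messages
```

Because every call only appends to the history, the token prefix of turn *n* is exactly the history of turn *n-1*, which is the property that lets an inference server reuse the cached keys and values.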


Waymo announces it's expanding to Miami

Engadget

Get ready to have that Will Smith song stuck in your head for the rest of the day, because the autonomous taxi company Waymo is going to Miami. Waymo announced its Miami plans on its official Waypoint blog. The expansion will start early next year as the company gets its fleet of self-driving Jaguar I-PACE EVs familiar with Miami's streets and intersections. Then in 2026, Waymo plans to start offering rides to customers through the Waymo One app. Waymo is also partnering with the African startup Moove as part of its expansion plans.


Methods of Automatic Matrix Language Determination for Code-Switched Speech

Iakovenko, Olga, Hain, Thomas

arXiv.org Artificial Intelligence

Code-switching (CS), the process of speakers interchanging between two or more languages, is becoming increasingly common in the modern world. To better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language: the language that provides the grammatical structure for a CS utterance. In this work, the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), the typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases, while also outperforming LID on an MLID recognition task in terms of macro F1 (60%) and correlation score (0.38). This approach identified that the non-English languages (Mandarin and Spanish) are preferred over English as the Matrix Language, contrary to the monolingual choice made by LID.
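Under MLF theory, the Matrix Language is the one supplying the grammatical frame, in particular the system (function) morphemes. As a very rough illustration of that principle, the sketch below guesses the Matrix Language of an English/Spanish utterance by counting which language contributes more function words. The word lists are tiny and hand-picked, and real MLID systems (including the paper's) are far more sophisticated; this only shows the direction of the heuristic.

```python
# Tiny, illustrative function-word inventories (system morphemes in MLF terms).
EN_FUNCTION = {"the", "is", "are", "and", "of", "to", "in"}
ES_FUNCTION = {"el", "la", "es", "y", "de", "que", "en", "un"}

def matrix_language(tokens):
    """Crude MLID proxy: pick the language supplying more function words.

    Ties default to Spanish here; a real system would need a principled
    tie-break and morphological analysis, not just word counting.
    """
    en = sum(t.lower() in EN_FUNCTION for t in tokens)
    es = sum(t.lower() in ES_FUNCTION for t in tokens)
    return "en" if en > es else "es"
```

For "el problema es que I don't know", the Spanish side supplies the frame words, so the heuristic calls Spanish the Matrix Language even though an acoustic LID system might score the utterance as mostly English.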


Dating Apps Destroyed In-Person Romance. Now They're Trying to Revive It.

Slate

In the hour before the Chaotic Singles x Tinder dating event kicked off at the Moxy South Beach in Miami, the sky opened and the downpour began. The patrons of the nearby restaurant where I'd been dining were caught in the deluge, the rain soaking them as though they'd just swum in directly from Biscayne Bay. This perhaps had a cleansing effect--some sort of spiritual clean slate upon which to begin the night's mingling endeavor. But on a more literal level, it meant that the hotel's gorgeous rooftop would no longer be the venue for the night's icebreakers and hopeful attempts at romance. Instead, the event would be held in the lobby, alongside guests of the hotel.


Florida Middle Schoolers Arrested for Allegedly Creating Deepfake Nudes of Classmates

WIRED

Two teenage boys from Miami, Florida were arrested in December for allegedly creating and sharing AI-generated nude images of male and female classmates without consent, according to police reports obtained by WIRED via public record request. The arrest reports say the boys, aged 13 and 14, created the images of students who were "between the ages of 12 and 13." The Florida case appears to be the first arrests and criminal charges to come to light over the sharing of AI-generated nude images. The boys were charged with third-degree felonies--the same level as crimes such as grand theft auto or false imprisonment--under a state law passed in 2022, which makes it a felony to share "any altered sexual depiction" of a person without their consent.