beginner
Evaluating Long-Context Reasoning in LLM-Based WebAgents
Chung, Andy, Zhang, Yichi, Lin, Kaixiang, Rawal, Aditya, Gao, Qiaozi, Chai, Joyce
As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long-context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long-context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long-context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long-context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
- North America > The Bahamas (0.14)
- North America > United States > New York (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Workflow (0.93)
- Research Report > New Finding (0.93)
- Media (1.00)
- Consumer Products & Services (1.00)
- Transportation (0.93)
- Leisure & Entertainment > Sports > Basketball (0.46)
Animating Language Practice: Engagement with Stylized Conversational Agents in Japanese Learning
Rackauckas, Zackary, Hirschberg, Julia
We explore Jouzu, a Japanese language learning application that integrates large language models with anime-inspired conversational agents. Designed to address challenges learners face in practicing natural and expressive dialogue, Jouzu combines stylized character personas with expressive text-to-speech to create engaging conversational scenarios. We conducted a two-week in-the-wild deployment with 52 Japanese learners to examine how such stylized agents influence engagement and learner experience. Our findings show that participants interacted frequently and creatively, with advanced learners demonstrating greater use of expressive forms. Participants reported that the anime-inspired style made practice more enjoyable and encouraged experimenting with different registers. We discuss how stylization shapes willingness to engage, the role of affect in sustaining practice, and design opportunities for culturally grounded conversational AI in computer-assisted language learning (CALL). By framing our findings as an exploration of design and engagement, we highlight opportunities for generalization beyond Japanese contexts and contribute to international HCI scholarship.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming
Chen, Rufeng, Jiang, Shuaishuai, Shen, Jiyun, Moon, AJung, Wei, Lili
Abstract: The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion, often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem-solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.
The rapid development of Generative Artificial Intelligence (GenAI) has led to its widespread adoption across various domains to boost productivity and streamline workflows. Large Language Models (LLMs), such as OpenAI's ChatGPT and Codex, Google Gemini, and GitHub Copilot, have been integrated into domains including software engineering [1], [2], healthcare [3], education [4], creative writing [5], [6], and digital music [7], offering capabilities such as code generation, question answering, and image generation. Some studies evaluated GenAI's performance on programming tasks [8], user interface design education [9], and computer vision coursework [10]. Others focused on assessing the accuracy and usability of GenAI-generated responses [11], [12].
- North America > Canada > Quebec > Montreal (0.15)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Colorado > Weld County > Evans (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education > Curriculum > Subject-Specific Education (0.68)
- Education > Educational Setting (0.48)
- Education > Educational Technology (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Wang, Keyu, Li, Jin, Yang, Shu, Zhang, Zhuoran, Wang, Di
Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts ("I believe...") consistently induce higher sycophancy rates than third-person framings ("They believe...") by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
A beginner's guide to ChatGPT: Make AI work for you
We'll show you how to get started, what you can do, and how to make ChatGPT work for you instead of the other way round. Hardly anyone can have missed the AI phenomenon that has taken the world by storm. Almost every major company has some kind of AI initiative now. Politicians talk about how important it is not to "fall behind in the AI race," and hundreds of millions have started using AI chatbots. The AI wave took off when OpenAI released its chatbot ChatGPT, which gives large language models a conversational interface.
I Can't Stop Playing Duolingo Chess
I'm embarrassed to admit this in my mid-forties, but I've never understood chess well enough to play a full game. My son and daughter both learned how to play in elementary school. I was glad they had that experience. I tried to pick up the game when they did, but, as a busy mom of three little kids, I just didn't have the time, the interest, or the stamina to really sit down and learn. Chess became more popular during the pandemic, and the boom has stuck around; according to a recent Yougov.com
- Leisure & Entertainment > Games > Chess (1.00)
- Education > Educational Setting (0.73)
Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks
Burkanova, Bermet, Yazdian, Payam Jome, Zhang, Chuxuan, Evans, Trinity, Tuttösí, Paige, Lim, Angelica
Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner's proficiency, using haptic signaling as a primary form of communication. While today's AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text: it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Burnaby (0.04)
- Europe > United Kingdom (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Health & Medicine > Therapeutic Area (0.94)
- Leisure & Entertainment (0.67)
Investing apps: which offer the most for beginners?
Rachel Reeves and her government colleagues are keen to get more Britons investing in the stock market. She said recently that a lot of money was being put into cash savings accounts "when it could be invested in equities, in stock markets, and earn a better return". The good news is that the rise of DIY tools and mobile apps means it is now easier than ever to get investing. However, the vast array of options can make it daunting to know where to start. For new investors who don't have the time or confidence to manage a portfolio, "robo-advisers" can be a good option. They might sound like something out of a sci-fi movie but are basically online investment platforms that use technology to help automate the process.
- Europe > United Kingdom (0.48)
- Europe > Italy (0.05)
This futuristic surfboard lets you fly above water at 25 mph
Have you ever imagined what it would be like to glide over the water, the wind whipping past your face and actually feel in control the whole time? If that sounds exciting, you'll want to check out the latest electric hydrofoil from Unifoil. The Hydroflyer Sport brings something new to the table with its handlebars, giving you extra control whether you're just starting out or you're always chasing your next thrill on the water. The Hydroflyer Sport is an electric hydrofoiling board that lets you "fly" above the water.
- Leisure & Entertainment > Sports (0.36)
- Media > News (0.31)
Develop valuable data visualization skills and learn to code for only 50
If you feel like tech advances have passed you by because you've never learned to code or use AI, you could not be more wrong. Thank goodness it's no longer necessary to return to school to develop new skills. You can now learn valuable data wrangling skills and learn how to code with the Microsoft Visual Studio Professional 2022 The Premium Learn to Code Certification Bundle. It should be no surprise that Microsoft Visual Studio Professional 2022 has a perfect 5-star rating on Microsoft Choice Software. The Live Share feature makes collaboration seamless, CodeLens provides deep insights from your code, and IntelliCode tops it all off by allowing you to type less while coding more.
- Education (0.53)
- Marketing (0.40)
- Information Technology (0.36)