MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

Kirtane, Neeraja, Khanna, Yuvraj, Relan, Peter

arXiv.org Artificial Intelligence

Large language models excel on math benchmarks, but the robustness of their mathematical reasoning to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we argue for comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high school-level problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although the MATH benchmark is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%), while stronger models also show measurable degradation. Frontier models like GPT-5 and Gemini-2.5 Pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
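The abstract's core idea can be illustrated with a minimal sketch (not the authors' code; the template and names below are hypothetical): a word problem's surface details are swapped out while its numerical structure, and therefore its answer, stays fixed.

```python
# Minimal sketch of MathRobust-LV-style variant generation (hypothetical
# template): vary names and contexts, preserve numbers and the answer.

TEMPLATE = "{name} buys {n} {item}s at ${price} each. How much does {name} spend?"

def make_variant(name: str, item: str, n: int, price: int) -> dict:
    """Instantiate the template; the arithmetic (n * price) is invariant."""
    return {
        "question": TEMPLATE.format(name=name, item=item, n=n, price=price),
        "answer": n * price,
    }

baseline = make_variant("Alice", "book", 4, 3)
variant = make_variant("Ravi", "notebook", 4, 3)

# The surface text differs, but the gold answer is identical.
assert baseline["answer"] == variant["answer"] == 12
assert baseline["question"] != variant["question"]
```

A robustness evaluation then compares a model's accuracy on the baseline phrasing against its accuracy on such answer-preserving variants.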


'Baby Steps' Is a Hiking Game That Trolls 'Slightly Problematic' Men

WIRED

The walking simulator, launching September 23 on PlayStation and Steam, stars a jobless 35-year-old "privileged, white male" whose pride stops him from getting help. Game developer Bennett Foddy was watching a Greek myth unfold in front of him. A playtester for his latest project, Baby Steps, was struggling to navigate the game's lead, Nate, a 35-year-old "failson" in a stained onesie, up a slippery hill. Each time, the terrain proved too much, and Nate skidded uselessly down it. Foddy has a reputation for making onerous games that take a little bit of masochism to master.


Tech founder charged with fraud for 'AI' that was secretly overseas contract workers

Engadget

The US Department of Justice has indicted Albert Sangier for defrauding investors with misleading statements about his Nate financial technology platform. Founded by Sangier in 2018, Nate claimed it could offer shoppers a universal checkout app thanks to artificial intelligence. However, the indictment states that the so-called AI-powered transactions in Nate were actually completed by human contractors in the Philippines and Romania or by bots. Sangier raised more than $40 million from investors for the app. This case follows reporting by The Information in 2022 that cast light on Nate's use of human labor rather than AI.


Engadget Podcast: How AI will shape Apple's WWDC 2024

Engadget

We're gearing up to cover Apple's Worldwide Developers Conference (WWDC) next week! In this episode, Cherlynn and Devindra dive into everything they expect at WWDC: tons of AI announcements; more on iOS 18, iPadOS 18, and macOS 15; and hopefully some improvements for Vision Pro and visionOS. In addition, we chat about what we expect to see at Summer Game Fest and demonstrate how we used an AI editing tool to clear up some awful podcast audio. Devindra also talks with Justin Samuels, the founder of Render ATL, about why he started a massive tech conference in Atlanta. Listen below or subscribe on your podcast app of choice. If you've got suggestions or topics you'd like covered on the show, be sure to email us or drop a note in the comments! And be sure to check out our other podcast, Engadget News!

Humane AI warns users its battery case "may pose a fire risk" – 34:36

Welcome back to the Engadget podcast. This week we are getting ready for WWDC 2024, happening in a couple of days.


Engadget Podcast: MoviePass founder Stacy Spikes on the MovieCrash documentary

Engadget

In this episode, Cherlynn and Devindra discuss Copilot+ and the potential rise of Arm-based Windows systems, and dive into the new Surface Pro and Surface Laptop.


Evaluating Very Long-Term Conversational Memory of LLM Agents

Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei

arXiv.org Artificial Intelligence

Existing works on long-term open-domain dialogue focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval-augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on average, spanning up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs struggle to understand lengthy conversations and to comprehend long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements, but these models still substantially lag behind human performance.
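The RAG strategy mentioned in the abstract can be sketched in miniature (hypothetical code, not the LoCoMo pipeline): given a memory question, retrieve the most relevant turns from earlier sessions before answering. A simple word-overlap score stands in here for a learned retriever.

```python
# Toy retrieval over multi-session dialogue history (hypothetical example;
# word overlap stands in for a trained embedding-based retriever).
import re

def tokens(text: str) -> set:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, history: list, k: int = 2) -> list:
    """Return the k history turns with the highest token overlap with the query."""
    overlap = lambda turn: len(tokens(query) & tokens(turn))
    return sorted(history, key=overlap, reverse=True)[:k]

history = [
    "Session 1: I adopted a dog named Biscuit last spring.",
    "Session 12: Work has been stressful lately.",
    "Session 30: Biscuit the dog just learned to fetch.",
]
top = retrieve("What is the name of the speaker's dog?", history)
# Both retrieved turns mention Biscuit; the stressful-work turn is filtered out.
```

The retrieved turns would then be placed in the model's context, which is the improvement (still short of human performance) that the abstract reports for RAG.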


'He's clumsy, high, and completely unprepared': Baby Steps, a game about falling flat on your face

The Guardian

Game developers Gabe Cuzzillo, Maxi Boch and Bennett Foddy have been friends for over a decade, having met through NYU's Game Centre, and they already have one successful indie game under their belt. Ape Out had you smashing through halls full of goons as a rampaging gorilla to the tune of a procedural jazz soundtrack. Their next game, Baby Steps, is equally unconventional. A walking simulator in a very literal sense, you'll awkwardly steer Nate, a caked-up, onesie-wearing basement dweller, to the top of a misty mountain. "On a controller, players use the triggers to lift and plant each foot while using the left stick to move the lifted foot around in the air, manually taking each of Nate's steps," Cuzzillo says.


Novice Type Error Diagnosis with Natural Language Models

Geng, Chuqin, Ye, Haolin, Li, Yixuan, Han, Tianyu, Pientka, Brigitte, Si, Xujie

arXiv.org Artificial Intelligence

Strong static type systems help programmers eliminate many errors without much burden of supplying type annotations. However, this flexibility makes it highly non-trivial to diagnose ill-typed programs, especially for novice programmers. Compared to classic constraint solving and optimization-based approaches, the data-driven approach has shown great promise in identifying the root causes of type errors with higher accuracy. Instead of relying on hand-engineered features, this work explores natural language models for type error localization, which can be trained in an end-to-end fashion without requiring any features. We demonstrate that, for novice type error diagnosis, the language model-based approach significantly outperforms the previous state-of-the-art data-driven approach. Specifically, our model could predict type errors correctly 62% of the time, outperforming the state-of-the-art Nate's data-driven model by 11%, in a more rigorous accuracy metric. Furthermore, we also apply structural probes to explain the performance difference between different language models.
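The end-to-end framing described above can be illustrated with a toy sketch (hypothetical code, not the paper's model): a scorer assigns each source line a suspiciousness score for being the root cause of a type error, and localization takes the argmax. A trivial heuristic stands in for the trained language model.

```python
# Toy sketch of type-error localization as per-line scoring (hypothetical;
# a hand-written heuristic stands in for a trained language model).

def localize(lines: list, scorer) -> int:
    """Return the index of the line the scorer deems most likely at fault."""
    scores = [scorer(line) for line in lines]
    return max(range(len(lines)), key=scores.__getitem__)

def toy_scorer(line: str) -> float:
    # Stand-in heuristic: flag lines mixing string literals with numbers,
    # a common novice type error.
    return 1.0 if ('"' in line and any(ch.isdigit() for ch in line)) else 0.0

program = [
    'let x = 1',
    'let y = "2" + 3',   # ill-typed: string added to an integer
    'let z = x + 1',
]
assert localize(program, toy_scorer) == 1
```

The paper's 62% figure corresponds to how often such a ranked prediction matches the true error location under its accuracy metric, with a language model rather than a heuristic doing the scoring.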


Go read this report on an AI shopping app that was actually just using humans

#artificialintelligence

If you can't be bothered to fill out your credit card and address details when shopping for jeans online, the Nate app sounds like a service you might want. The company bills itself as an "artificial intelligence startup" that uses AI to auto-fill customer information for $1 per transaction, saving shoppers a few minutes when completing purchases through the Nate app. But instead of using high-tech methods to complete purchases, Nate transactions were often handled manually by workers in the Philippines, according to a deep dive by The Information. Speaking to two people with direct access to Nate's internal data, The Information reports that "the share of transactions Nate handled manually rather than automatically ranged between 60 percent and 100 percent" throughout 2021. One person with knowledge of fundraising efforts told the outlet that the company didn't share its manual process with some potential investors while the company was trying to raise money.