 Large Language Model


Analysis and Explainability of LLMs Via Evolutionary Methods

arXiv.org Machine Learning

Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to better analyze and explain relationships among models. We show how relating weights to genotypes and output text to phenotypes can improve our understanding of model lineage, important datasets, the roles of different model layers, and visualization of model relationships. We demonstrate this in a controlled experiment, where our estimated evolutionary trees reliably recover the topology of the ground-truth training tree. We further identify the most important weight layers according to weight differences and show through phenotypic experiments that one training dataset appears to contribute more useful information than the others. Finally, we generate an unsupervised evolutionary tree of black-box foundation models. Throughout, we provide visualizations that support a clearer understanding of evolutionary relationships among LLMs.
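As a rough illustration of the "weights as genotypes" idea, the sketch below builds a tree from pairwise distances between flattened weight vectors. The abstract does not say which tree-estimation method the authors use, so average-linkage agglomerative clustering stands in here as an assumption, and the model names and weight values are hypothetical toy data.

```python
import math
from itertools import combinations


def l2_distance(a, b):
    """Euclidean distance between two flat weight vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_tree(weights):
    """Cluster models by weight distance (average linkage).

    weights: dict mapping a model name to a flat list of floats.
    Returns a nested tuple in which closer models are merged first.
    This is an illustrative stand-in, not the paper's method.
    """
    # Each cluster is a pair: (subtree, list of member model names).
    clusters = [(name, [name]) for name in sorted(weights)]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        total = sum(l2_distance(weights[a], weights[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them into one subtree.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [merged]
    return clusters[0][0]


# Hypothetical toy "models": two near the origin, two far from it.
toy_weights = {
    "base": [0.0, 0.0],
    "ft_a": [0.1, 0.0],
    "ft_b": [5.0, 5.0],
    "ft_b2": [5.2, 5.0],
}
tree = build_tree(toy_weights)
print(tree)
```

On real checkpoints the distance would be computed over selected layers rather than two-element lists, and the choice of distance and linkage would materially affect the recovered topology.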


Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers

arXiv.org Machine Learning

Transformers are effective at inferring the latent task from context via two inference modes: recognizing a task seen during training, and adapting to a novel one. Recent interpretability studies have identified from middle-layer representations task-specific directions, or task vectors, that steer model behavior. However, a lack of rigorous foundations hinders connecting internal representations to external model behavior: existing work fails to explain how task-vector geometry is shaped by the training distribution, and what geometry enables out-of-distribution (OOD) generalization. In this paper, we study these questions in a controlled synthetic setting by training small transformers from scratch on latent-task sequence distributions, which allows a principled mathematical characterization. We show that two inference modes can coexist within a single model. In-distribution behavior is governed by Bayesian task retrieval, implemented internally through convex combinations of learned task vectors. OOD behavior, by contrast, arises through extrapolative task learning, whose representations occupy a subspace nearly orthogonal to the task-vector subspace. Taken together, our results suggest that task-vector geometry, training distributions, and generalization behaviors are closely related.
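The in-distribution mode described above, Bayesian retrieval realized as a convex combination of task vectors, can be sketched numerically. This is a minimal illustration under stated assumptions, not the paper's model: the log-likelihoods, priors, and two-dimensional task vectors are hypothetical, and the posterior is computed with an ordinary softmax over log prior plus log likelihood.

```python
import math


def posterior_weights(log_likelihoods, log_priors):
    """Posterior over tasks: softmax of (log prior + log likelihood)."""
    scores = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_task_vectors(task_vectors, weights):
    """Convex combination of task vectors using the given weights."""
    dim = len(task_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, task_vectors))
        for i in range(dim)
    ]


# Hypothetical example: two learned tasks, context fits task 0 better.
task_vectors = [[1.0, 0.0], [0.0, 1.0]]
log_priors = [math.log(0.5), math.log(0.5)]
log_likelihoods = [-1.0, -3.0]

weights = posterior_weights(log_likelihoods, log_priors)
mixed = mix_task_vectors(task_vectors, weights)
print(weights, mixed)
```

Because the weights are nonnegative and sum to one, the mixed vector always lies inside the convex hull of the learned task vectors, which is why this mechanism alone cannot represent a task outside that hull.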


Xbox is ditching Microsoft's Copilot AI

Engadget

Microsoft announced plans to start stripping Copilot out of select Windows apps in March after criticism of the company's mishandling of its operating system reached a fever pitch. As it turns out though, Windows isn't the only place where you'll see less Copilot: Xbox CEO Asha Sharma has announced that the AI assistant will also be removed from the gaming brand's mobile app and Xbox consoles. Under previous Xbox leadership, Copilot was introduced as a sort of in-game assistant that would be aware of what you're playing and able to offer contextual advice based on what's on your screen. Microsoft launched a beta version of the experience by adding Copilot to the Xbox mobile app in May 2025, but based on a GDC presentation the company gave in March, the plan was to also bring Copilot to Xbox consoles later this year. Those plans apparently don't align with where Xbox is headed, Sharma said in a post announcing new hires to the Xbox division.


ChatGPT's new default model is dialing back the annoying emojis

PCWorld

PCWorld reports the update delivers 52.5% fewer hallucinations and 37.3% fewer inaccurate claims while providing more concise answers. Enhanced features include improved context integration from previous chats, files, and Gmail, plus transparency showing which memory sources influenced responses. One reason I took a break from ChatGPT a few months ago (I'm back now) was how sick to death I got of its constant emojis, especially when it came to lists. The brain emoji was my least favorite, along with the green checkmarks, the pointy fingers, and the yellow "hazard" signs. Well, I'll believe it when I see it, but with its latest "instant" model, OpenAI promises that we'll be getting way less of those "gratuitous" emojis in ChatGPT's responses.


US to safety test new AI models from Google, Microsoft, xAI

BBC News

New artificial intelligence (AI) tools and capabilities from Google, Microsoft and xAI will now be tested by the US Department of Commerce before they are released to the public. The tech firms have agreed to voluntarily submit their models for testing through Commerce's Center for AI Standards and Innovation (CAISI). The new pacts are an expansion on agreements by AI companies like OpenAI and Anthropic that were reached during the Biden Administration, and will see AI models from all of the companies evaluated for their capabilities and security. "These expanded industry collaborations help us scale our work in the public interest at a critical moment," CAISI's director Chris Fall said. Overall, the evaluations of the AI tools will cover testing, collaborative research and best practice development related to commercial AI systems.


The Download: inside the Musk v. Altman trial, and AI for democracy

MIT Technology Review

Plus: The Pentagon has struck sweeping AI deals for classified work. Week one of the Musk v. Altman trial: what it was like in the room Two of the most powerful figures in AI--Sam Altman and Elon Musk--are in the middle of a landmark legal showdown, with Musk alleging he was misled about OpenAI becoming a for-profit company. Our reporter Michelle Kim, who also happens to be a lawyer, has been in court each day, and has broken down the first week's key moments in her latest report. In a new Q&A, she also reveals what it was like in the room, the new details that have emerged about how Musk and OpenAI operate--and what we can expect from this week's proceedings. Find out what she's discovered so far, and if you want to keep up with MIT Technology Review's ongoing coverage of the Musk v. Altman trial, follow @techreview or @michelletomkim on X. Faster than many realize, AI is becoming the primary interface through which we form beliefs and participate in democratic self-governance. This shift could further strain already fragile institutions, but it could also help address problems like polarization and declining civic engagement.


He Couldn't Land a Job Interview. Was AI to Blame?

WIRED

Armed with some Python and a white-hot sense of injustice, one medical student spent six months trying to figure out whether an algorithm trashed his job application. It was mid-October, peak leaf-peeping season in Hanover, New Hampshire, and Chad Markey was on a rare break between clinical rotations during his last year of medical school. He should have been inhaling Green Mountain air and gossiping with his Dartmouth classmates about life after graduation. In a few months, they'd all be going their separate ways to start residency training at hospitals around the country. Instead, Markey was alone in his apartment, deep down a rabbit hole, preparing to go to war. He'd wake each morning, eat breakfast, open his laptop at the kitchen table or settle into the tan armchair with the good back support, and start coding. Some days, he wouldn't notice the sun had gone down until one of his roommates came home and asked why the lights weren't on. For days, Markey had been scrolling through a Discord group about medical residency, a font of crowdsourced knowledge where students report back to their peers on every stage of the application and selection process. He'd watched as other students, lots of them, posted about the interview invitations they'd received.


Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

arXiv.org Machine Learning

With the rapidly improving reasoning abilities of Large Language Models (LLMs), there is also a rising demand to use them in a wide variety of domains. This brings about the need to carefully evaluate the limits of the capabilities of these models with various tests and benchmarks. Graph structures are ubiquitous in real-world data, and are often used to represent and analyze relationship patterns within data. Many benchmarks have already been proposed in the graph literature to test the reasoning ability of LLMs to follow and execute graph algorithms. However, due to the limited context length of LLMs, these benchmarks consist of very small graphs. In real-world data, the size of graphs can be significantly larger, and in many cases, not fully accessible. In this paper, we examine a class of problems that arises with very large graphs having limited accessibility. We propose a large graph benchmark dataset, EstGraph, and introduce four distinct tasks designed to estimate large graph properties. We evaluate the reasoning abilities of LLMs on these tasks using a wide variety of graph datasets. In addition, we provide task-specific prompt constructions based on random walk sampling of large graphs (up to millions of nodes) that effectively convey sufficient information to LLMs within the limits of context length.
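To make the random-walk idea concrete, here is a minimal sketch of a classical estimator that recovers a graph's average degree from walk samples alone, without seeing the whole graph. It is not one of the EstGraph tasks or the paper's prompt construction; the function names and the toy star graph are assumptions for illustration. A simple random walk on a connected undirected graph visits nodes in proportion to their degree, so reweighting each visit by 1/degree (a harmonic-mean estimator) corrects that bias.

```python
import random


def random_walk(adj, start, steps, rng):
    """Return the nodes visited by a simple random walk of `steps` steps."""
    node = start
    visited = []
    for _ in range(steps):
        node = rng.choice(adj[node])  # move to a uniformly random neighbor
        visited.append(node)
    return visited


def estimate_average_degree(adj, start, steps, seed=0):
    """Harmonic-mean estimate of average degree from random-walk samples.

    Only the degrees of visited nodes are used, mimicking a setting where
    the full graph is not accessible.
    """
    rng = random.Random(seed)
    walk = random_walk(adj, start, steps, rng)
    inverse_degree_sum = sum(1.0 / len(adj[v]) for v in walk)
    return len(walk) / inverse_degree_sum


def true_average_degree(adj):
    """Exact average degree, for comparison on small graphs."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


# Hypothetical toy graph: a star with one hub (node 0) and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
estimate = estimate_average_degree(star, start=0, steps=1000)
print(estimate, true_average_degree(star))
```

A naive mean of the visited degrees would overestimate here, because the walk lands on the high-degree hub every other step; the 1/degree reweighting removes that bias.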


Greg Brockman Defends 30B OpenAI Stake: 'Blood, Sweat, and Tears'

WIRED

OpenAI's cofounder and president revealed in federal court on Monday that he's one of the largest individual stakeholders in the AI lab. Two days before the Musk v. Altman trial began, Elon Musk asked OpenAI cofounder and president Greg Brockman about reaching a settlement. When Brockman suggested both sides drop their claims, Musk responded, "By the end of this week, you and Sam [Altman] will be the most hated men in America. If you insist, so be it." The message--which OpenAI's lawyers made public on Sunday, and which Judge Yvonne Gonzalez Rogers subsequently refused to let the jury hear about--underscores what may be Musk's larger goal in this trial.


I love my new Codex AI pet -- and now I want one in every app

PCWorld

PCWorld explores OpenAI's new Codex AI pets, which provide visual status indicators for desktop AI agents through customizable on-screen companions. These pets address a key user experience issue by displaying red clocks when agent approval is needed and green checks upon task completion. The feature enhances multitasking efficiency by keeping users informed of AI agent activity without constant monitoring of the main interface. Whether I'm using Claude's desktop Cowork application or OpenAI's Codex coding app, I prefer that my AI agents check back with me before making high-stakes decisions. But while that makes for a safer setup, it also means my agents are often waiting around, twiddling their thumbs as they wait for me to approve their next steps. Now, if I'm sitting and watching the Cowork or Codex apps in action, I'll see right away when an agent is awaiting my approval. But if I'm working in another window or multitasking, I could easily miss the fact that an idled Cowork or Codex agent is sitting around, staring vacantly into space.