 Large Language Model


An efficient probabilistic hardware architecture for diffusion-like models

arXiv.org Artificial Intelligence

The proliferation of probabilistic AI has prompted proposals for specialized stochastic computers. Despite promising efficiency gains, these proposals have failed to gain traction because they rely on fundamentally limited modeling techniques and exotic, unscalable hardware. In this work, we address these shortcomings by proposing an all-transistor probabilistic computer that implements powerful denoising models at the hardware level. A system-level analysis indicates that devices based on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.


UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

arXiv.org Artificial Intelligence

Computer-use agents face a fundamental limitation. They rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves, on average, a 22% relative improvement while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.


Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

arXiv.org Artificial Intelligence

Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects the domain-specific challenges of genomics with machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters -- number of data loading workers and buffer sizes -- create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
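
The interaction between buffer size and storage order described in this abstract can be illustrated with a small sketch. It is not BEND's actual data pipeline: the `buffer_shuffle` and `mean_displacement` helpers, the buffer sizes, and the use of integer indices as a stand-in for genomically ordered samples are all assumptions made for illustration. It shows that a streaming shuffle with a small buffer barely moves samples away from their stored position, while data shuffled once before storage is well mixed regardless of buffer size.

```python
import random


def buffer_shuffle(stream, buffer_size, rng):
    """Streaming shuffle: once the buffer is full, emit one random element per new item."""
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Flush whatever is left at the end of the stream in random order.
    rng.shuffle(buf)
    yield from buf


def mean_displacement(order):
    """Average distance between a sample's output position and its original index."""
    return sum(abs(pos - item) for pos, item in enumerate(order)) / len(order)


n = 10_000
stored = list(range(n))  # hypothetical dataset stored in its original (e.g. genomic) order

# Same stored data, two different buffer sizes (a hardware-dependent setting).
small = list(buffer_shuffle(stored, 16, random.Random(0)))
large = list(buffer_shuffle(stored, 4096, random.Random(0)))

# Pre-shuffle once before "storage", then stream with the small buffer.
pre_shuffled = stored[:]
random.Random(1).shuffle(pre_shuffled)
pre_small = list(buffer_shuffle(pre_shuffled, 16, random.Random(0)))

print("small buffer, ordered storage:     ", mean_displacement(small))
print("large buffer, ordered storage:     ", mean_displacement(large))
print("small buffer, pre-shuffled storage:", mean_displacement(pre_small))
```

With ordered storage the degree of mixing depends strongly on the buffer size, whereas the pre-shuffled copy is already mixed before any buffer is applied, which is the intuition behind the fix the abstract proposes.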


ARE: Scaling Up Agent Environments and Evaluations

arXiv.org Artificial Intelligence

We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions to build complex and diverse environments, each with their own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguities and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI's second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.


Reparameterized LLM Training via Orthogonal Equivalence Transformation

arXiv.org Artificial Intelligence

While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
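
A minimal sketch of why orthogonal transformations preserve spectral properties, assuming the reparameterization takes the form W = R W0 P with R and P orthogonal and W0 fixed (the abstract does not spell out the exact form, so this is an assumption). The 2x2 setting, the rotation angles, and the helper names are hypothetical and chosen only so the example stays in pure Python; it does not reproduce POET's training procedure or its approximations.

```python
import math


def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def singular_values_2x2(m):
    """Singular values of a 2x2 matrix from its Frobenius norm and determinant."""
    (a, b), (c, d) = m
    q = a * a + b * b + c * c + d * d  # s1^2 + s2^2
    det = abs(a * d - b * c)           # s1 * s2
    disc = math.sqrt(max(q * q - 4 * det * det, 0.0))
    return (math.sqrt((q + disc) / 2), math.sqrt(max((q - disc) / 2, 0.0)))


W0 = [[2.0, 1.0], [0.5, -1.5]]  # stand-in for the fixed random weight matrix
R = rotation(0.7)               # stand-in for a learned orthogonal matrix
P = rotation(-1.3)              # stand-in for a second learned orthogonal matrix

W = matmul(matmul(R, W0), P)    # assumed form of the effective weight

sv0 = singular_values_2x2(W0)
sv = singular_values_2x2(W)
print("singular values of W0:", sv0)
print("singular values of W: ", sv)
```

The entries of W differ from those of W0, but the singular values agree up to floating-point error, so changing R and P alters the weights without changing their spectrum.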


Disney wants you to AI-generate yourself into your favorite Marvel movie

The Guardian

Users of OpenAI's video generation app will soon be able to see their own faces alongside characters from Marvel, Pixar, Star Wars and Disney's animated films, according to a joint announcement from the startup and Disney on Thursday. Perhaps you, Lightning McQueen and Iron Man are all dancing together in the Mos Eisley Cantina. Sora is an app made by OpenAI, the firm behind ChatGPT, which allows users to generate videos of up to 20 seconds through short text prompts. Disney announced that it would invest $1bn in OpenAI and, under a three-year deal perhaps worth even more than that large sum, that it would license about 200 of its iconic characters - from R2-D2 to Stitch - for users to play with in OpenAI's video generation app.


Disney's deal with OpenAI is about controlling the future of copyright

Engadget

It's no accident the company picked a partner it could control. This morning Disney and OpenAI announced a three-year licensing agreement: Starting in 2026, ChatGPT and Sora can generate images and videos incorporating Disney IP, including more than 200 characters from the company's stable of Star Wars, Pixar and Marvel brands. To say these companies make for strange bedfellows is an understatement. Before OpenAI released Sora, the company reportedly notified studios and talent agencies they would need to opt out of having their work appear in the new app. The law effectively froze the advancement of the public domain in the United States, with Disney being the greatest beneficiary. On the face of it, it's unclear OpenAI is getting much value out of the deal.


I Am Time Magazine's Person of the Year

The Atlantic - Technology

It's rude to boast, but here in 2025, you've got to take the wins where you can get them. This morning, Time magazine announced its Person of the Year, and it's me. If you want to get all technical about it, Time's Person of the Year is not a person at all but a collection of people: the architects of AI. One of the two covers released is a re-creation of the "Lunch Atop a Skyscraper" photograph from 1932, which depicted blue-collar ironworkers suspended hundreds of feet in the air during the construction of 30 Rockefeller Plaza. In its image, Time replaces these laborers with tech personalities such as Mark Zuckerberg, Elon Musk, Sam Altman, and Jensen Huang.


OpenAI releases GPT-5.2 to take on Google and Anthropic

Engadget

The new model is all about professional work. OpenAI's code red response to Google's Gemini 3 Pro has arrived. On the same day the company announced a Sora licensing pact with Disney, it took the wraps off GPT-5.2. OpenAI is touting the new model as its best yet for real-world, professional use. "It's better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long contexts, using tools, and handling complex, multi-step projects," said OpenAI.


Lawsuit accuses ChatGPT of reinforcing delusions that led to a woman's death

Engadget

Stein-Erik Soelberg killed his mother and took his own life back in August. OpenAI has been hit with a wrongful death lawsuit after a man killed his mother and took his own life back in August. The suit names CEO Sam Altman and accuses ChatGPT of putting a target on the back of victim Suzanne Adams, an 83-year-old woman who was killed in her home. The victim's estate claims that her son, 56-year-old Stein-Erik Soelberg, engaged in delusion-soaked conversations with ChatGPT in which the bot validated and magnified certain paranoid beliefs. The suit goes on to suggest that the chatbot eagerly accepted delusional thoughts leading up to the murder and egged him on every step of the way.