Large Language Model
A plug-and-play Transformer module for task-agnostic reasoning
Large language models (LLMs) exhibit in-context learning abilities that enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying model for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our experiments reveal that LLM representations actually contain sufficient information to make good predictions. We therefore focus on LLMs' reasoning abilities and demonstrate that the performance gap exists because of their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner?
TextDiffuser: Diffusion Models as Text Painters
Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images with visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale dataset of text images with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
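The two-stage design described above is concrete enough to sketch. Below is a minimal, illustrative Python outline of the pipeline as the abstract presents it; every function here is a hypothetical stub standing in for the released model, not the actual TextDiffuser API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Box:
    """A keyword and the image region where it should be rendered."""
    word: str
    x: int
    y: int
    w: int
    h: int

def extract_keywords(prompt: str) -> list[str]:
    # Stub: treat quoted words as the text to render (the real system
    # extracts keywords to be painted from the prompt).
    return [w.strip("'\"") for w in prompt.split() if w[0] in "'\""]

def layout_stage(keywords: list[str]) -> list[Box]:
    # Stage 1 (assumed interface): a Transformer predicts where each
    # keyword goes; stubbed here as a fixed left-to-right layout.
    return [Box(w, x=20 + 120 * i, y=40, w=110, h=32)
            for i, w in enumerate(keywords)]

def diffusion_stage(prompt: str, layout: list[Box]) -> np.ndarray:
    # Stage 2 (assumed interface): a diffusion model denoises an image
    # conditioned on the prompt and character-level layout masks; stubbed
    # by rasterizing the boxes so the pipeline runs end to end.
    img = np.zeros((256, 256), dtype=np.uint8)
    for b in layout:
        img[b.y:b.y + b.h, b.x:b.x + b.w] = 255
    return img

def text_diffuser_pipeline(prompt: str) -> np.ndarray:
    return diffusion_stage(prompt, layout_stage(extract_keywords(prompt)))

image = text_diffuser_pipeline("a poster saying 'HELLO' 'WORLD'")
```

Decoupling layout prediction from image synthesis is what gives the described system its controllability: the same layout interface accepts either a predicted layout or a user-supplied text template.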
Learning Universal Policies via Text-Guided Video Generation
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision-making problem as a text-conditioned video generation problem: given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions, after which control actions are extracted from the generated video. By leveraging text as the underlying goal specification, we are able to naturally and combinatorially generalize to novel goals. The proposed policy-as-video formulation can further represent environments with different state and action spaces in a unified space of images, which, for example, enables learning and generalization across a variety of robot manipulation tasks.
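As a rough illustration of the policy-as-video formulation, here is a hedged Python sketch: a text-conditioned video model plans a rollout, and an inverse-dynamics model reads an action off each consecutive frame pair. Both `generate_video` and `inverse_dynamics` are assumed interfaces supplied by the caller, not a released API.

```python
import numpy as np
from typing import Callable, Sequence

def plan_with_video(
    goal_text: str,
    first_frame: np.ndarray,
    generate_video: Callable[[str, np.ndarray], Sequence[np.ndarray]],
    inverse_dynamics: Callable[[np.ndarray, np.ndarray], np.ndarray],
) -> list[np.ndarray]:
    """Policy-as-video: synthesize future frames for a text-specified goal,
    then extract a control action from each consecutive frame pair."""
    frames = generate_video(goal_text, first_frame)  # planned rollout f0..fT
    return [inverse_dynamics(frames[t], frames[t + 1])
            for t in range(len(frames) - 1)]
```

Because the plan lives in pixel space, the same loop applies unchanged across environments with different state and action spaces, which is the unification the abstract highlights.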
OpenAI's new ChatGPT agent can perform interactive tasks on your behalf
Imagine an AI bot that can fill out online forms, book airline flights, order groceries, and more. That's the intent of OpenAI's new Operator, an AI that acts as an independent agent to carry out your commands all on its own. Released as a research preview on Thursday, Operator is able to interact directly with a web browser. That means it can navigate web pages by typing, scrolling, and clicking in all the right spots, just as you would yourself. The difference here is that Operator aims to do all that without any intervention on your part.
Operator isn't worth its $200-per-month ChatGPT Pro subscription yet - here's why
This week, OpenAI is introducing a research preview called Operator. I initially wanted to do a hands-on, but once I found out that you need a Pro account (which costs $200 per month), I decided to watch the various OpenAI demos, share them with you, and then share my thoughts. Altman did say that users of the $20-per-month Plus plan would eventually be able to use Operator. Operator is an AI agent. Fundamentally, it simulates keyboard and mouse input in a browser, reading the screen and performing actions.
The Download: OpenAI's agent, and what to expect from robotics
What's new: After weeks of buzz, OpenAI has released Operator, its first AI agent. Operator is a web app that can carry out simple online tasks in a browser, such as booking concert tickets or filling an online grocery order. The app is powered by a new model called Computer-Using Agent (CUA for short), built on top of OpenAI's multimodal large language model GPT-4o. Why it matters: OpenAI claims that Operator outperforms similar rival tools, including Anthropic's Computer Use and Google DeepMind's Mariner. The fact that three of the world's top AI firms have converged on the same vision of what agent-based models could be makes one thing clear.
Grounded Mathematical Proof Generation with Language Models
Theorem proving in natural mathematical language (the mixture of symbolic and natural language used by humans) plays a central role in mathematical advances and education, and tests aspects of reasoning that are core to intelligence. Yet it has remained underexplored with modern generative models. We study large-scale language models on two new generation tasks: suggesting the next step in a mathematical proof, and full proof generation.
Blockwise Parallel Transformers for Large Context Models
Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformers (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training on sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
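To make the blockwise idea concrete, here is a small NumPy sketch of attention computed one key/value block at a time with a running (online) softmax, so the full Tq x Tk score matrix is never materialized. This illustrates the general technique, not the paper's JAX implementation, and it omits BPT's fusion of the feedforward network into the same blockwise loop.

```python
import numpy as np

def blockwise_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                        block: int) -> np.ndarray:
    """Attention over key/value blocks with a running softmax.
    q: [Tq, d]; k, v: [Tk, d]. Peak score memory is only [Tq, block]."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q, dtype=np.float64)          # running numerator
    denom = np.zeros((q.shape[0], 1))                 # running denominator
    running_max = np.full((q.shape[0], 1), -np.inf)   # running max for stability
    for s in range(0, k.shape[0], block):
        scores = (q @ k[s:s + block].T) * scale       # this block's scores only
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)       # correct earlier partial sums
        p = np.exp(scores - new_max)
        out = out * rescale + p @ v[s:s + block]
        denom = denom * rescale + p.sum(axis=-1, keepdims=True)
        running_max = new_max
    return out / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    s = (q @ k.T) / np.sqrt(8)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    ref = (p / p.sum(axis=-1, keepdims=True)) @ v     # naive full attention
    assert np.allclose(blockwise_attention(q, k, v, block=4), ref)
```

The rescaling trick is what lets partial softmax sums accumulated over earlier blocks stay valid when a later block raises the running maximum.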
Roll over, Darwin: How Google DeepMind's 'mind evolution' could enhance AI thinking
One of the big trends in artificial intelligence in the past year has been the employment of various tricks during inference (the act of making predictions) to dramatically improve the accuracy of those predictions. For example, chain-of-thought, in which a large language model (LLM) spells out the logic of an answer in a series of statements, can lead to increased accuracy on benchmark tests. Such "thinking" has apparently led to breakthroughs in accuracy on abstract tests of problem-solving, such as the high score OpenAI's o3 posted last month on the ARC-AGI test. It turns out, however, that LLMs still fall short on very practical tests, something as simple as planning a trip. Google DeepMind researchers, led by Kuang-Huei Lee, pointed out in a report last week that Google's Gemini and OpenAI's o1, the companies' best respective models, fail miserably when tested on TravelPlanner, a benchmark test introduced last year by scholars at Fudan University, Penn State, and Meta AI.
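The report describes an evolutionary, population-based search run at inference time. A minimal sketch of that pattern follows; the `llm` and `fitness` callables are hypothetical stand-ins for the model and the scoring function, and this is an illustration of the general idea, not DeepMind's implementation.

```python
import random
from typing import Callable

def mind_evolution_search(task: str,
                          llm: Callable[[str], str],
                          fitness: Callable[[str], float],
                          population: int = 8,
                          generations: int = 4) -> str:
    """Population-based inference-time search: sample candidate solutions,
    keep the fittest, and ask the model to refine them."""
    pop = [llm(f"Solve step by step: {task}") for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]                 # selection
        children = [llm(f"Improve this solution to '{task}':\n"
                        f"{random.choice(parents)}")     # mutation/refinement
                    for _ in range(population - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

The extra accuracy comes purely from spending more compute at inference time; the model's weights never change.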
Reflexion: Language Agents with Verbal Reinforcement Learning
Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
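The trial-reflect-retry loop the abstract describes can be sketched in a few lines. The `act`, `evaluate`, and `reflect` callables below are assumed LLM-backed components with hypothetical names, not the authors' code.

```python
from typing import Callable, Tuple

def reflexion_loop(task: str,
                   act: Callable[[str, list[str]], str],
                   evaluate: Callable[[str], Tuple[str, bool]],
                   reflect: Callable[[str, str, str], str],
                   max_trials: int = 5) -> str:
    """Reinforce via language rather than weight updates: after each failed
    trial, convert the feedback signal into a verbal reflection and store it
    in an episodic memory buffer that conditions the next attempt."""
    memory: list[str] = []                       # episodic buffer of reflections
    attempt = ""
    for _ in range(max_trials):
        attempt = act(task, memory)              # conditioned on past reflections
        feedback, success = evaluate(attempt)    # scalar or free-form feedback
        if success:
            break
        memory.append(reflect(task, attempt, feedback))
    return attempt
```

Because feedback enters only as text in the memory buffer, the same loop accepts scalar scores, compiler errors, or free-form critiques, which is the flexibility the abstract emphasizes.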