The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections.
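The abstract's two modifications, centering the Softmax output at identity and scaling the logits by a width-dependent temperature, can be sketched concretely. The snippet below is an illustrative NumPy toy, not the paper's exact parameterization: the function name, shapes, and the choice of temperature are assumptions.

```python
import numpy as np

def shaped_attention(X, tau):
    """Illustrative sketch of a 'shaped' attention layer: Softmax logits
    are divided by a width-dependent temperature tau, and the attention
    matrix is centered at the identity (identity plus the deviation of
    the Softmax from the uniform matrix). Names and scalings here are
    assumptions for illustration, not the paper's exact formulation."""
    n, d = X.shape
    logits = (X @ X.T) / tau                       # temperature-scaled logits
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)           # row-wise Softmax
    # Center at identity: as tau grows, A tends to uniform and the
    # layer tends to the identity map, keeping the residual stream stable.
    A_shaped = np.eye(n) + (A - np.ones((n, n)) / n)
    return A_shaped @ X
```

In this toy version, a large temperature pushes the attention matrix toward the identity, so each layer becomes a small perturbation of the skip connection, which is the qualitative behavior the SDE limit formalizes.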
Weakly Supervised 3D Open-vocabulary Segmentation
Open-vocabulary segmentation of 3D scenes mirrors a fundamental capability of human perception and is thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps, but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with closed-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting the pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation.
Fine-Tuning Language Models with Just Forward Passes
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation).
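The core idea, estimating a gradient from two forward passes and applying the update in place so the perturbation vector never needs to be stored, can be sketched as follows. This is a simplified SPSA-style illustration of the ZO-SGD idea, not MeZO's actual implementation; the function names, constants, and the seed-regeneration trick's exact form are assumptions.

```python
import numpy as np

def mezo_step(params, loss_fn, lr=1e-3, eps=1e-3, seed=0):
    """One sketched zeroth-order (SPSA-style) step in the spirit of MeZO:
    perturb parameters in place by +eps*z and -eps*z, regenerating the
    random direction z from a fixed seed rather than storing it, estimate
    the directional derivative from two forward passes, then update.
    Illustrative only; names and constants are assumptions."""
    def perturb(scale):
        rng = np.random.default_rng(seed)          # same z every call
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1)
    loss_plus = loss_fn(params)                    # forward pass at theta + eps*z
    perturb(-2)
    loss_minus = loss_fn(params)                   # forward pass at theta - eps*z
    perturb(+1)                                    # restore theta exactly
    grad_est = (loss_plus - loss_minus) / (2 * eps)

    rng = np.random.default_rng(seed)              # regenerate z for the update
    for p in params:
        p -= lr * grad_est * rng.standard_normal(p.shape)
    return loss_fn(params)
```

Because z is regenerated from the seed instead of being materialized alongside the parameters, the optimizer's memory footprint stays at the level of inference, which is the property the abstract highlights.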
Super Speeders are deadly. This technology can slow them down.
In 2013, Amy Cohen experienced the unthinkable for a parent. It was a mild October day in New York City and her 12-year-old son Sammy stopped by the house to grab a snack on his way from school to soccer practice. When he stepped out onto their street in Brooklyn, Sammy was struck and killed by a speeding van. "It's a horror no parent should ever experience," Cohen told Popular Science.
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the rich semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT-3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
Box's new AI agents can organize, find, and extract data from documents for you
AI agents, as you've probably noticed, are all the rage in Silicon Valley. On Thursday, the content management platform Box joined a growing list of companies hoping to cash in on this latest tech trend. The new Box AI Agents are designed to help enterprise customers organize and retrieve critical information from files across the platform. Like many new "agentic" products, the agents are promoted as time-saving tools that enterprise customers can harness to reduce mundane tasks that tend to eat up large chunks of employees' workdays, like summarizing HR forms or pulling key details from lengthy contracts. The agents are being released as part of Box AI, the company's AI-powered content management tool, which debuted in late 2023.
AI PCs rely on NPUs. So what exactly are these newfangled chips?
CPUs and GPUs are old news. These days, the cutting edge is all about NPUs, and hardware manufacturers are talking up NPU performance. The NPU is a computer component designed to accelerate AI tasks in a power-efficient manner, paving the way for new Windows desktop applications with powerful AI features. All PCs will eventually have NPUs, but at the moment only some laptops have them. Here's everything you need to know about NPUs and why they're such a hot topic in the computer industry right now.
GPT-4.1 makes ChatGPT smarter, faster, and more useful for paying users, especially coders
OpenAI is now bringing GPT-4.1 to the Plus, Pro, and Team tiers of ChatGPT. GPT-4.1 was previously available only to API users. Since I'm throwing a whole lot of buzzwords at you, let's spend a minute deconstructing all these terms. OK, so that should bring you up to speed. Back in April, OpenAI released GPT-4.1 for developers to use via the API.
The Download: Montana's experimental treatments, and Google DeepMind's new AI agent
The news: A bill that allows clinics to sell unproven treatments has been passed in Montana. Under the legislation, doctors can apply for a license to open an experimental treatment clinic and recommend and sell therapies not approved by the Food and Drug Administration (FDA) to their patients. Why it matters: Once it's signed by the governor, the law will be the most expansive in the country in allowing access to drugs that have not been fully tested. The bill allows for any drug produced in the state to be sold in it, providing it has been through phase I clinical trials--but these trials do not determine if the drug is effective. The big picture: The bill was drafted and lobbied for by people interested in extending human lifespans.
LaSCal: Label-Shift Calibration without target labels
When machine learning systems face dataset shift, model calibration plays a pivotal role in ensuring their reliability. Calibration error (CE) provides insights into the alignment between the predicted confidence scores and the classifier accuracy. While prior works have delved into the implications of dataset shift on calibration, existing CE estimators either (i) assume access to labeled data from the target domain, often unavailable in practice, or (ii) are derived under a covariate shift assumption. In this work we propose a novel, label-free, consistent CE estimator under label shift. Label shift is characterized by changes in the marginal label distribution p(Y), with a constant conditional distribution p(X | Y) between the source and target. We introduce a novel calibration method, called LaSCal, which uses the estimator in conjunction with a post-hoc calibration strategy, to perform unsupervised calibration on the target distribution. Our thorough empirical analysis demonstrates the effectiveness and reliability of the proposed approach across different modalities, model architectures and label shift intensities.
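The label shift setting described above (p(Y) changes while p(X | Y) stays fixed) can be made concrete with a small example. The snippet below is a standard confusion-matrix (BBSE-style) estimator of the target label marginal without target labels; it only illustrates the label-shift setting and is not the LaSCal CE estimator itself, whose details the abstract does not give.

```python
import numpy as np

def estimate_target_priors(C, mu_t):
    """Estimate the target label marginal q = p_t(Y) under label shift,
    without any target labels, by solving C @ q = mu_t, where
    C[i, j] = p_s(pred = i | Y = j) is the classifier's confusion matrix
    on labeled source data and mu_t[i] is the marginal of the classifier's
    predictions on unlabeled target data. This is a standard BBSE-style
    estimator, shown only to illustrate the setting, not LaSCal itself."""
    q = np.linalg.solve(C, mu_t)   # valid when C is invertible
    q = np.clip(q, 0.0, None)      # project onto the simplex (non-negativity)
    return q / q.sum()
```

Under the label shift assumption, p_t(pred = i) = sum_j p_s(pred = i | Y = j) p_t(Y = j), which is exactly the linear system solved here; the recovered priors can then feed importance weights p_t(y)/p_s(y) into a downstream post-hoc calibration step.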