Goto

Collaborating Authors

 decoding


AutoJudge: Judge Decoding Without Manual Annotation

Neural Information Processing Systems

We introduce AutoJudge1, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify the generated tokens that affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8K with the Llama 3.1 70B target model, our approach achieves up to 2 speedup over speculative decoding at the cost of a 1% drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting 25 tokens per speculation cycle at a 2% drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.


Bridging Brains and Concepts: Interpretable Visual Decoding from fMRI with Semantic Bottlenecks

Neural Information Processing Systems

Decoding of visual stimuli from noninvasive neuroimaging techniques such as functional magnetic resonance (fMRI) has advanced rapidly in the last years; yet, most high-performing brain decoding models rely on complicated, non-interpretable latent spaces. In this study we present an interpretable brain decoding framework that inserts a semantic bottleneck into BrainDiffuser, a well established, simple and linear decoding pipeline. We firstly produce a 214 dimensional binary interpretable space L for images, in which each dimension answers to a specific question about the image (e.g., "Is there a person?",


Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

Neural Information Processing Systems

As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning.


Distribution-Aligned Decoding for Efficient LLMTask Adaptation

Neural Information Processing Systems

Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVDecode), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVDecode is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVDecode thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.


AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Neural Information Processing Systems

W paradigm e introduce that enhances Autoregressi image ve Retrie generation val Augmentation by autoregressi ( v A ely R-R incorporating AG), a novel knearest neighbor retrievals at the patch level. Unlike prior methods that perform a fix single, ed reference static retrie images, val before AR-RA generation G performs and conte condition xt-aware the retrie entire vals generation at each genon eration step, using prior-generated patches as queries to retrieve and incorporate the evolving most rele generation vant patch-le needs vel while visual avoiding references, limitations enabling (e.g., the o model ver-cop to ying, respond stylisto tic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a tion training-free of model-predicted plug-and-use patches decoding with the strate distrib gy that ution directly of retrie mer v ges ed patches, the distrib and u(2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method convolution that progressi operations vely and smooths leverages the them features to augment of retriev the ed patches image generation via multi-scale process.


Bridging Brains and Concepts: Interpretable Visual Decoding from fMRI with Semantic Bottlenecks

Neural Information Processing Systems

Decoding of visual stimuli from noninvasive neuroimaging techniques such as functional magnetic resonance (fMRI) has advanced rapidly in the last years; yet, most high-performing brain decoding models rely on complicated, non-interpretable latent spaces. In this study we present an interpretable brain decoding framework that inserts a semantic bottleneck into BrainDiffuser, a well established, simple and linear decoding pipeline. We firstly produce a $214-\text{dimensional}$ binary interpretable space $\mathcal{L}$ for images, in which each dimension answers to a specific question about the image (e.g., Is there a person?, Is it outdoors?).


Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Neural Information Processing Systems

Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVDecode), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVDecode is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVDecode thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.


AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Neural Information Processing Systems

We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.



Supplementary Material 1 Decoding using automatic differentiation inference ADVI

Neural Information Processing Systems

In the method section of our paper, we describe the general encoding-decoding paradigm. We provide a brief overview of our data preprocessing pipeline, which involves the following steps. We employ the method of Boussard et al. (2021) to estimate the location of Decentralized registration (Windolf et al., 2022) is applied to track and correct Figure 6: Motion drift in "good" and "bad" sorting recordings. "bad" sorting example, which is still affected by drift even after registration. To decode binary behaviors, such as the mouse's left or right choices, we utilize In this section, we provide visualizations to gain insights into the effectiveness of our proposed decoder.