Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, Stefano Ermon
Despite their groundbreaking performance on many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel discrete score matching loss that is more stable than existing methods, forms an ELBO for maximum likelihood training, and can be efficiently optimized with a denoising variant. We scale our Score Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2, achieving highly competitive likelihoods while also introducing distinct algorithmic advantages. In particular, when comparing similarly sized SEDD and GPT-2 models, SEDD attains comparable perplexities (typically within $+10\%$ of the baseline, and sometimes better). Furthermore, SEDD models learn a more faithful sequence distribution (around $4\times$ better than GPT-2 models with ancestral sampling, as measured by large models), can trade off compute for generation quality (needing $16\times$ fewer network evaluations to match GPT-2), and enable arbitrary infilling beyond standard left-to-right prompting.
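To make the "within $+10\%$" perplexity comparison concrete, here is a minimal sketch of how such a gap could be computed, assuming perplexity is taken as the exponential of the mean per-token negative log-likelihood in nats. The function names and all numbers are hypothetical and invented for illustration; they are not results from the paper. For a model evaluated through an ELBO, the per-token values would be bounds on the negative log-likelihood rather than exact values.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_gap(model_ppl, baseline_ppl):
    """Signed relative difference of a model's perplexity vs. a baseline."""
    return (model_ppl - baseline_ppl) / baseline_ppl

# Hypothetical per-token NLLs (illustrative only, not from the paper).
baseline_nlls = [3.0, 3.2, 2.8, 3.0]
model_nlls = [3.1, 3.2, 2.9, 3.1]

baseline_ppl = perplexity(baseline_nlls)
model_ppl = perplexity(model_nlls)
gap = relative_gap(model_ppl, baseline_ppl)

# True when the model's perplexity is at most 10% above the baseline.
within_10_percent = gap <= 0.10
print(f"baseline={baseline_ppl:.2f} model={model_ppl:.2f} gap={gap:+.1%}")
```

A negative `gap` would correspond to the model outperforming the baseline.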
Oct-25-2023