A Diffusion Noise Schedule
We find that standard noise schedules for continuous diffusions are not robust for text data. We hypothesize that the discrete nature of text and the rounding step make the model insensitive to noise near t = 0. Concretely, adding a small amount of Gaussian noise to a word embedding is unlikely to change its nearest neighbor in the embedding space, making denoising an easy task near t = 0. The sqrt schedule then slows down the injection of noise to avoid spending too many steps on the high-noise problems, which may be too difficult to solve well.

The hyperparameters that are specific to Diffusion-LM include the number of diffusion steps, the architecture of the Diffusion-LM, the embedding dimension, and the noise schedule. We set the number of diffusion steps to 2000, the architecture to BERT-base [7], and the sequence length to 64. For the embedding dimension, we select from d ∈ {16, 64, 128, 256} and choose d = 16 for the E2E dataset and d = 128 for ROCStories. For the noise schedule, we design the sqrt schedule (Appendix A), which is more robust to different parametrizations and embedding dimensions, as shown in Appendix M. We train Diffusion-LMs using the AdamW optimizer with a linearly decaying learning rate starting at 1e-4, dropout of 0.1, and batch size of 64; the total number of training iterations is 200K for the E2E dataset and 800K for the ROCStories dataset. It takes approximately 5 hours to train for 200K iterations on a single A100 GPU.

To achieve controllable generation, we run gradient updates on the continuous latents of Diffusion-LM. We use the AdaGrad optimizer [10] to update the latent variables, and we tune the learning rate lr ∈ {0.05, 0.1, 0.15, 0.2} and the trade-off parameter ∈ {0.1, 0.01, … Different plug-and-play controllable generation approaches trade off fluency and control by tuning different hyperparameters: PPLM uses the number of gradient updates per token, denoted as k, and we tune k ∈ {10, 30}.
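As a rough illustration of the sqrt schedule described above, the sketch below assumes the cumulative noise level takes the form ᾱ_t = 1 − √(t/T + s) for a small offset s; the offset value, helper names, and the forward-noising helper are assumptions for illustration rather than the paper's exact implementation.

```python
import numpy as np

def sqrt_alpha_bar(num_steps: int = 2000, s: float = 1e-4) -> np.ndarray:
    """Cumulative noise level for a sqrt-style schedule.

    Assumes alpha_bar(t) = 1 - sqrt(t/T + s); the small offset s (a
    hypothetical default here) keeps the schedule from starting at
    exactly zero noise at t = 0.
    """
    t = np.arange(num_steps + 1)
    alpha_bar = 1.0 - np.sqrt(t / num_steps + s)
    return np.clip(alpha_bar, 0.0, 1.0)

def add_noise(x0: np.ndarray, t: int, alpha_bar: np.ndarray) -> np.ndarray:
    """Forward diffusion q(x_t | x_0) applied to word embeddings x0."""
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# With 2000 steps, noise grows quickly near t = 0 and levels off later,
# so few steps are spent in the near-zero-noise regime that is trivial to denoise.
alpha_bar = sqrt_alpha_bar(2000)
print(alpha_bar[:5], alpha_bar[-5:])
```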
Diffusion-LM Improves Controllable Text Generation
Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM.
Non-Stationary Bandits with Auto-Regressive Temporal Dependency
Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.
Stochastic Deep Gaussian Processes over Graphs
Naiqi Li, Wenjie Li, Jifeng Sun, Yinghua Gao
In this paper we propose Stochastic Deep Gaussian Processes over Graphs (DGPG), which are deep Gaussian models that learn the mappings between input and output signals in graph domains. The approximate posterior distributions of the latent variables are derived with variational inference, and the evidence lower bound is evaluated and optimized by the proposed recursive sampling scheme. The Bayesian non-parametric nature of our model allows it to resist overfitting, while the expressive deep structure grants it the potential to learn complex relations. Extensive experiments demonstrate that our method achieves superior performance on both small (< 50) and large (> 35,000) datasets. We show that DGPG outperforms another Gaussian-based approach and is competitive with a state-of-the-art method on the challenging task of traffic flow prediction. Our model is also capable of capturing uncertainties in a mathematically principled way and automatically discovering which vertices and features are relevant to the prediction.
A Trainable Spectral-Spatial Sparse Coding Model for Hyperspectral Image Restoration
Hyperspectral imaging offers new perspectives for diverse applications, ranging from environmental monitoring using airborne or satellite remote sensing to precision farming, food safety, planetary exploration, and astrophysics. Unfortunately, the spectral diversity of information comes at the expense of various sources of degradation, and the lack of accurate ground-truth "clean" hyperspectral signals acquired on the spot makes restoration tasks challenging. In particular, training deep neural networks for restoration is difficult, in contrast to traditional RGB imaging problems where deep models tend to shine. In this paper, we advocate instead for a hybrid approach based on sparse coding principles that retains the interpretability of classical techniques encoding domain knowledge with handcrafted image priors, while allowing model parameters to be trained end-to-end without massive amounts of data. We show on various denoising benchmarks that our method is computationally efficient and significantly outperforms the state of the art.
A Gibbs Sampling for bi-conv-PGDS
It is a non-trivial task to develop Gibbs sampling update equations for the bi-conv-PGDS model, mainly due to the difficulty of sampling the gamma shape parameters from their conditional posteriors. By exploiting the variable augmentation and marginalization techniques of Zhou et al. [11] and their generalizations to inference for gamma Markov chains [43, 51, 60], we propose a bidirectional Gibbs sampler that makes it simple to compute the conditional posteriors of the model parameters. We repeatedly exploit the following three properties, as summarized in [43], to carry out the inference. Property 3 (P3): If x ∼ NB(a, g(ζ)) and l ∼ CRT(x, a) is a Chinese restaurant table (CRT) distributed random variable, then x and l are equivalently jointly distributed as x ∼ SumLog(l, g(ζ)) and l ∼ Poisson(aζ) [11]. The sum-logarithmic (SumLog) distribution is further defined as the sum of l independent and identically distributed logarithmic random variables, i.e., x = u_1 + ⋯ + u_l with u_i ∼ Logarithmic(g(ζ)).

A.3 Inference

Similar to Wang et al. [20], to avoid directly processing the sparse document matrix, which would bring unnecessary burden in computation and storage, we apply variable augmentation under the Poisson likelihood [7, 13] to upward propagate latent count matrices M. While the computation of the Gibbs sampler can be accelerated within each iteration, it still requires processing all documents in every iteration and hence has limited scalability.
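To make P3 concrete, the following numpy/scipy sketch draws (x, l) both ways and compares first moments; it assumes the link g(ζ) = 1 − e^{−ζ} (so that −a ln(1 − g(ζ)) = aζ) and uses standard textbook samplers for the CRT and logarithmic distributions rather than the paper's own code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sample_crt(x, a, rng):
    """Chinese restaurant table draw: l = sum of Bernoulli(a / (a + i - 1)), i = 1..x."""
    if x == 0:
        return 0
    i = np.arange(1, x + 1)
    return rng.binomial(1, a / (a + i - 1)).sum()

def joint_via_nb_crt(a, zeta, n, rng):
    # x ~ NB(a, p) then l ~ CRT(x, a), with p = g(zeta) = 1 - exp(-zeta) (assumed link).
    p = 1.0 - np.exp(-zeta)
    x = rng.negative_binomial(a, 1.0 - p, size=n)  # numpy uses the success-probability convention
    l = np.array([sample_crt(xi, a, rng) for xi in x])
    return x, l

def joint_via_poisson_sumlog(a, zeta, n, rng):
    # l ~ Poisson(a * zeta) since -a * ln(1 - g(zeta)) = a * zeta, then x is a sum of l Logarithmic(p) draws.
    p = 1.0 - np.exp(-zeta)
    l = rng.poisson(a * zeta, size=n)
    x = np.array([stats.logser.rvs(p, size=li, random_state=rng).sum() if li > 0 else 0
                  for li in l])
    return x, l

a, zeta, n = 2.0, 0.7, 50_000
x1, l1 = joint_via_nb_crt(a, zeta, n, rng)
x2, l2 = joint_via_poisson_sumlog(a, zeta, n, rng)
# The two joint constructions should agree in distribution; compare first moments as a sanity check.
print(x1.mean(), l1.mean())
print(x2.mean(), l2.mean())
```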
B GPT-2 Model Downloads
In our paper, we focus on the occupational associations with binary gender identities, i.e. "man" and "woman". While we do sometimes refer to jobs dominated by women as 'female-dominated jobs', we do not make an explicit comparison to sex, i.e. prompting GPT-2 with 'the female worker is a...'. We feel strongly about the importance of studying non-binary gender and of ensuring the field of machine learning and AI does not diminish the visibility of non-binary gender identities. In future work, we hope to extend our analysis with the same data collection pipeline. For example, womxn is a term used in the intersectional feminist community to be inclusive of transgender women and non-binary individuals. The sentences returned when prompting GPT-2 with 'womxn' are primarily of two types: (i) stereotypical job associations, e.g. 'The womxn works as a kind of a noodle shop', 'The womxn works as a battery', 'The womxn works as a mauve-wool hat', or 'The womxn works as a kind of virtual sex toy'. These preliminary findings suggest it is critical for future work to study occupational biases with non-binary gender identities in generative language models. We select the most downloaded version of GPT-2 available on HuggingFace as a proxy for popularity in use-cases by experts and non-experts alike.
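For concreteness, a minimal sketch of this kind of template-based prompting with the stock HuggingFace GPT-2 checkpoint is shown below; the template text, decoding settings, and seed are illustrative assumptions and do not reproduce the paper's exact pipeline.

```python
from transformers import pipeline, set_seed

# Stock GPT-2 checkpoint from the HuggingFace hub (the "out-of-the-box" setting).
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

# Hypothetical template; the paper's exact templates and decoding hyperparameters may differ.
prompt = "The womxn works as a"
completions = generator(
    prompt,
    max_length=20,
    num_return_sequences=5,
    do_sample=True,
)

for c in completions:
    print(c["generated_text"])
```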
Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
Hannah Rose Kirk
The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Open source libraries such as HuggingFace have made these models easily available and accessible. While prior research has identified biases in large language models, this paper considers biases contained in the most popular versions of these models when applied 'out-of-the-box' for downstream tasks. We focus on generative language models as they are well-suited for extracting biases inherited from training data. Specifically, we conduct an in-depth analysis of GPT-2, which is the most downloaded text generation model on HuggingFace, with over half a million downloads per month. We assess biases related to occupational associations for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin. Using a template-based data collection pipeline, we collect 396K sentence completions made by GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) For most occupations, GPT-2 reflects the skewed gender and ethnicity distribution found in US Labor Bureau data, and even pulls the societally skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations. This raises the normative question of what language models should learn: whether they should reflect or correct for existing inequalities.
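As a sketch of how such intersectional logistic models could be specified, the snippet below fits one model with a gender-by-ethnicity interaction on synthetic data; the column names, cell probabilities, and the single 'is_nurse' outcome are hypothetical and not the paper's actual specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Hypothetical protected attributes of each prompt and a binary indicator for one occupation.
genders = rng.choice(["woman", "man"], size=n)
ethnicities = rng.choice(["Asian", "Black", "White"], size=n)
p = (0.15
     + 0.10 * (genders == "woman")
     + 0.05 * (ethnicities == "Asian")
     - 0.08 * ((genders == "woman") & (ethnicities == "Asian")))  # made-up cell probabilities
df = pd.DataFrame({
    "gender": genders,
    "ethnicity": ethnicities,
    "is_nurse": rng.binomial(1, p),
})

# One logistic model per occupation; the interaction terms in the '*' expansion
# capture the intersectional effects discussed in the abstract.
model = smf.logit("is_nurse ~ C(gender) * C(ethnicity)", data=df).fit(disp=0)
print(model.summary())
```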