Tutorial #17: Transformers III Training
In part I of this tutorial we introduced the self-attention mechanism and the transformer architecture. In part II, we discussed position encoding and how to extend the transformer to longer sequence lengths. We also discussed connections between the transformer and other machine learning models. In this final part, we discuss challenges with transformer training dynamics and introduce some of the tricks that practitioners use to get transformers to converge. This discussion is aimed at researchers who already understand the transformer architecture and who are interested in training transformers and similar models from scratch. Despite their broad applications, transformers are surprisingly difficult to train from scratch. The input consists of an $I\times D$ matrix containing the $D$-dimensional embeddings for each of the $I$ input tokens.
Deep Learning for COVID-19 Diagnosis
Over the last several months, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly become a global pandemic, resulting in nearly 480,000 COVID-19-related deaths as of June 25, 2020 [6]. While the disease can manifest in a variety of ways, ranging from asymptomatic conditions or flu-like symptoms to acute respiratory distress syndrome, the most common presentation associated with morbidity and mortality is the presence of opacities and consolidation in a patient's lungs. Upon inhalation, the virus attacks and inhibits the lungs' alveoli, which are responsible for oxygen exchange. This opacification is visible on computed tomography (CT) scans. Because of their increased density, the affected areas appear as partially opaque regions with increased attenuation, known as ground-glass opacities (GGO).
Tutorial #5: variational autoencoders
The goal of the variational autoencoder (VAE) is to learn a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. There are two main reasons for modelling distributions. First, we might want to draw samples from the distribution to generate new plausible values of $\mathbf{x}$. Second, we might want to measure the likelihood that a new vector $\mathbf{x}^{*}$ was created by this probability distribution. In fact, it turns out that the variational autoencoder is well-suited to the former task but not to the latter. It is common to talk about the variational autoencoder as if it is the model of $Pr(\mathbf{x})$. However, this is misleading; the variational autoencoder is a neural architecture that is designed to help learn the model for $Pr(\mathbf{x})$.
Visualizing the gradient descent method
In the gradient descent method of optimization, a hypothesis function, $h_{\boldsymbol{\theta}}(x)$, is fitted to a data set, $(x^{(i)}, y^{(i)})$ $(i = 1,2,\cdots,m)$, by minimizing an associated cost function, $J(\boldsymbol{\theta})$, in terms of the parameters $\boldsymbol{\theta} = \theta_0, \theta_1, \cdots$. The cost function describes how closely the hypothesis fits the data for a given choice of $\boldsymbol{\theta}$. For example, one might wish to fit a given data set to a straight line, $h_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x$. To simplify things, consider fitting a data set to a straight line through the origin: $h_\theta(x) = \theta_1 x$. In this one-dimensional problem, we can plot a simple graph for $J(\theta_1)$ and follow the iterative procedure, which tries to converge on its minimum.
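The one-parameter case can be sketched in a few lines of Python. This is a minimal illustration, not the original article's code: the excerpt does not state the form of $J$, so the sketch assumes the usual mean-squared-error cost $J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta_1 x^{(i)} - y^{(i)}\right)^2$. The learning rate `alpha`, the step count `n_steps`, and the toy data set are likewise hypothetical choices made for illustration.

```python
def cost(theta1, xs, ys):
    """Assumed cost: J(theta1) = 1/(2m) * sum((theta1 * x - y)^2)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta1, xs, ys):
    """Derivative of the assumed cost with respect to theta1."""
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m


def gradient_descent(xs, ys, theta1=0.0, alpha=0.05, n_steps=200):
    """Repeatedly step against the gradient, recording (theta1, J) pairs."""
    history = [(theta1, cost(theta1, xs, ys))]
    for _ in range(n_steps):
        theta1 -= alpha * gradient(theta1, xs, ys)
        history.append((theta1, cost(theta1, xs, ys)))
    return theta1, history


# Toy data lying exactly on the line y = 2x (hypothetical example data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

theta1, history = gradient_descent(xs, ys)
print(f"fitted slope: {theta1:.6f}, final cost: {history[-1][1]:.3e}")
```

Plotting the pairs stored in `history` on top of the curve $J(\theta_1)$ would show the iterates moving down the parabola towards its minimum, which for this toy data sits at $\theta_1 = 2$. A learning rate that is too large for the data would make the iterates overshoot and diverge instead.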