part 4
Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression
Li, Zhihang, Song, Zhao, Wang, Zifan, Yin, Junze
There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question-answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well-studied in prior works, where the first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem. In contrast to previous works, our first layer is activated by a softmax unit. This sets the stage for future analyses of creating more activation functions based on the softmax function. Rearranging the softmax function leads to significantly different analyses. Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss. We prove that the loss function for the Hessian matrix is positive definite and Lipschitz continuous under certain assumptions. This enables us to establish local convergence guarantees for the proposed training algorithm. Specifically, with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations, our algorithm can find an $\epsilon$-approximate minimizer of the training loss with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C) + d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and $\omega < 2.374$ is the matrix multiplication exponent.
A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer
Li, Chenyang, Song, Zhao, Wang, Weixin, Yang, Chiwun
The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients. This approach poses a substantial threat to the privacy of individuals and organizations alike. This research presents a comprehensive analysis of the gradient leakage method when applied specifically to transformer-based models. Through meticulous examination, we showcase the capability to accurately recover data solely from gradients and rigorously investigate the conditions under which gradient attacks can be executed, providing compelling evidence. Furthermore, we reevaluate the approach of introducing additional noise on gradients as a protective measure against gradient attacks. To address this, we outline a theoretical proof that analyzes the associated privacy costs within the framework of differential privacy. Additionally, we affirm the convergence of the Stochastic Gradient Descent (SGD) algorithm under perturbed gradients. The primary objective of this study is to augment the understanding of gradient leakage attack and defense strategies while actively contributing to the development of privacy-preserving techniques specifically tailored for transformer-based models. By shedding light on the vulnerabilities and countermeasures associated with gradient leakage, this research aims to foster advancements in safeguarding sensitive data and upholding privacy in the context of transformer-based models.
A Unified Scheme of ResNet and Softmax
Song, Zhao, Wang, Weixin, Yin, Junze
Large language models (LLMs) have brought significant changes to human society. Softmax regression and residual neural networks (ResNet) are two important techniques in deep learning: they not only serve as significant theoretical components supporting the functionality of LLMs but also are related to many other machine learning and theoretical computer science fields, including but not limited to image classification, object detection, semantic segmentation, and tensors. Previous research works studied these two concepts separately. In this paper, we provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + A x , {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$, where $A$ is a matrix in $\mathbb{R}^{n \times d}$, $b$ is a vector in $\mathbb{R}^n$, and ${\bf 1}_n$ is the $n$-dimensional vector whose entries are all $1$. This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before. We derive the gradient, Hessian, and Lipschitz properties of the loss function. The Hessian is shown to be positive semidefinite, and its structure is characterized as the sum of a low-rank matrix and a diagonal matrix. This enables an efficient approximate Newton method. As a result, this unified scheme helps to connect two previously thought unrelated fields and provides novel insight into loss landscape and optimization for emerging over-parameterized neural networks, which is meaningful for future research in deep learning models.
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Gao, Yeqi, Song, Zhao, Wang, Weixin, Yin, Junze
Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is Kronecker product between $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$. $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The $X, Y \in \mathbb{R}^{d \times d}$ are variables we want to learn. $B \in \mathbb{R}^{n \times d}$ and $b_{j_0,i_0} \in \mathbb{R}$ is one entry at $j_0$-th row and $i_0$-th column of $B$, $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1= A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm to train loss function $L(X,Y)$ up $\epsilon$ that runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying $a \times b$ matrix another $b \times c$ matrix, and $\omega\approx 2.37$ denotes the exponent of matrix multiplication.
ABC of Deep Learning (Part 4 of 5)
Network Architecture refers to the overall neural network structure, including the number of layers, the number of neurons in each layer, and the connections between them. There is not a single best network architecture for all problems; the best architecture depends on the specific problem being solved. This is also a very active area of research, with new architectures being constantly proposed. Some wildly successful architectures have been proposed in recent years, such as convolutional neural networks and recurrent neural networks. The most advanced architectures are usually based on a combination of these basic types.
Computer Systems Performance Modeling And Evaluation - Part 4
It is developed in conjunction with the HINT benchmark program. It defines the quality of the solution as a user's final goal. The quality is rigorously defined on the basis of the mathematical characteristics of the problem being solved. Dividing the measure of solution quality by the time taken to achieve that level of quality produces QUIPS. It has several of the characteristics of a good performance metric. The mathematically precise definition of quality for the defined problem makes this metric insensitive to outside influences (characteristic 6) and entirely self-consistent when it is ported to different machines (characteristic 5).
APOSTLE TALK - Future News Now! : THERE'S MORE THAN ARTIFICIAL INTELLIGENCE – PART 4
Scientists have developed software that can look minutes into the future. Gravity in this galaxy [even outside our solar system] behaves as predicted by Albert Einstein's general theory of relativity, confirming the theory's validity on galactic scales. FYI: Our Sun is just ONE STAR among the hundreds of billions of stars in our Milky Way Galaxy. The technology is being built into the official postal system in countries like Mongolia, Ivory Coast and Nigeria, and Mercedes, an investor, is incorporating "What3Words" navigation into its cars. FYI: Each 10-foot-square patch of Planet Earth is labeled with three words -- 57 trillion squares altogether.
Exotic Programming Ideas: Part 4 (Datalog)
Continuing on in our series on exotic programming ideas, we're going to explore the topic of logic programming and a particular form known as datalog. Datalog is executed by a query processor that given these two inputs, finds all instance of facts implied by both the databased and rules. For our examples we're going to be coding our examples in the Souffle language. The namesake of the language is an acronym for the Systematic, Ontological, Undiscovered Fact Finding Logic Engine. Souffle is a minimalist datalog system designed for complex queries over large data sets, such as those encountered in the context of doing static program analysis over large codebases.