PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

Gokden, Burc

arXiv.org Artificial Intelligence 

February 18, 2025 A BSTRACT We show that Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enable the once-inferred energy-curvature tensor G LMto replace the deep neural network of power law graph attention (PLGA) generating the deductive outputs at inference. We demonstrate that a cache for G LM(G-cache) and KV -cache can be implemented in a straightforward manner to improve the inference time. The invariance and generalizable nature of deductive outputs is at a very high fidelity where deductive outputs have same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have distinct loss and accuracy characteristics from models pretrained with transferred, randomly initialized or identity tensors as a constant tensor operator and an LLM with scaled-dot product attention (SDP A) is a special case of PLDR-LLM where G LMis predefined as identity. The observed invariance characteristic introduces a novel asymmetry between training and inference phases with caching. We outline observed common characteristics of the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV -cache and G-cache. 1 Introduction Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a novel language model architecture with well-defined deductive and inductive outputs [Gokden, 2024]. It is composed of deep layers of decoders with multi-headed Power Law Graph Attention (PLGA) [Gokden, 2021, 2019]. The deductive outputs are intended to observe and regularize the model, while the inductive output is the next-token prediction of a language model. PLGA is a series of non-linear and linear transformations that attend to an input sentence that can be considered as a weighted graph G = ( V, E) where nodes are the tokens densely represented by an N-dimensional embedding space. The PLGA learns a metric tensor A LMof the embedding space after applying a custom fully connected layer and iSwiGLU, a positive semi-definite activation function, to the output A of a deep residual network of gated linear units (GLUs) whose input is a density matrix operator derived from the query.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found