PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
arXiv.org Artificial Intelligence
February 18, 2025

ABSTRACT

We show that the Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enables the once-inferred energy-curvature tensor G_LM to replace, at inference, the deep neural network of power law graph attention (PLGA) that generates the deductive outputs. We demonstrate that a cache for G_LM (G-cache) and a KV-cache can be implemented in a straightforward manner to improve inference time. The invariance and generalizability of the deductive outputs hold at very high fidelity: after caching, the deductive outputs have the same RMSE and determinant values up to 15 decimal places, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have loss and accuracy characteristics distinct from those of models pretrained with transferred, randomly initialized, or identity tensors as a constant tensor operator, and that an LLM with scaled dot-product attention (SDPA) is a special case of PLDR-LLM where G_LM is predefined as identity. The observed invariance characteristic introduces a novel asymmetry between the training and inference phases with caching. We outline the common characteristics observed in the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.

1 Introduction

Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a novel language model architecture with well-defined deductive and inductive outputs [Gokden, 2024]. It is composed of deep layers of decoders with multi-headed Power Law Graph Attention (PLGA) [Gokden, 2021, 2019]. The deductive outputs are intended to observe and regularize the model, while the inductive output is the next-token prediction of a language model.
PLGA is a series of non-linear and linear transformations that attend to an input sentence, which can be considered a weighted graph G = (V, E) whose nodes are the tokens, densely represented in an N-dimensional embedding space. PLGA learns a metric tensor A_LM of the embedding space after applying a custom fully connected layer and iSwiGLU, a positive semi-definite activation function, to the output A of a deep residual network of gated linear units (GLUs) whose input is a density matrix operator derived from the query.
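
The abstract's statement that SDPA corresponds to a predefined identity G_LM, and the idea of reusing a once-inferred G_LM through a G-cache, can be illustrated with a toy sketch. It assumes attention scores take the bilinear form softmax(Q G K^T / sqrt(d)), which is a simplification chosen for illustration and not the PLGA definition from the paper. All names here (`attention_with_operator`, `sdpa`, `GCache`, `infer_g`) are hypothetical, and masking and multiple heads are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def attention_with_operator(q, k, v, g):
    """Assumed illustrative form: softmax(Q G K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(matmul(q, g), transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def sdpa(q, k, v):
    """Plain scaled dot-product attention without masking."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


class GCache:
    """Hypothetical cache: run the expensive inference of G once, then reuse it."""

    def __init__(self, infer_g):
        self._infer_g = infer_g  # stand-in for the deep network that produces G
        self._g = None
        self.calls = 0  # counts how often the stand-in network was run

    def get(self, q):
        if self._g is None:
            self._g = self._infer_g(q)
            self.calls += 1
        return self._g


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    k = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.7]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

    # With G fixed to identity, the operator form matches plain SDPA.
    print(attention_with_operator(q, k, v, identity(2)))
    print(sdpa(q, k, v))

    # The cache calls the stand-in network only on the first request.
    cache = GCache(lambda query: identity(len(query[0])))
    for _ in range(3):
        attention_with_operator(q, k, v, cache.get(q))
    print(cache.calls)
```

In this sketch the cached operator is a constant stand-in. In the setting the abstract describes, the cached tensor would be the G_LM inferred once by the PLGA network.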
Feb-22-2025