PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

Feb-22-2025–arXiv.org Artificial Intelligence

February 18, 2025

ABSTRACT

We show that the Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enables the once-inferred energy-curvature tensor G_LM to replace the deep neural network of power law graph attention (PLGA) generating the deductive outputs at inference. We demonstrate that a cache for G_LM (G-cache) and a KV-cache can be implemented in a straightforward manner to improve the inference time. The deductive outputs are invariant and generalizable at very high fidelity: they have the same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have distinct loss and accuracy characteristics from models pretrained with transferred, randomly initialized or identity tensors as a constant tensor operator, and that an LLM with scaled dot-product attention (SDPA) is a special case of PLDR-LLM where G_LM is predefined as identity. The observed invariance characteristic introduces a novel asymmetry between training and inference phases with caching. We outline observed common characteristics of the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.

1 Introduction

Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a novel language model architecture with well-defined deductive and inductive outputs [Gokden, 2024]. It is composed of deep layers of decoders with multi-headed Power Law Graph Attention (PLGA) [Gokden, 2021, 2019]. The deductive outputs are intended to observe and regularize the model, while the inductive output is the next-token prediction of a language model.
PLGA is a series of non-linear and linear transformations that attend to an input sentence, which can be considered as a weighted graph G = (V, E) whose nodes are the tokens, densely represented by an N-dimensional embedding space. The PLGA learns a metric tensor A_LM of the embedding space after applying a custom fully connected layer and iSwiGLU, a positive semi-definite activation function, to the output A of a deep residual network of gated linear units (GLUs) whose input is a density matrix operator derived from the query.
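
The two claims above, that SDPA is the special case where G_LM is identity and that a once-inferred G_LM can be cached and reused at inference, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the attention logits take the form Q G K^T / sqrt(d), which this excerpt does not state, and it omits causal masking and multiple heads. The names `attention_with_operator`, `GCache`, and `infer_g` are hypothetical, and `infer_g` stands in for the PLGA network.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def sdpa(q, k, v):
    """Plain scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(q[0]))
    logits = [[x / scale for x in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(logits), v)


def attention_with_operator(q, k, v, g):
    """ASSUMED form: softmax(Q G K^T / sqrt(d)) V, with G a d-by-d operator."""
    scale = math.sqrt(len(q[0]))
    logits = [[x / scale for x in row] for row in matmul(matmul(q, g), transpose(k))]
    return matmul(softmax_rows(logits), v)


class GCache:
    """Hypothetical G-cache: infer the operator once, then reuse it."""

    def __init__(self, infer_g):
        self._infer_g = infer_g  # stand-in for the network that produces G
        self._g = None
        self.inference_calls = 0

    def get(self, q):
        if self._g is None:
            self._g = self._infer_g(q)
            self.inference_calls += 1
        return self._g


# Toy inputs: two tokens with two-dimensional embeddings.
q = [[1.0, 0.0], [0.5, 0.5]]
k = [[0.2, 0.1], [0.4, 0.3]]
v = [[1.0, 2.0], [3.0, 4.0]]

cache = GCache(lambda _q: [[2.0, 0.0], [0.0, 0.5]])  # fixed toy operator
first = attention_with_operator(q, k, v, cache.get(q))
second = attention_with_operator(q, k, v, cache.get(q))  # reuses the cached G
```

With `identity(2)` passed as `g`, `attention_with_operator` returns the same values as `sdpa`. The cache is only safe to reuse across steps if the operator is invariant, as the abstract reports for the learned deductive outputs.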

deductive output, generalizable tensor operator, pldr-llm, (12 more...)


Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America > United States
  - Oregon > Multnomah County
    - Portland (0.04)
  - California > San Francisco County
    - San Francisco (0.04)
- Europe
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > China
  - Hong Kong (0.04)

Genre:
- Research Report (0.64)

Industry:
- Leisure & Entertainment (0.68)
- Media > Film (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
