Probing the topology of the space of tokens with structured prompts

Robinson, Michael; Dey, Sourya; Kushner, Taisa

Mar-19-2025–arXiv.org Artificial Intelligence

The set of tokens T, when embedded within the latent space X of a large language model (LLM), can be thought of as a finite sample drawn from a distribution supported on a topological subspace of X. One can ask which smallest (in the sense of inclusion) subspace and simplest (in terms of fewest free parameters) distribution can account for such a sample. Previous work[1] suggests that the smallest topological subspace from which tokens can be drawn is not a manifold, but has structure consistent with a stratified manifold. That paper relied upon knowing the token input embedding function T → X, which assigns to each token t ∈ T a representation in X. Because embeddings preserve topological structure, in this paper we study T by equating it with the image of the token input embedding function, thereby treating T both as the set of tokens and as a subspace of X. This subspace is called the token subspace of X. Usually X is taken to be Euclidean space R
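To make the idea of a point cloud sampled from a non-manifold subspace concrete, the following is a minimal sketch and not the authors' method. It assumes a synthetic stand-in for token embeddings (a flat 2-D square with a 1-D "whisker" attached, sitting in 3-D space) and a hypothetical helper, local_dimension, that estimates dimension near a point from how the number of neighbors grows with radius. A subspace of this kind has a different local dimension in different places, which is the sort of behavior a stratified structure allows and a manifold does not.

```python
import math

def local_dimension(points, center, r_small, r_large):
    """Hypothetical helper: estimate the local dimension at `center`.

    In a d-dimensional region the number of sample points within radius r
    grows roughly like r**d, so d ~ log(N(r_large)/N(r_small)) / log(r_large/r_small).
    """
    def count_within(radius):
        # Count sample points in the ball, excluding the center itself.
        return sum(1 for p in points if 0 < math.dist(p, center) <= radius)

    n_small = count_within(r_small)
    n_large = count_within(r_large)
    return math.log(n_large / n_small) / math.log(r_large / r_small)

# Synthetic stand-in for a token point cloud (an assumption, not real embeddings):
# a 2-D square grid in the plane z = 0, plus a 1-D whisker rising from its middle.
h = 0.05
square = [(i * h, j * h, 0.0) for i in range(41) for j in range(41)]
whisker = [(1.0, 1.0, k * h) for k in range(1, 41)]
points = square + whisker

# Query one point inside the square and one point halfway up the whisker.
dim_square = local_dimension(points, (10 * h, 10 * h, 0.0), 0.22, 0.44)
dim_whisker = local_dimension(points, (1.0, 1.0, 20 * h), 0.22, 0.44)

print(f"estimated dimension on the square:  {dim_square:.2f}")
print(f"estimated dimension on the whisker: {dim_whisker:.2f}")
```

The two estimates differ, one close to 2 and one close to 1, so no single dimension describes the whole sample. With real token embeddings the points would come from the model's embedding matrix, and the radii would need to be chosen with care, since the estimate is sensitive to sampling density.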
