Probing the topology of the space of tokens with structured prompts
Robinson, Michael, Dey, Sourya, Kushner, Taisa
arXiv.org Artificial Intelligence
The set of tokens T, when embedded within the latent space X of a large language model (LLM), can be thought of as a finite sample drawn from a distribution supported on a topological subspace of X. One can ask which smallest (in the sense of inclusion) subspace and simplest (in the sense of fewest free parameters) distribution can account for such a sample. Previous work [1] suggests that the smallest topological subspace from which tokens can be drawn is not a manifold, but has structure consistent with a stratified manifold. That paper relied on knowing the token input embedding function T → X, which assigns to each token t ∈ T a representation in X. Because embeddings preserve topological structure, in this paper we study T by equating it with the image of the token input embedding function, thereby treating T both as the set of tokens and as a subspace of X. This subspace is called the token subspace of X. Usually X is taken to be Euclidean space R
Mar-19-2025
- Country:
- Europe > Spain
- Catalonia > Barcelona Province > Barcelona (0.04)
- North America > United States
- District of Columbia > Washington (0.04)
- New York (0.04)
- Virginia > Arlington County
- Arlington (0.04)
- Genre:
- Research Report (0.52)
- Industry:
- Government (0.33)
- Technology: