LLM Pretraining with Continuous Concepts
Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, Xian Li
arXiv.org Artificial Intelligence
Recent progress in large language models (LLMs) has revolutionized natural language processing (Brown et al., 2020; Dubey et al., 2024), and LLMs have thus become a core technology in various real-world applications, such as coding assistants (Roziere et al., 2023), search engines (Xuan-Quy et al., 2023), and personal AI assistants (Gao et al., 2023). Central to these breakthroughs is the simple paradigm of next token prediction, which leverages massive amounts of unlabeled text to uncover rich linguistic patterns (Radford et al., 2018, 2019). However, natural language tokens are often superficial (e.g., function words like "the" or "a"), so models require substantial training to acquire high-level reasoning and conceptual understanding, and their ability to tackle long-horizon tasks such as planning is hindered (LeCun, 2022; Bachmann and Nagarajan, 2024). To tackle this issue, recent studies have investigated methods that go beyond token-level signals by leveraging richer information to train models. For instance, some approaches target more expressive prediction objectives, such as predicting multiple tokens at once to better capture semantic relationships (Gloeckle et al., 2024; DeepSeek-AI, 2024), while others augment the input with rich signals, e.g., self-generated thought tokens (Zelikman et al., 2024) or fixed pause tokens (Goyal et al., 2024), prior to next token prediction. Moreover, emerging evidence suggests that LLMs inherently encode high-level concepts and reasoning processes in their latent representations (Deng et al., 2023; Yang et al., 2024), indicating that replacing discrete language tokens with continuous latent representations holds promise for improving reasoning efficiency (Hao et al., 2024). While token-level modeling remains important for coherent text generation, the key challenge is to enrich or supplement these natural language tokens so that LLMs can learn more abstract reasoning abilities and long-range dependencies. This raises a key question: can we augment the next token prediction objective to explicitly model concepts in a latent representation space, thereby bridging semantic abstraction and fine-grained token-level guidance? To this end, we draw inspiration from recent findings that Sparse Autoencoders (SAEs) can effectively isolate meaningful latent features in LLMs by capturing the high-level semantic concepts (Cunningham et al., 2023;
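To make the idea of augmenting next token prediction with latent concepts concrete, the following is a minimal PyTorch-style sketch of one possible training loss: standard next-token cross-entropy combined with a concept-prediction term whose targets are the most salient SAE features of the hidden states. This is an illustrative sketch under stated assumptions, not the paper's exact method; the names `model`, `sae.encode`, and `model.concept_head` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


def pretraining_loss(model, sae, input_ids, concept_weight=0.1, top_k=32):
    """Sketch: next-token prediction plus a latent concept-prediction loss.

    Assumed (hypothetical) interfaces:
      - model(input_ids) -> (logits, hidden) with hidden of shape (B, T, d_model)
      - sae.encode(hidden) -> sparse concept activations of shape (B, T, n_concepts)
      - model.concept_head: linear layer mapping hidden states to concept logits
    """
    logits, hidden = model(input_ids)

    # Standard next-token cross-entropy (shift targets by one position).
    ntp_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

    # Concept targets: the top-k most active SAE features per position,
    # treated as a multi-hot "concept" label (no gradient through the SAE).
    with torch.no_grad():
        concept_acts = sae.encode(hidden)                     # (B, T, n_concepts)
        top_idx = concept_acts.topk(top_k, dim=-1).indices    # salient concepts

    # Predict those concepts from the model's own hidden states.
    concept_logits = model.concept_head(hidden)               # (B, T, n_concepts)
    target_multi_hot = torch.zeros_like(concept_logits).scatter_(-1, top_idx, 1.0)
    concept_loss = F.binary_cross_entropy_with_logits(concept_logits, target_multi_hot)

    return ntp_loss + concept_weight * concept_loss
```

In this sketch the SAE acts only as a labeler for which latent concepts are active at each position; the weighting between the two terms (`concept_weight`) and the choice of `top_k` are illustrative hyperparameters, not values reported by the paper.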
Feb-12-2025