ToDMA: Large Model-Driven Token-Domain Multiple Access for Semantic Communications
Li Qiao, Mahdi Boloursaz Mashhadi, Zhen Gao, Robert Schober, Deniz Gündüz
Token communications (TokCom) is an emerging generative semantic communication concept that reduces transmission rates by exploiting context and multimodal large language model (MLLM)-based token processing, with tokens serving as universal semantic units across modalities. In this paper, we propose a semantic multiple access scheme in the token domain, referred to as token-domain multiple access (ToDMA), where a large number of devices share a token codebook and a modulation codebook for source and channel coding, respectively. Specifically, each transmitter first tokenizes its source signal and then modulates each token onto a codeword. At the receiver, compressed sensing is first employed to detect the active tokens and the corresponding channel state information (CSI) from the superposed signals. The source token sequences are then reconstructed by clustering the token-associated CSI across multiple time slots. In the case of token collisions, some active tokens cannot be assigned, leaving positions in the reconstructed token sequences empty. We propose to use pre-trained MLLMs to leverage context, predict the masked tokens, and thus mitigate token collisions. Simulation results demonstrate the effectiveness of the proposed ToDMA framework for both text and image transmission tasks, achieving significantly lower latency than context-unaware orthogonal communication schemes, while also delivering superior distortion and perceptual quality compared to state-of-the-art context-unaware non-orthogonal communication methods.

The rise of multimodal large language models (MLLMs) marks a significant breakthrough in artificial intelligence (AI), combining the strengths of large language models (LLMs) with the ability to process and integrate different modalities of data, such as text, images, video, and audio [2]. MLLMs such as GPT-4 Omni [3], BLIP-2 [4], LLaVA [5], and others can handle tasks that require understanding across modalities, such as generating descriptive captions for images, answering questions based on visual content, or creating high-quality multimodal content.

Part of this work was accepted by the IEEE INFOCOM 2025 Workshop [1]. D. Gündüz is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (email: d.gunduz@imperial.ac.uk).
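To make the described pipeline concrete, below is a minimal NumPy sketch, not taken from the paper, of one ToDMA round: devices share a modulation codebook, their transmissions superpose on the channel, an OMP-style compressed-sensing detector recovers the active codewords and their channel gains per slot, and gain-based clustering assigns tokens to devices, leaving collided positions masked. The codebook sizes, the greedy detector, and the genie-aided gain clustering are illustrative assumptions, not the paper's exact algorithms.

```python
# Toy ToDMA round: shared codebook, superposed transmissions, OMP-style
# detection, and CSI clustering. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

Q = 64          # token codebook size (shared by all devices)
L = 32          # codeword length (one modulation codeword per token)
K = 3           # number of active devices
T = 5           # token-sequence length (time slots)

# Shared modulation codebook: column q is the codeword for token q.
C = rng.standard_normal((L, Q)) / np.sqrt(L)

# Each device's token sequence and (slot-invariant) complex channel gain.
tokens = rng.integers(0, Q, size=(K, T))
h = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# Superposed received signal per slot: y_t = sum_k h_k * c_{token_{k,t}}.
Y = np.zeros((L, T), dtype=complex)
for t in range(T):
    for k in range(K):
        Y[:, t] += h[k] * C[:, tokens[k, t]]

def detect_active_tokens(y, C, n_active):
    """Greedy (OMP-like) detection of active tokens and their gains."""
    residual = y.astype(complex)
    support, gains = [], []
    for _ in range(n_active):
        corr = np.abs(C.T @ residual)              # match all codewords
        q = int(np.argmax(corr))
        g = (C[:, q] @ residual) / (C[:, q] @ C[:, q])
        support.append(q); gains.append(g)
        residual = residual - g * C[:, q]
    return support, gains

# Cluster detected tokens across slots by nearest channel gain. A token
# collision superposes two devices' gains, so the estimate matches neither
# device and the slot stays masked (-1). Genie gains used for illustration.
est = np.full((K, T), -1)
for t in range(T):
    support, gains = detect_active_tokens(Y[:, t], C, n_active=K)
    for q, g in zip(support, gains):
        k = int(np.argmin(np.abs(h - g)))
        if est[k, t] == -1:
            est[k, t] = q

print("true tokens:\n", tokens)
print("estimated (-1 = masked):\n", est)
```

In the full scheme, the remaining masked positions would be treated as masked tokens and predicted from the surrounding context by a pre-trained MLLM, which is what lets ToDMA tolerate collisions instead of retransmitting.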
arXiv.org Artificial Intelligence
Sep-11-2025