AITopics | token generation

Collaborating Authors

token generation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Streaming Attention Approximation via Discrepancy Theory

Neural Information Processing SystemsJun-11-2026, 00:04:22 GMT

Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. In this paper we study the streaming complexity of attention approximation, a key computational primitive underlying token generation. Our main contribution is BalanceKV, a streaming algorithm for $\epsilon$-approximating attention computations based on geometric process for selecting a balanced collection of Key and Value tokens as per Banaszczyk's vector balancing theory. We complement our algorithm with space lower bounds for streaming attention computation. Besides strong theoretical guarantees, BalanceKV exhibits empirically validated performance improvements over existing methods, both for attention approximation and end-to-end performance on various long context benchmarks.

artificial intelligence, large language model, natural language, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.62)

Add feedback

Words That Make Language Models Perceive

Wang, Sophie L., Isola, Phillip, Cheung, Brian

arXiv.org Artificial IntelligenceOct-6-2025

Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.02425

Country:

Asia (1.00)
Europe (0.93)
North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Sports > Baseball (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R

Guerrero, Pablo Robin, Pan, Yueyang, Kashyap, Sanidhya

arXiv.org Artificial IntelligenceJul-15-2025

Vision-Language Models (VLMs) offer promising capabilities for mobile devices, but their deployment faces significant challenges due to computational limitations and energy inefficiency, especially for real-time applications. This study provides a comprehensive survey of deployment frameworks for VLMs on mobile devices, evaluating llama.cpp, MLC-Imp, and mllm in the context of running LLaVA-1.5 7B, MobileVLM-3B, and Imp-v1.5 3B as representative workloads on a OnePlus 13R. Each deployment framework was evaluated on the OnePlus 13R while running VLMs, with measurements covering CPU, GPU, and NPU utilization, temperature, inference time, power consumption, and user experience. Benchmarking revealed critical performance bottlenecks across frameworks: CPU resources were consistently over-utilized during token generation, while GPU and NPU accelerators were largely unused. When the GPU was used, primarily for image feature extraction, it was saturated, leading to degraded device responsiveness. The study contributes framework-level benchmarks, practical profiling tools, and an in-depth analysis of hardware utilization bottlenecks, highlighting the consistent overuse of CPUs and the ineffective or unstable use of GPUs and NPUs in current deployment frameworks.

artificial intelligence, llama, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.08505

Genre: Research Report (1.00)

Industry: Energy (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Hardware (1.00)
Information Technology > Communications > Mobile (1.00)
(2 more...)

Add feedback

Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference

Chen, Chun-Ting, Mun, HanGyeol, Meng, Jian, Abdelfattah, Mohamed S., Seo, Jae-sun

arXiv.org Artificial IntelligenceJul-15-2025

Edge inference for large language models (LLM) offers secure, low-latency, and cost-effective inference solutions. We emphasize that an edge accelerator should achieve high area efficiency and minimize external memory access (EMA) during the memory-bound decode stage, while maintaining high energy efficiency during the compute intensive prefill stage. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array (HSA) architecture that optimizes inference efficiency in both stages. To further reduce EMA, we adopt MXINT4 weight quantization and propose an optimized dataflow tailored for HSA, ensuring negligible dequantization overhead and achieving 100% hardware utilization with minimal accuracy loss under edge DRAM bandwidth constraints. For non-linear operations, we incorporate optimized root mean square normalization (RMSNorm) and rotary position embedding (RoPE) units, reducing their latency, area, and memory access overhead while enabling end-to-end inference on our accelerator. Our solution achieves 247/117 (token/s/mm2) while running a 1.3B LLM on long-input/long-output scenarios, providing >2.45x/13.5x improvement over existing approaches, while maintaining superior energy efficiency in token generation.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.0901

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Splitwiser: Efficient LM inference with constrained resources

Aali, Asad, Cardoza, Adney, Capo, Melissa

arXiv.org Artificial IntelligenceMay-8-2025

--Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. T o address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. Generative Large Language Models (LLMs) have become essential in computing, offering vast capabilities in natural language processing. However, their widespread adoption has led to challenges, particularly in inference efficiency.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.03763

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > San Diego County > Carlsbad (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Fast Autoregressive Models for Continuous Latent Generation

Hang, Tiankai, Bao, Jianmin, Wei, Fangyun, Chen, Dong

arXiv.org Artificial IntelligenceApr-28-2025

Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.18391

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Compression Barriers for Autoregressive Transformers

Haris, Themistoklis, Onak, Krzysztof

arXiv.org Artificial IntelligenceFeb-21-2025

A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d = \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Linderstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d\cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation's time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.15955

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Asynchronous LLM Function Calling

Gim, In, Lee, Seung-seob, Zhong, Lin

arXiv.org Artificial IntelligenceDec-9-2024

Large language models (LLMs) use function calls to interface with external tools and data source. However, the current approach to LLM function calling is inherently synchronous, where each call blocks LLM inference, limiting LLM operation and concurrent function execution. In this work, we propose AsyncLM, a system for asynchronous LLM function calling. AsyncLM improves LLM's operational efficiency by enabling LLMs to generate and execute function calls concurrently. Instead of waiting for each call's completion, AsyncLM introduces an interrupt mechanism to asynchronously notify the LLM in-flight when function calls return. We design an in-context protocol for function calls and interrupts, provide fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently on LLM inference process. We demonstrate that AsyncLM can reduce end-to-end task completion latency from 1.6x-5.4x compared to synchronous function calling on a set of benchmark tasks in the Berkeley function calling leaderboard (BFCL). Furthermore, we discuss how interrupt mechanisms can be extended to enable novel human-LLM or LLM-LLM interactions.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.07017

Country: North America > United States > Connecticut > New Haven County > New Haven (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology (0.67)
Transportation > Air (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Counterfactual Token Generation in Large Language Models

Chatzi, Ivi, Benz, Nina Corvelo, Straitouri, Eleni, Tsirtsis, Stratis, Gomez-Rodriguez, Manuel

arXiv.org Artificial IntelligenceNov-6-2024

"Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-Instruct and Ministral-8B-Instruct and conduct a qualitative and a quantitative analysis of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.

counterfactual token generation, sequence, token generation, (15 more...)

arXiv.org Artificial Intelligence

2409.17027

Country:

North America > United States > Alaska (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area (0.67)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level

Zeng, Xinyi, Shang, Yuying, Zhu, Yutao, Chen, Jiawei, Tian, Yu

arXiv.org Artificial IntelligenceOct-9-2024

Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful responses from the prefill-level lacks utilization of the model's decoding outputs, leading to relatively lower effectiveness and robustness. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Our novel decoder-oriented, step-bystep defense architecture corrects harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost secure decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model's ability to discern hazardous information, maintaining its helpfulness compared to existing methods. In recent years, significant progress has been made in developing large language models (LLMs). Meanwhile, the safety of LLMs has attracted significant attention from the research community and industry (Weidinger et al., 2021; Achiam et al., 2023; Wu et al., 2023b). One of the primary safety concerns is jailbreaking, where malicious actors or errant inputs prompt LLMs to produce harmful or inappropriate content, effectively bypassing ethical guidelines.

arxiv preprint arxiv, llm, query, (14 more...)

arXiv.org Artificial Intelligence

2410.06809

Country:

North America > United States (0.04)
Asia > China > Beijing > Beijing (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback