Large Language Model
D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models
Large language models have shown an impressive societal impact owing to their excellent understanding and logical reasoning skills. However, such strong ability relies on a huge amount of computing resources, which makes it difficult to deploy LLMs on computing resource-constrained platforms. Currently, LLMs process each token equivalently, but we argue that not every word is equally important. Some words should not be allocated excessive computing resources, particularly for dispensable terms in simple questions. In this paper, we propose a novel dynamic inference paradigm for LLMs, namely D-LLMs, which adaptively allocate computing resources in token processing. We design a dynamic decision module for each transformer layer that decides whether a network unit should be executed or skipped. Moreover, we tackle the issue of adapting D-LLMs to real-world applications, specifically concerning the missing KV-cache when layers are skipped. To overcome this, we propose a simple yet effective eviction policy to exclude the skipped layers from subsequent attention calculations. The eviction policy not only enables D-LLMs to be compatible with prevalent applications but also reduces considerable storage resources.
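The skip-and-evict idea in this abstract can be illustrated with a toy bookkeeping sketch. It is not the paper's implementation: the function `run_with_layer_skipping`, the `toy_gate` rule, and the list-based cache are all hypothetical stand-ins, and a real decision module would be a learned component inside each transformer layer. The sketch only tracks which tokens leave a KV entry at which layer, so that a skipped token is simply absent from later attention at that layer.

```python
from typing import Callable, List, Tuple


def run_with_layer_skipping(
    num_tokens: int,
    num_layers: int,
    should_execute: Callable[[int, int], bool],
) -> Tuple[List[List[int]], int]:
    """Track which token positions write a KV entry at each layer.

    should_execute(token, layer) is a hypothetical stand-in for the
    per-layer decision module described in the abstract.
    """
    # kv_cache[layer] holds the positions of tokens that ran that layer.
    kv_cache: List[List[int]] = [[] for _ in range(num_layers)]
    executed = 0
    for token in range(num_tokens):
        for layer in range(num_layers):
            if should_execute(token, layer):
                # Executed: the token would attend over kv_cache[layer]
                # (earlier tokens that also ran this layer) and then
                # append its own key/value entry.
                kv_cache[layer].append(token)
                executed += 1
            # Skipped: nothing is written, so this token is excluded
            # from every later attention computation at this layer.
    return kv_cache, executed


def toy_gate(token: int, layer: int) -> bool:
    """Assumed toy rule: run a layer only when token + layer is even."""
    return (token + layer) % 2 == 0


cache, executed = run_with_layer_skipping(4, 2, toy_gate)
print(cache)     # [[0, 2], [1, 3]]
print(executed)  # 4 of the 8 possible (token, layer) executions
```

In this toy run, half of the layer executions are skipped and the cache stores half as many entries, which mirrors the abstract's point that eviction also reduces storage.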
From shrimp Jesus to erotic tractors: how viral AI slop took over the internet
Clockwise from top left: Shrimp Jesus, Nayib Bukele, Justin Bieber and Super Cat League. In the algorithm-driven economy of 2025, one man's shrimp Jesus is another man's side hustle. AI slop - the low-quality, surreal content flooding social media platforms, designed to farm views - is a phenomenon, some would say the phenomenon of the 2024 and 2025 internet. Merriam-Webster's word of the year this year is "slop", referring exclusively to the internet variety.
Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models
In this paper, we present DSA, the first automated framework for discovering sparsity allocation schemes for layer-wise pruning in Large Language Models (LLMs). LLMs have become increasingly powerful, but their large parameter counts make them computationally expensive. Existing pruning methods for compressing LLMs primarily focus on evaluating redundancies and removing element-wise weights. However, these methods fail to allocate adaptive layer-wise sparsities, leading to performance degradation in challenging tasks. We observe that per-layer importance statistics can serve as allocation indications, but their effectiveness depends on the allocation function between layers.
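To make the notion of an allocation function concrete, here is a minimal sketch of one hand-written candidate. It is an assumption for illustration only, not a function discovered by DSA: `allocate_sparsity`, the min-max normalization, and the `alpha` spread parameter are all hypothetical. It maps per-layer importance scores to per-layer sparsity ratios so that more important layers are pruned less while the average sparsity stays at the target.

```python
from typing import List


def allocate_sparsity(
    importance: List[float], target: float, alpha: float = 0.1
) -> List[float]:
    """Map per-layer importance to per-layer sparsity (hypothetical rule).

    Layers with above-average importance get sparsity below `target`,
    and the mean over layers equals `target` by construction.
    No clamping to [0, 1] is applied, so alpha must be chosen small.
    """
    lo, hi = min(importance), max(importance)
    span = (hi - lo) or 1.0  # avoid division by zero for equal scores
    norm = [(x - lo) / span for x in importance]
    mean = sum(norm) / len(norm)
    return [target - alpha * (n - mean) for n in norm]


ratios = allocate_sparsity([1.0, 2.0, 3.0], target=0.5, alpha=0.1)
print([round(r, 3) for r in ratios])
```

A search-based framework of the kind the abstract describes would explore many such functions rather than fix one by hand; this linear rule is just the simplest member of that family.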