time
Transformer in Transformer
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (\eg, 16$\times$16) as "visual sentences" and present to further divide them into smaller patches (\eg, 4$\times$4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, \eg, we achieve an 81.5\% top-1 accuracy on the ImageNet, which is about 1.7\% higher than that of the state-of-the-art visual transformer with similar computational cost.
Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators
Optimizing neural networks with loss that contain high-dimensional and high-order differential operators is expensive to evaluate with back-propagation due to $\mathcal{O}(d^{k})$ scaling of the derivative tensor size and the $\mathcal{O}(2^{k-1}L)$ scaling in the computation graph, where $d$ is the dimension of the domain, $L$ is the number of ops in the forward computation graph, and $k$ is the derivative order. In previous works, the polynomial scaling in $d$ was addressed by amortizing the computation over the optimization process via randomization. Separately, the exponential scaling in $k$ for univariate functions ($d=1$) was addressed with high-order auto-differentiation (AD). In this work, we show how to efficiently perform arbitrary contraction of the derivative tensor of arbitrary order for multivariate functions, by properly constructing the input tangents to univariate high-order AD, which can be used to efficiently randomize any differential operator. When applied to Physics-Informed Neural Networks (PINNs), our method provides > 1000$\times$ speed-up and > 30$\times$ memory reduction over randomization with first-order AD, and we can now solve 1-million-dimensional PDEs in 8 minutes on a single NVIDIA A100 GPU. This work opens the possibility of using high-order differential operators in large-scale problems.
BAN: Detecting Backdoors Activated by Adversarial Neuron Noise
Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient with an average 9.99\% higher detect success rate than the state-of-the-art defense BTI DBF. Our code and trained models are publicly available at https://github.com/xiaoyunxxy/ban.
3 Common Misunderstandings About AI in 2025
Children and parked cars are color-coded on a monitor inside a Mercedes-Benz S-Class during an autonomous driving and AI demonstration in Immendingen, Germany on July 17, 2018. Children and parked cars are color-coded on a monitor inside a Mercedes-Benz S-Class during an autonomous driving and AI demonstration in Immendingen, Germany on July 17, 2018. In 2025, misconceptions about AI flourished as people struggled to make sense of the rapid development and adoption of the technology. Here are three popular ones to leave behind in the New Year. When GPT-5 was released in May, people wondered (not for the first time) if AI was hitting a wall.
- Europe > Germany (0.46)
- North America > United States > New York (0.05)
- North America > United States > California > San Francisco County > San Francisco (0.05)
- (3 more...)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks > Manufacturer (0.97)
3 New Tricks to Try With Google Gemini Live After Its Latest Major Upgrade
Google's AI is now even smarter, and more versatile. Gemini Live is the more conversational, natural language way of interacting with the Google Gemini AI bot using your voice. The idea is you chat with it like you would chat with a friend, interruptions and all, even if the actual answers are the same as you'd get from typing your queries into Gemini as normal. Now, about a year and a half after its debut, Gemini Live has been given what Google is describing as its "biggest update ever." The update makes the Gemini Live mode even more natural and even more conversational than before, with a better understanding of tone, nuance, pronunciation, and rhythm.
- North America > United States > California (0.15)
- Europe > United Kingdom (0.05)
- Europe > Slovakia (0.05)
- (3 more...)
- Information Technology (0.48)
- Media (0.48)
Four of the Strangest AI Moments in 2025
Pillay is an editorial fellow at TIME. Albania's new AI-generated minister Diella speaks during the parliamentary session for the voting of the new government of Albania, in Tirana on Sept. 18, 2025. Albania's new AI-generated minister Diella speaks during the parliamentary session for the voting of the new government of Albania, in Tirana on Sept. 18, 2025. Pillay is an editorial fellow at TIME. It's been three years since the launch of ChatGPT gave hundreds of millions of people access to a kind of digital genie in their pocket--and things have been getting stranger by the month. Besides billions of AI-generated emails and the technology's widespread disruption of education and cognitive work, in 2025, some people began to fall in love with their AIs.
- Europe > Albania > Tirana County > Tirana (0.46)
- North America > United States (0.05)
- Europe > Hungary (0.05)
- (4 more...)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for repetitive computations and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between the adjacent layers in the middle-to-deep portion of LLMs.
Sequoia: Scalable and Robust Speculative Decoding
As the usage of large language models (LLMs) grows, it becomes increasingly important to serve them quickly and efficiently. While speculative decoding has recently emerged as a promising direction for accelerating LLM serving, existing methods are limited in their ability to scale to larger speculation budgets and adapt to different hyperparameters. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. To improve scalability, Sequoia introduces a dynamic programming algorithm to find an optimal tree structure for the speculated tokens. To achieve robust speculative decoding, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to $4.04\times$, $3.73\times$, and $2.27 \times$. To serve Llama3-70B-Instruct on a single L40 GPU through offloading, Sequoia reduces the per-token decoding latency to 0.60 s/token, $9.5\times$ faster than DeepSpeed-Zero-Inference.
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Large Language Models (LLMs) are widely used in today's tasks of natural language processing. To support applications like multi-turn chats, document understanding, and content generation, models with long context lengths are growing in importance.However, managing long contexts brings substantial challenges due to the expansion of key-value cache (KV cache). Longer KV cache requires larger memory, limiting the batch-size thus decreasing throughput. Also, computing attention over long KV cache incurs more memory access, hurting the end-to-end latency.Prior works find that it is sufficient to use only the recent and high-impact tokens for attention computation, allowing the eviction of less vital tokens to shrink cache size.Nonetheless, we observe a dynamic shift in token importance across different decoding steps. Tokens initially evicted might regain importance after certain decoding steps.To address this, we propose ArkVale, a page-based KV cache manager that can recognize and recall currently important tokens evicted before. We asynchronously copy the filled page into external memory (e.g., CPU memory) as backup and summarize it into a much smaller digest by constructing the bounding-volume of its keys. Before attention computation, we measure all pages' importance based on their digests, recall the important ones, evict the unimportant ones, and select the top-ranked pages for attention computation. Experiment results show that ArkVale performs well on various long context tasks with negligible accuracy loss under 2k$\sim$4k cache budget and can improve decoding latency to $2.2\times$ and batching throughput to $4.6\times$ because it applies attention on only a small subset of pages and reduce per-sample memory usage of KV cache.
PrivCirNet: Efficient Private Inference via Block Circulant Transformation
Homomorphic encryption (HE)-based deep neural network (DNN) inference protects data and model privacy but suffers from significant computation overhead. We observe transforming the DNN weights into circulant matrices converts general matrix-vector multiplications into HE-friendly 1-dimensional convolutions, drastically reducing the HE computation cost. Hence, in this paper, we propose PrivCirNet, a protocol/network co-optimization framework based on block circulant transformation. At the protocol level, PrivCirNet customizes the HE encoding algorithm that is fully compatible with the block circulant transformation and reduces the computation latency in proportion to the block size. At the network level, we propose a latency-aware formulation to search for the layer-wise block size assignment based on second-order information.