Cache


How squirrels actually find all their buried nuts

Popular Science

Every fall, squirrels stash hundreds of acorns to survive the colder winter months--and they use smell, memory, and even theft to get them back. As someone who routinely "hides" things from myself--car keys, receipts, even my phone while I'm actively talking on it--I felt instantly validated by Sarah Silverman's joke that squirrels forget where they bury 80% of their nuts. "And that's how trees are planted!"


Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

Zhang, Hang, Shi, Jiuchen, Wang, Yixiao, Chen, Quan, Shan, Yizhou, Guo, Minyi

arXiv.org Artificial Intelligence

Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high-bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance metrics such as Time-To-First-Token (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system that optimizes serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces the TTFT by 63.4% on average compared to state-of-the-art works.
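
As a rough illustration of the dependency-aware idea, here is a minimal Python sketch; the class names, fields, and toy cost model below are assumptions for illustration, not FASTLIBRA's actual implementation. A single pool holds both LoRA adapters and KV-cache blocks, records which KV blocks were produced under which adapter, and evicts whatever is cheapest to re-fetch when memory runs short.

from collections import defaultdict

class UnifiedCachePool:
    # Toy unified pool for LoRA adapters and KV-cache blocks (illustrative only).

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = {}                    # key -> {"size", "kind", "hotness"}
        self.kv_of_lora = defaultdict(set)   # usage dependency: LoRA key -> its KV block keys

    def put(self, key, size, kind, lora_id=None, hotness=1.0):
        while self.entries and self.used + size > self.capacity:
            self._evict_cheapest()
        self.entries[key] = {"size": size, "kind": kind, "hotness": hotness}
        self.used += size
        if kind == "kv" and lora_id is not None:
            self.kv_of_lora[lora_id].add(key)

    def _refetch_cost(self, key):
        entry = self.entries[key]
        cost = entry["hotness"] * entry["size"]
        if entry["kind"] == "lora":
            # Evicting an adapter also invalidates KV blocks computed under it,
            # so fold their reload cost into the adapter's eviction cost.
            cost += sum(self.entries[k]["size"]
                        for k in self.kv_of_lora.get(key, ()) if k in self.entries)
        return cost

    def _evict_cheapest(self):
        victim = min(self.entries, key=self._refetch_cost)
        entry = self.entries.pop(victim)
        self.used -= entry["size"]
        if entry["kind"] == "lora":
            for k in self.kv_of_lora.pop(victim, set()):
                kv = self.entries.pop(k, None)
                if kv is not None:
                    self.used -= kv["size"]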


CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

Lee, Kun-Hui, Park, Eunhwan, Han, Donghoon, Na, Seung-Hoon

arXiv.org Artificial Intelligence

Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the problem of amplified abnormal token distributions by re-positioning cached keys and introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during pre-filling. Additionally, our Adaptive Positional Allocation Strategy dynamically reassigns cache positions to maximize the use of the available positional encoding range. Experiments on the Natural Questions and TriviaQA datasets demonstrate that CacheFocus outperforms alternative methods even when inputs exceed the 4K limit of the LLaMA-2 model, emphasizing its practical effectiveness for long-context LLMs. Moreover, even with the large maximum input length of Qwen2, CacheFocus maintains consistent performance as the number of documents increases, effectively managing long-text generation without degradation.
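
For intuition about re-positioning cached keys, the following is a small numpy sketch; the function names and the interleaved RoPE layout are assumptions, not the paper's code. Because rotary embeddings compose, keys that were cached offline at local positions 0..L-1 can be shifted to a new slot in the prompt by applying one extra rotation by the position offset.

import numpy as np

def rope_rotate(x, positions, base=10000.0):
    # Apply an interleaved rotary rotation by `positions` to vectors x of shape (L, d).
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,)
    angles = np.outer(positions, inv_freq)                # (L, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def reposition_cached_keys(cached_keys_per_doc, start_positions):
    # Each document's keys were cached offline at local positions 0..L-1; shift
    # them to their new slot in the prompt by one extra rotation by the offset.
    repositioned = []
    for keys, new_start in zip(cached_keys_per_doc, start_positions):
        offsets = np.full(keys.shape[0], new_start)
        repositioned.append(rope_rotate(keys, offsets))
    return np.concatenate(repositioned, axis=0)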


Parallel Key-Value Cache Fusion for Position Invariant RAG

Oh, Philhoon, Shin, Jinwoo, Thorne, James

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) underscore the necessity of Retrieval-Augmented Generation (RAG) to leverage external information. However, LLMs are sensitive to the position of relevant information within contexts and tend to generate incorrect responses when such information is placed in the middle, known as the 'Lost in the Middle' phenomenon. In this paper, we introduce a framework that generates consistent outputs for decoder-only models, irrespective of the input context order. Experimental results on three open-domain question answering tasks demonstrate position invariance, where the model is not sensitive to input context order, and superior robustness to irrelevant passages compared to prevailing approaches for RAG pipelines.
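
The order-invariance intuition can be illustrated with a toy numpy sketch; this is purely illustrative and not the paper's fusion mechanism. If each passage's KV cache is computed independently over the same positional range, attention over the concatenation of those caches gives the same output no matter how the passages are ordered.

import numpy as np

def attend(query, fused_keys, fused_values):
    # Single-query softmax attention over the fused (concatenated) caches.
    scores = fused_keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ fused_values

rng = np.random.default_rng(0)
d = 8
passages = [rng.normal(size=(5, d)) for _ in range(3)]   # toy per-passage caches (keys = values)
query = rng.normal(size=d)

order_a, order_b = [0, 1, 2], [2, 0, 1]
out_a = attend(query, np.concatenate([passages[i] for i in order_a]),
               np.concatenate([passages[i] for i in order_a]))
out_b = attend(query, np.concatenate([passages[i] for i in order_b]),
               np.concatenate([passages[i] for i in order_b]))
assert np.allclose(out_a, out_b)   # same answer regardless of passage order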


Compressed Sensor Caching and Collaborative Sparse Data Recovery with Anchor Alignment

Yang, Yi-Jen, Yang, Ming-Hsun, Wu, Jwo-Yuh, Hong, Y. -W. Peter

arXiv.org Artificial Intelligence

This work examines the compressed sensor caching problem in wireless sensor networks and devises efficient distributed sparse data recovery algorithms to enable collaboration among multiple caches. In this problem, each cache is only allowed to access measurements from a small subset of sensors within its vicinity to reduce both cache size and data acquisition overhead. To enable reliable data recovery with limited access to measurements, we propose a distributed sparse data recovery method, called the collaborative sparse recovery by anchor alignment (CoSR-AA) algorithm, where collaboration among caches is enabled by aligning their locally recovered data at a few anchor nodes. The proposed algorithm is based on the consensus alternating direction method of multipliers (ADMM) algorithm, but with message exchange reduced by the proposed anchor alignment strategy. Then, by deep unfolding of the ADMM iterations, we further propose the Deep CoSR-AA algorithm, which significantly reduces the number of iterations; the result is a graph neural network architecture in which message exchange is performed more efficiently by an embedded autoencoder. Simulations demonstrate the effectiveness of the proposed collaborative recovery algorithms in terms of improved reconstruction quality and reduced communication overhead due to anchor alignment.
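
A simplified Python sketch of the collaboration pattern is below; it is an assumption-heavy toy in which plain ISTA local updates stand in for the paper's consensus-ADMM steps and alignment is a simple average at the anchor indices. Each cache recovers the signal from its own measurements, and only the anchor entries are exchanged between caches, which is the message-reduction idea behind anchor alignment.

import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def collaborative_recovery(As, ys, anchors, lam=0.05, steps=200):
    n = As[0].shape[1]
    xs = [np.zeros(n) for _ in As]
    etas = [1.0 / np.linalg.norm(A, 2) ** 2 for A in As]   # per-cache ISTA step sizes
    for _ in range(steps):
        # Local proximal-gradient (ISTA) step on each cache's own measurements.
        xs = [soft_threshold(x - eta * A.T @ (A @ x - y), lam * eta)
              for x, A, y, eta in zip(xs, As, ys, etas)]
        # Anchor alignment: only the anchor entries are exchanged and averaged.
        anchor_mean = np.mean([x[anchors] for x in xs], axis=0)
        for x in xs:
            x[anchors] = anchor_mean
    return xs

# Usage: three caches observe compressed measurements of the same sparse signal.
rng = np.random.default_rng(1)
n, k = 60, 4
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
As = [rng.normal(size=(25, n)) / np.sqrt(25) for _ in range(3)]
ys = [A @ x_true for A in As]
estimates = collaborative_recovery(As, ys, anchors=np.arange(0, n, 10))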


You Only Cache Once: Decoder-Decoder Architectures for Language Models

Sun, Yutao, Dong, Li, Zhu, Yi, Huang, Shaohan, Wang, Wenhui, Ma, Shuming, Zhang, Quanlu, Wang, Jianyong, Wei, Furu

arXiv.org Artificial Intelligence

We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available at https://aka.ms/YOCO.
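
A back-of-the-envelope sketch of the memory effect follows; the shapes and layer counts are placeholder assumptions, not YOCO's published configuration. A decoder-only Transformer keeps a KV cache per layer, while a YOCO-style decoder-decoder keeps one global KV cache that all cross-decoder layers reuse.

def kv_bytes(tokens, cached_layers, kv_heads, head_dim, bytes_per_elem=2):
    # Two tensors (K and V) per cached layer, fp16/bf16 elements by default.
    return 2 * tokens * cached_layers * kv_heads * head_dim * bytes_per_elem

tokens, layers, kv_heads, head_dim = 1_000_000, 32, 8, 128

# Decoder-only Transformer: every layer keeps its own KV cache for the full context.
decoder_only = kv_bytes(tokens, layers, kv_heads, head_dim)

# YOCO-style decoder-decoder: one global KV cache shared by all cross-decoder layers
# (the self-decoder's small local cache is ignored in this toy count).
yoco_like = kv_bytes(tokens, 1, kv_heads, head_dim)

print(f"decoder-only KV cache: {decoder_only / 2**30:.1f} GiB")
print(f"YOCO-style global KV cache: {yoco_like / 2**30:.1f} GiB")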


Alfresco Repository Caches Unfolded

#artificialintelligence

Optimisation of the Alfresco repository caches can have a significant impact on the performance of your Alfresco deployment. This post provides an overview of how the repository caches are implemented by Alfresco. The Alfresco repository both leverages and provides in-memory caches. Memory caching (often simply referred to as caching) is a technique in which computer applications temporarily store data in a computer's main memory (i.e., random access memory, or RAM) to enable fast retrieval of that data. The RAM used for this temporary storage is known as the cache.
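
As a generic illustration of the memory-caching pattern described above (not Alfresco's actual cache classes or API), a minimal Python LRU cache looks like the following: recently used values stay in RAM, and lookups only fall back to the slow backend on a miss.

from collections import OrderedDict

class LRUCache:
    # Keep recently used values in RAM; hit the slow backend only on a miss.
    def __init__(self, max_items=1024):
        self.max_items = max_items
        self._data = OrderedDict()

    def get(self, key, load_fn):
        if key in self._data:
            self._data.move_to_end(key)      # cache hit: mark as most recently used
            return self._data[key]
        value = load_fn(key)                 # cache miss: load from the repository/database
        self._data[key] = value
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)   # evict the least recently used entry
        return value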


Research Paves the Way for Honey-Based Neuromorphic Computing

#artificialintelligence

Researchers at Washington State University have built a proof-of-concept device that includes one of the crucial circuits for neuromorphic computing, the memristor, made from an unlikely medium: honey. The researchers hope their work paves the way for biodegradable, sustainable, organic-based computing systems that are orders of magnitude more efficient than conventional computing architectures. To build the device, the researchers processed true, bee-sourced honey into a solid form held between two metal electrodes, much like how the synapses in your brain lie between pairs of neurons. The device was then tested for its ability to quickly switch on and off at speeds of 100 to 500 nanoseconds, comparable to its biological counterparts, and it succeeded. "This is a very small device with a simple structure, but it has very similar functionalities to a human neuron," said Feng Zhao, associate professor in WSU's School of Engineering and Computer Science, in the announcement.


Birds get angry when their favourite snacks are swapped in magic trick

New Scientist

Jays react angrily when shown a cup-and-balls-style magic trick in which their favourite snack is swapped for a less appealing one. Their responses show cognitive abilities that may come into play when they pilfer food caches hidden by other birds. Eurasian jays (Garrulus glandarius) have impressive memories and show some capacity for imagining the beliefs and intentions of others, known as theory of mind. As such, Alexandra Schnell and her colleagues at the University of Cambridge wondered whether jays would be sensitive to cognitive illusions designed to fool humans. First, they tested six birds to find out which food each one preferred from a choice of worms, cheese and peanuts.


Applying the Roofline model for Deep Learning performance optimizations

Czaja, Jacek, Gallus, Michal, Wozna, Joanna, Grygielski, Adam, Tao, Luo

arXiv.org Artificial Intelligence

In this paper, we present a methodology for automatically creating Roofline models for Non-Uniform Memory Access (NUMA) systems, using Intel Xeon processors as an example. Finally, we present an evaluation of highly efficient deep learning primitives as implemented in the Intel oneDNN library.
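
For readers unfamiliar with the model, here is a minimal Python sketch of the Roofline calculation; the peak-compute and bandwidth figures are placeholders, not measured Xeon or per-NUMA-node numbers. Attainable performance is the lesser of the machine's peak compute and what the memory system can feed at a kernel's arithmetic intensity.

def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    # Roofline: performance is capped by compute or by memory bandwidth,
    # whichever bound binds at the kernel's arithmetic intensity (FLOPs/byte).
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)

peak_gflops = 3000.0    # hypothetical per-socket peak compute
bandwidth_gbs = 100.0   # hypothetical DRAM bandwidth of one NUMA node

for name, ai in [("blocked convolution", 40.0), ("elementwise ReLU", 0.25)]:
    roof = attainable_gflops(ai, peak_gflops, bandwidth_gbs)
    bound = "compute" if ai * bandwidth_gbs >= peak_gflops else "memory"
    print(f"{name}: roofline {roof:.0f} GFLOP/s ({bound}-bound)")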