 Large Language Model





You Only Cache Once: Decoder-Decoder Architectures for Language Models

Neural Information Processing Systems

However, as the number of serving tokens increases, the key-value (KV) caches occupy a lot of GPU memory, rendering the inference of large language models memory-bounded [29].



HonestLLM: Toward an Honest and Helpful Large Language Model

Neural Information Processing Systems

To mitigate this gap, we first refine and extend the definition of honesty in LLMs based on the definition proposed by Askell et al. [14], as the ability



QuIP: 2-Bit Quantization of Large Language Models With Guarantees

Neural Information Processing Systems

We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the