Collaborating Authors

 caldera



Compressing Large Language Models using Low Rank and Low Precision Decomposition

Neural Information Processing Systems

Due to the correlated nature of language syntax and semantics learned during training, the weight matrices of LLMs often exhibit redundancy, which manifests as a low-rank structure. This redundancy suggests the potential for compression without substantial loss in performance.


Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition

Cho, Yoonjun, Kim, Soeun, Jeon, Dongjae, Lee, Kyelim, Lee, Beomsoo, No, Albert

arXiv.org Artificial Intelligence

Decomposing weight matrices into quantization and low-rank components ($\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component's unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers' negative impact on quantization, enabling more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
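The core idea — seeding the low-rank factors with the activation-sensitive ("outlier") weights so the quantizer only has to represent the better-behaved remainder — can be sketched as follows. This is an illustrative simplification, not the paper's exact ODLRI procedure: outlier columns are picked by a hypothetical per-column activation-norm score, and the low-rank factors come from a truncated SVD of just those columns.

```python
import numpy as np

def odlri_init_sketch(W, act_norms, k=4, frac=0.05):
    """Illustrative outlier-driven low-rank initialization (simplified).

    W         : weight matrix (m x n)
    act_norms : per-input-column activation norms (n,) -- columns of W that
                multiply high-norm activations are treated as "outliers"
    k         : target rank of the low-rank component L @ R
    frac      : fraction of columns treated as outliers
    """
    n_out = max(1, int(frac * W.shape[1]))
    outlier_cols = np.argsort(act_norms)[-n_out:]   # most activation-sensitive columns
    W_out = np.zeros_like(W)
    W_out[:, outlier_cols] = W[:, outlier_cols]     # activation-sensitive part only
    U, s, Vt = np.linalg.svd(W_out, full_matrices=False)
    L, R = U[:, :k] * s[:k], Vt[:k]                 # L @ R approximates the outlier part
    residual = W - L @ R                            # what the quantizer Q must represent
    return L, R, residual
```

When the number of outlier columns is at most k, `L @ R` captures them exactly, so the residual handed to the quantizer contains no outliers and needs a much smaller quantization scale — the balance the abstract describes.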


Compressing Large Language Models using Low Rank and Low Precision Decomposition

Saha, Rajarshi, Sagan, Naomi, Srivastava, Varun, Goldsmith, Andrea J., Pilanci, Mert

arXiv.org Machine Learning

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models compressed using $\rm CALDERA$ outperform existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: \href{https://github.com/pilancilab/caldera}{https://github.com/pilancilab/caldera}.
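The alternating structure of the optimization above can be sketched in a few lines. This is a toy simplification, not CALDERA itself: the error is measured in the plain Frobenius norm (i.e., taking $\mathbf{X} = \mathbf{I}$ rather than weighting by calibration data), $\mathbf{L}$ and $\mathbf{R}$ are left unquantized, and a uniform scalar quantizer stands in for the paper's low-precision formats.

```python
import numpy as np

def quantize(M, bits=4):
    # Uniform symmetric quantizer -- a simplifying stand-in for the
    # low-precision formats used in the paper.
    levels = 2 ** (bits - 1)
    scale = np.max(np.abs(M)) / levels
    return scale * np.clip(np.round(M / scale), -levels, levels - 1)

def caldera_sketch(W, k=8, bits=4, iters=10):
    # Alternating minimization for W ~= Q + L @ R:
    #   Q-step: quantize the residual W - L @ R
    #   LR-step: best rank-k approximation (via SVD) of W - Q
    m, n = W.shape
    L, R = np.zeros((m, k)), np.zeros((k, n))
    for _ in range(iters):
        Q = quantize(W - L @ R, bits)
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L, R = U[:, :k] * s[:k], Vt[:k]
    return Q, L, R
```

On a weight matrix with an approximately low-rank structure, the joint $\mathbf{Q} + \mathbf{L}\mathbf{R}$ fit leaves a much smaller error than quantizing $\mathbf{W}$ alone, which is the redundancy the method exploits.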


'Warzone 2.0' impressions: Good riddance, Caldera. Hello, Al Mazrah.

Washington Post - Technology News

The next major issue with proximity chat is that you can turn it off. The game's settings do not make it clear what that means exactly, but it probably means that those disabling the feature will not be able to hear other players outside their party, nor be heard by anyone else. If players are somehow able to talk without their comms registering on proximity chat while still hearing other players, that would be a huge advantage for players disabling the feature. There's also the possibility that some PC and Xbox players are using a third-party client such as Discord for their voice chat, which could allow them to eavesdrop without being heard themselves. Much like the third-person mode, which I haven't sampled yet, this might be a feature that warrants its own playlist rather than serving as a toggle bar in the game settings menu.


Tackling Financial Fraud With Machine Learning

#artificialintelligence

They can also be used for financial fraud. Fraudsters can use deepfake technology to trick employees at financial institutions into changing account numbers and initiating money transfer requests for substantial amounts, says Satish Lalchand, principal at Deloitte Transaction and Business Analytics. He notes that these transactions are often difficult, if not impossible, to reverse. Cybercriminals are constantly adopting new techniques to evade know-your-customer verification processes and fraud detection controls. In response, many businesses are exploring ways machine learning (ML) can detect fraudulent transactions involving synthetic media, synthetic identity fraud, or other suspicious behaviors.


'Warzone Pacific' loadout guide for Caldera: The best guns, attachments and perks

Washington Post - Technology News

If you need to get your teammate out of a hairy situation by raining down a deluge of bullets on enemies, then this is the class for you. I went full recoil control on the MG-42, and because of that it handles extremely well, yet is quite slow. Its range and damage are ridiculous, however, which makes it a powerful primary option. Pair that with an SMG and the Vital perk and opposing players will be questioning if "Warzone's" Ricochet anti-cheat software is working. You're not going to be chasing down anyone with this weighty loadout, and the stims can give you a health and speed boost if you get caught in a late rotation or lagging behind the gas.


Scientists warn they have no accurate way to predict when supervolcano explosions could occur

Daily Mail - Science & tech

Volcanologists can predict when volcanoes are going to erupt if they have a detailed record of their past eruptions. But for potentially apocalyptic supervolcanoes, such as the one bubbling under Yellowstone National Park, it's nearly impossible, given how varied their known eruptions have been, according to a new study. Researchers at Cardiff University noted there is not a 'single model' that can help scientists understand how eruptions from supervolcanoes happen, making it difficult to predict when they might occur in the future. The researchers looked at geochemical and petrological evidence from 13 supereruptions that have happened over the past 2 million years, including the most recent one, at the Taupō volcano in New Zealand more than 24,000 years ago. There was no 'single, unified mode' that showed how each of the 13 played out: some started gradually over a period of weeks to months, while others exploded suddenly and violently. The researchers also found that the eruptions lasted for varying times, some as short as days or weeks, while others lasted decades.