Amiri, Mohammad Mohammadi
Sparse Gradient Compression for Fine-Tuning Large Language Models
Yang, David H., Amiri, Mohammad Mohammadi, Pedapati, Tejaswini, Chaudhury, Subhajit, Chen, Pin-Yu
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. However, the high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. To address this, parameter-efficient fine-tuning (PEFT) methods have been proposed to minimize the number of parameters required for fine-tuning LLMs. However, these approaches often tie the number of optimizer states to the dimensions of the model parameters, limiting flexibility and control during fine-tuning. In this paper, we propose sparse gradient compression (SGC), a training regime designed to address these limitations. Our approach leverages the inherent sparsity of gradients to compress optimizer states by projecting them onto a low-dimensional subspace whose dimensionality is independent of the original model's parameters. By enabling optimizer state updates in an arbitrary low-dimensional subspace, SGC offers a flexible tradeoff between memory efficiency and performance. We demonstrate through experiments that SGC reduces optimizer state memory more effectively than existing PEFT methods. Furthermore, by fine-tuning LLMs on various downstream tasks, we show that SGC delivers superior performance while substantially lowering optimizer state memory requirements, particularly in both data-limited and memory-limited settings.
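A minimal sketch of the idea described in the abstract, not the authors' implementation: keep only the largest gradient coordinates, project that sparse gradient onto a fixed low-dimensional subspace, maintain Adam-style optimizer states there, and map the update back to parameter space. The projection matrix, the names `subspace_dim` and `topk`, and the choice of a random projection are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgc_step(param, grad, state, proj, topk, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative update: sparsify the gradient, compress it into a
    low-dimensional subspace, run Adam-style moments there, then map the
    resulting update back to the full parameter space."""
    # Keep only the top-k gradient entries (the inherent-sparsity assumption).
    sparse = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -topk)[-topk:]
    sparse[idx] = grad[idx]
    # Compress: project the sparse gradient onto the low-dimensional subspace.
    g_low = proj.T @ sparse
    # Adam-style moment updates kept entirely in the subspace (memory is O(r), not O(d)).
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g_low
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g_low ** 2
    update_low = state["m"] / (np.sqrt(state["v"]) + eps)
    # Decompress the update and apply it to the full parameter vector.
    return param - lr * (proj @ update_low), state

d, subspace_dim, topk = 10_000, 64, 100
proj = rng.standard_normal((d, subspace_dim)) / np.sqrt(subspace_dim)  # random projection (assumed)
param = rng.standard_normal(d)
state = {"m": np.zeros(subspace_dim), "v": np.zeros(subspace_dim)}
grad = rng.standard_normal(d)
param, state = sgc_step(param, grad, state, proj, topk)
```

The point of the sketch is the memory accounting: the optimizer states live in the chosen subspace, so their size is set by `subspace_dim` rather than by the model dimension.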
Data Measurements for Decentralized Data Markets
Lu, Charles, Amiri, Mohammad Mohammadi, Raskar, Ramesh
Decentralized data markets can provide more equitable forms of data acquisition for machine learning. However, realizing practical marketplaces requires efficient techniques for seller selection. We propose and benchmark federated data measurements that allow a data buyer to find sellers with relevant and diverse datasets. Diversity and relevance measures enable a buyer to make relative comparisons between sellers without requiring intermediate brokers or training task-dependent models.
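A hypothetical sketch of such measurements, assuming the buyer and sellers exchange only feature statistics rather than raw data: relevance as the similarity between a seller's mean feature vector and the buyer's query vector, and diversity as the spread of the seller's features. The specific formulas and names (`relevance`, `diversity`) are illustrative, not the paper's definitions.

```python
import numpy as np

def relevance(seller_feats, buyer_query):
    """Cosine similarity between the seller's mean feature vector and the
    buyer's query vector; higher means the seller's data is more on-task."""
    mu = seller_feats.mean(axis=0)
    denom = np.linalg.norm(mu) * np.linalg.norm(buyer_query) + 1e-12
    return float(mu @ buyer_query / denom)

def diversity(seller_feats):
    """Log-determinant of the seller's feature covariance (a volume-style
    spread measure); higher means the seller covers more of feature space."""
    cov = np.cov(seller_feats, rowvar=False) + 1e-6 * np.eye(seller_feats.shape[1])
    _, logdet = np.linalg.slogdet(cov)
    return float(logdet)

rng = np.random.default_rng(1)
buyer_query = rng.standard_normal(16)
sellers = {f"seller_{i}": rng.standard_normal((200, 16)) * (i + 1) for i in range(3)}
scores = {name: (relevance(f, buyer_query), diversity(f)) for name, f in sellers.items()}
# The buyer ranks sellers by these scores without brokers or task-dependent training.
print(scores)
```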
Federated Learning with Downlink Device Selection
Amiri, Mohammad Mohammadi, Kulkarni, Sanjeev R., Poor, H. Vincent
We study federated edge learning, where a global model is trained collaboratively using privacy-sensitive data at the edge of a wireless network. A parameter server (PS) keeps track of the global model and shares it with the wireless edge devices for training using their private local data. The devices then transmit their local model updates, which are used to update the global model, to the PS. The algorithm, which involves transmission over PS-to-device and device-to-PS links, continues until the global model converges or no devices remain to participate. In this study, we consider device selection based on the downlink channels over which the PS shares the global model with the devices. Assuming digital downlink transmission, we design a partial device participation framework in which a subset of the devices is selected for training at each iteration. Owing to the shared nature of the broadcast channel, the participating devices can then obtain a better estimate of the global model than under full device participation, at the price of updating the global model with respect to a smaller set of data. At each iteration, the PS broadcasts different quantized global model updates to different participating devices based on the last global model estimates available at those devices. We investigate the best number of participating devices through experimental results for image classification on the MNIST dataset with a biased data distribution.
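A toy sketch of one downlink round under assumptions not fixed by the abstract: rank devices by downlink channel gain, select the strongest subset, and send each selected device a uniformly quantized update relative to the last model estimate that device holds. The Rayleigh gains, the quantizer, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(x, bits):
    """Uniform scalar quantizer standing in for digital downlink transmission."""
    scale = np.max(np.abs(x)) + 1e-12
    levels = 2 ** (bits - 1)
    return np.round(x / scale * levels) / levels * scale

n_devices, d, n_select, bits = 10, 1_000, 4, 4
global_model = rng.standard_normal(d)
device_estimates = [np.zeros(d) for _ in range(n_devices)]  # last model copy at each device

# Selection: keep the devices with the strongest downlink channels this round.
channel_gains = rng.rayleigh(size=n_devices)
selected = np.argsort(channel_gains)[-n_select:]

for i in selected:
    # Each selected device receives a quantized update relative to its own stale
    # estimate, so different devices get different broadcast messages.
    update = quantize(global_model - device_estimates[i], bits)
    device_estimates[i] = device_estimates[i] + update

# Only the selected devices then train locally on their refreshed model copies.
```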
Blind Federated Edge Learning
Amiri, Mohammad Mohammadi, Duman, Tolga M., Gunduz, Deniz, Kulkarni, Sanjeev R., Poor, H. Vincent
We study federated edge learning (FEEL), where wireless edge devices, each with its own dataset, learn a global model collaboratively with the help of a wireless access point acting as the parameter server (PS). At each iteration, wireless devices perform local updates using their local data and the most recent global model received from the PS, and send their local updates to the PS over a wireless fading multiple access channel (MAC). The PS then updates the global model according to the signal received over the wireless MAC, and shares it with the devices. Motivated by the additive nature of the wireless MAC, we propose an analog `over-the-air' aggregation scheme, in which the devices transmit their local updates in an uncoded fashion. Unlike recent literature on over-the-air edge learning, here we assume that the devices do not have channel state information (CSI), while the PS has imperfect CSI. Instead, the PS is equipped with multiple antennas to alleviate the destructive effect of the channel, exacerbated by the lack of perfect CSI. We design a receive beamforming scheme at the PS, and show that it can compensate for the lack of perfect CSI when the PS has a sufficient number of antennas. We also derive the convergence rate of the proposed algorithm, highlighting the impact of the lack of perfect CSI as well as of the number of PS antennas. Both the experimental results and the convergence analysis show that the performance of the proposed algorithm improves with the number of PS antennas, and that the wireless fading MAC becomes deterministic despite the lack of perfect CSI when the number of PS antennas is sufficiently large.
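A simplified numerical sketch of the over-the-air idea, not the paper's exact beamforming design: devices transmit one coordinate of their updates uncoded without CSI, the transmissions superimpose at each PS antenna over a fading MAC, and the PS combines its per-antenna observations using imperfect channel estimates. Averaging over many antennas makes the recovered sum concentrate around the true sum of updates, which is the effect the abstract describes. The CSI-error model, noise level, and combiner are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def over_the_air_round(updates, n_antennas, csi_error=0.1, noise_std=0.1):
    """Devices send scalar updates uncoded over a fading MAC; the PS combines
    the per-antenna observations with imperfect channel estimates and averages
    over antennas to estimate the sum of the updates."""
    n_devices = len(updates)
    h = (rng.standard_normal((n_devices, n_antennas)) +
         1j * rng.standard_normal((n_devices, n_antennas))) / np.sqrt(2)
    h_hat = h + csi_error * (rng.standard_normal(h.shape) +
                             1j * rng.standard_normal(h.shape)) / np.sqrt(2)
    noise = noise_std * (rng.standard_normal(n_antennas) +
                         1j * rng.standard_normal(n_antennas)) / np.sqrt(2)
    # Uncoded signals superimpose over the air at each PS antenna.
    y = updates @ h + noise
    # Matched-filter-style combining with the PS's channel estimates, averaged
    # across antennas; cross-device interference and noise average out.
    combiner = h_hat.sum(axis=0)
    return float(np.real(np.conj(combiner) * y).mean())

updates = np.array([1.0, -0.5, 2.0, 0.25])   # one coordinate of each device's update
true_sum = updates.sum()
for m in (4, 64, 1024):
    est = over_the_air_round(updates, m)
    print(f"antennas={m:5d}  estimate={est:+.3f}  target={true_sum:+.3f}")
```

As the number of antennas grows, the estimate tightens around the target sum even though the channel estimates are noisy, mirroring the "MAC becomes deterministic" behavior stated above.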
Convergence of Federated Learning over a Noisy Downlink
Amiri, Mohammad Mohammadi, Gunduz, Deniz, Kulkarni, Sanjeev R., Poor, H. Vincent
We study federated learning (FL), where power-limited wireless devices utilize their local datasets to collaboratively train a global model with the help of a remote parameter server (PS). The PS has access to the global model and shares it with the devices for local training, and the devices return the results of their local updates to the PS to update the global model. This framework requires downlink transmission from the PS to the devices and uplink transmission from the devices to the PS. The goal of this study is to investigate the impact of the bandwidth-limited shared wireless medium, in both the downlink and the uplink, on the performance of FL, with a focus on the downlink. To this end, the downlink and uplink channels are modeled as fading broadcast and multiple access channels, respectively, both with limited bandwidth. For downlink transmission, we first introduce a digital approach, in which a quantization technique is employed at the PS to broadcast the global model update at a common rate such that all the devices can decode it. Next, we propose analog downlink transmission, in which the global model is broadcast by the PS in an uncoded manner. We consider analog transmission over the uplink in both cases. We further analyze the convergence behavior of the proposed analog approach assuming that the uplink transmission is error-free. Numerical experiments show that the analog downlink approach provides significant improvement over the digital one, despite a significantly lower transmit power at the PS. The experimental results corroborate the convergence analysis, and show that a smaller number of local iterations should be used when the data distribution is more biased, and also when the devices have a better estimate of the global model in the analog downlink approach.
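A small sketch contrasting the two downlink mechanisms under toy assumptions, without reproducing the paper's comparison: the digital path quantizes the global model update at a common rate that every device decodes identically, while the analog path broadcasts the uncoded update over a fading downlink that each device observes through its own channel and equalizes. The fading model, noise level, and bit budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def digital_downlink(update, bits):
    """Quantize the global model update at a common rate; every device
    recovers the same (distorted) copy."""
    scale = np.max(np.abs(update)) + 1e-12
    levels = 2 ** (bits - 1)
    return np.round(update / scale * levels) / levels * scale

def analog_downlink(update, n_devices, noise_std=0.2):
    """Broadcast the update uncoded; each device sees it through its own
    fading gain (assumed known at the device) plus receiver noise, so each
    device ends up with a different noisy estimate."""
    received = []
    for _ in range(n_devices):
        h = rng.rayleigh()  # device's downlink fading gain; a deep fade means a noisier copy
        y = h * update + noise_std * rng.standard_normal(update.shape)
        received.append(y / h)  # device equalizes with its own channel gain
    return received

update = rng.standard_normal(512)
digital_copy = digital_downlink(update, bits=3)
analog_copies = analog_downlink(update, n_devices=5)
print("digital distortion:", float(np.mean((digital_copy - update) ** 2)))
print("analog  distortion:", float(np.mean([(y - update) ** 2 for y in analog_copies])))
```

The printed distortions depend entirely on the toy settings; the sketch is only meant to show where the two approaches differ, namely a shared common-rate quantized copy versus per-device noisy analog estimates.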