Goto

Collaborating Authors

 Kisilev, Pavel


Rethinking Data: Towards Better Performing Domain-Specific Small Language Models

arXiv.org Artificial Intelligence

Fine-tuning of Large Language Models (LLMs) for downstream tasks, performed on domain-specific data has shown significant promise. However, commercial use of such LLMs is limited by the high computational cost required for their deployment at scale. On the other hand, small Language Models (LMs) are much more cost effective but have subpar performance in a similar setup. This paper presents our approach to finetuning a small LM, that reaches high accuracy in multiple choice question answering task. We achieve this by improving data quality at each stage of the LM training pipeline. In particular, we start with data structuring resulting in extraction of compact, semantically meaningful text chunks used by a retriever. This allows more efficient knowledge digestion by the LM. Further, we improve the retrieved context by training a lightweight Chunk Re-Ranker (CRR) that generates more accurate relative relevance chunk scores. Finally, we improve the model generalization ability by merging the models fine-tuned with different parameters on different data subsets. We present detailed procedure descriptions, and corresponding experimental findings that show the improvements of each one of the proposed techniques.


Adaptive Consensus Gradients Aggregation for Scaled Distributed Training

arXiv.org Artificial Intelligence

Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints. While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. Our method demonstrates improved performance over the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communicational and computational complexity.


Primal-Dual Sequential Subspace Optimization for Saddle-point Problems

arXiv.org Machine Learning

We introduce a new sequential subspace optimization method for large-scale saddle-point problems. It solves iteratively a sequence of auxiliary saddle-point problems in low-dimensional subspaces, spanned by directions derived from first-order information over the primal \emph{and} dual variables. Proximal regularization is further deployed to stabilize the optimization process. Experimental results demonstrate significantly better convergence relative to popular first-order methods. We analyze the influence of the subspace on the convergence of the algorithm, and assess its performance in various deterministic optimization scenarios, such as bi-linear games, ADMM-based constrained optimization and generative adversarial networks.


Low-bit Quantization of Neural Networks for Efficient Inference

arXiv.org Machine Learning

Recent breakthrough methods in machine learning make use of increasingly large deep neural networks. The gains in performance have come at the cost of a substantial increase in computation and storage, making real-time implementation on limited hardware a very challenging task. One popular approach to address this challenge is to perform low-bit precision computations via neural network quantization. However, aggressive quantization generally entails a severe penalty in terms of accuracy and usually requires the retraining of the network or resorts to higher bit precision quantization. In this paper, we formalize the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations. This allows low-bit precision inference without the need for full network retraining. The main contributions of our approach is the optimization of the constrained MSE problem at each layer of the network, the hardware aware partitioning of the neural network parameters, and the use of multiple low precision quantized tensors for poorly approximated layers. The proposed approach allows for the first time a linear 4 bits integer precision (INT4) quantization for deployment of pretrained models on limited hardware resources.


Blind Source Separation via Multinode Sparse Representation

Neural Information Processing Systems

We consider a problem of blind source separation from a set of instantaneous linear mixtures, where the mixing matrix is unknown. It was discovered recently, that exploiting the sparsity of sources in an appropriate representation according to some signal dictionary, dramatically improves the quality of separation. In this work we use the property of multi scale transforms, such as wavelet or wavelet packets, to decompose signals into sets of local features with various degrees of sparsity. We use this intrinsic property for selecting the best (most sparse) subsets of features for further separation. The performance of the algorithm is verified on noise-free and noisy data. Experiments with simulated signals, musical sounds and images demonstrate significant improvement of separation quality over previously reported results.


Blind Source Separation via Multinode Sparse Representation

Neural Information Processing Systems

We consider a problem of blind source separation from a set of instantaneous linearmixtures, where the mixing matrix is unknown. It was discovered recently, that exploiting the sparsity of sources in an appropriate representationaccording to some signal dictionary, dramatically improves the quality of separation. In this work we use the property of multi scale transforms, such as wavelet or wavelet packets, to decompose signals into sets of local features with various degrees of sparsity. We use this intrinsic property for selecting the best (most sparse) subsets of features for further separation. The performance of the algorithm is verified onnoise-free and noisy data. Experiments with simulated signals, musical sounds and images demonstrate significant improvement of separation qualityover previously reported results. 1 Introduction