Scalable methods for 8-bit training of neural networks
Banner, Ron, Hubara, Itay, Hoffer, Elad, Soudry, Daniel
Quantized Neural Networks (QNNs) are often used to improve network efficiency during the inference phase, i.e. after the network has been trained. Extensive research in the field suggests many different quantization schemes. Still, the number of bits required, as well as the best quantization scheme, remain unknown. Our theoretical analysis suggests that most of the training process is robust to substantial precision reduction, and points to only a few specific operations that require higher precision. Additionally, as QNNs require batch-normalization to be trained at high precision, we introduce Range Batch-Normalization (BN), which has significantly higher tolerance to quantization noise and improved computational complexity.
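Below is a minimal NumPy sketch of the range-based normalization idea: rather than computing a variance (a sum of squares that is fragile at low precision), normalize by the range of the centered batch. The scale constant 1/sqrt(2 ln n) and the epsilon placement are our reading of the method, not code from the paper.

```python
import numpy as np

def range_batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Range BN sketch: normalize each feature by the range of the
    centered batch instead of its standard deviation."""
    n = x.shape[0]
    centered = x - x.mean(axis=0)
    # For Gaussian data the expected max of n samples is about
    # sigma * sqrt(2 * ln n), so the scaled range tracks sigma up to a
    # constant factor that the learnable gamma can absorb. Crucially,
    # no low-precision-unfriendly sum of squares is needed.
    c_n = 1.0 / np.sqrt(2.0 * np.log(n))
    scale = c_n * (centered.max(axis=0) - centered.min(axis=0))
    return gamma * centered / (scale + eps) + beta

batch = np.random.randn(256, 64).astype(np.float32)
out = range_batch_norm(batch)
print(float(out.mean()), float(out.std()))  # near 0 mean, roughly unit scale
```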
Scalable Methods for Nonnegative Matrix Factorizations of Near-separable Tall-and-skinny Matrices
Numerous algorithms are used for nonnegative matrix factorization under the assumption that the matrix is nearly separable. In this paper, we show how to make these algorithms scalable for data matrices that have many more rows than columns, so-called tall-and-skinny matrices. One key component of these improved methods is an orthogonal matrix transformation that preserves the separability of the NMF problem. Our final methods need to read the data matrix only once and are suitable for streaming, multi-core, and MapReduce architectures. We demonstrate the efficacy of these algorithms on terabyte-sized matrices from scientific computing and bioinformatics.
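As a rough illustration of the abstract's key component, the sketch below replaces a tall-and-skinny near-separable matrix by the small R factor of a QR factorization (an orthogonal transformation, so the conic geometry of the columns, and hence separability, is preserved) and then selects anchor columns from R alone. The Successive Projection Algorithm (SPA) used here is one common near-separable NMF routine, chosen purely for illustration.

```python
import numpy as np

def spa(M, r):
    """Successive Projection Algorithm: greedily pick r 'extreme' columns."""
    R = M.astype(np.float64)
    cols = []
    for _ in range(r):
        j = int(np.argmax((R * R).sum(axis=0)))  # column with largest norm
        cols.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                  # project out that direction
    return cols

# Synthetic tall-and-skinny near-separable matrix A = W @ H, where the
# first r columns of A are the anchors (H begins with the identity).
rng = np.random.default_rng(0)
n, m, r = 50_000, 20, 4
W = np.abs(rng.standard_normal((n, r)))
H = np.hstack([np.eye(r), rng.dirichlet(np.ones(r), m - r).T])
A = W @ H

# Orthogonal reduction: only the small m x m factor R is needed, since
# A = QR and multiplying by Q^T preserves the column geometry.
Rfac = np.linalg.qr(A, mode="r")
print(sorted(spa(Rfac, r)))  # should recover the anchor columns [0, 1, 2, 3]
```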
Reviews: Scalable methods for 8-bit training of neural networks
This is interesting given that most existing works are based on 16-bit training, and people have had difficulty training 8-bit models. The paper identifies that the training difficulty comes from batch norm, and it proposes a variant called Range Batch Norm, which alleviates the numerical instability the original batch norm exhibits with quantized models. With this simple modification, the paper shows that an 8-bit model can be trained easily using GEMMLOWP, an existing framework. The paper also analyzes the proposed approach theoretically. The experiments support the paper's argument well.
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Kramár, János, Lieberum, Tom, Shah, Rohin, Nanda, Neel
As LLMs become ubiquitous and integrated into numerous digital applications, it is an increasingly pressing research problem to understand the internal mechanisms that underlie their behaviour: this is the problem of mechanistic interpretability. A fundamental subproblem is to causally attribute particular behaviours to individual parts of the transformer forward pass, corresponding to specific components (such as attention heads, neurons, layer contributions, or residual streams), often at specific positions in the input token sequence. This is important because numerous case studies find complex behaviours to be driven by sparse subgraphs within the model (Meng et al., 2023; Olsson et al., 2022; Wang et al., 2022). A classic form of causal attribution uses zero-ablation, or knock-out, where a component is deleted and we see if this negatively affects a model's output; a negative effect implies the component was causally important. More recent work has generalised this to replacing a component's activations with samples from some baseline distribution (with zero-ablation being a special case where activations are resampled to be zero). We focus on the popular and widely used method of Activation Patching (also known as causal mediation analysis) (Chan et al., 2022; Geiger et al., 2022; Meng et al., 2023), where the baseline distribution is a component's activations on some corrupted input, such as an alternate string with a different answer (Pearl, 2001; Robins and Greenland, 1992). Given a causal attribution method, it is common to sweep across all model components, directly evaluating the effect of intervening on each of them via resampling (Meng et al., 2023). However, when working with SoTA models it can be expensive to attribute behaviour, especially to small components (e.g. …)
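A minimal PyTorch sketch of the Activation Patching primitive described above (not the AtP* approximation itself): cache each component's activations on a corrupted input, splice them one at a time into a clean forward pass, and read off the effect on the output. The toy model and the choice of linear layers as "components" are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: the "components" are just named layers.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),
)
clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)
components = [(n, m) for n, m in model.named_modules()
              if isinstance(m, nn.Linear)]

# Cache each component's activations on the corrupted input.
corrupt_cache = {}
def save_hook(name):
    def hook(module, inputs, output):
        corrupt_cache[name] = output.detach()
    return hook

handles = [m.register_forward_hook(save_hook(n)) for n, m in components]
model(corrupted)
for h in handles:
    h.remove()

clean_out = model(clean)

# Activation patching: rerun on the clean input, overwriting one
# component's output with its corrupted-run value (returning a tensor
# from a forward hook replaces the module's output); a large change in
# the final output marks that component as causally important.
for name, module in components:
    handle = module.register_forward_hook(
        lambda m, i, o, n=name: corrupt_cache[n])
    patched_out = model(clean)
    handle.remove()
    print(f"{name}: |patched - clean| = {(patched_out - clean_out).norm():.4f}")
```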
HyperPPO: A scalable method for finding small policies for robotic control
Hegde, Shashank, Huang, Zhehui, Sukhatme, Gaurav S.
Models with fewer parameters are necessary for the neural control of memory-limited, performant robots. Finding these smaller neural network architectures can be time-consuming. We propose HyperPPO, an on-policy reinforcement learning algorithm that utilizes graph hypernetworks to estimate the weights of multiple neural architectures simultaneously. Our method estimates weights for networks that are much smaller than commonly used networks, yet encode highly performant policies. We obtain multiple trained policies at the same time while maintaining sample efficiency, and give the user the choice of a network architecture that satisfies their computational constraints. We show that our method scales well: more training resources produce faster convergence to higher-performing architectures. We demonstrate that the neural policies estimated by HyperPPO are capable of decentralized control of a Crazyflie2.1 quadrotor. Website: https://sites.google.com/usc.edu/hyperppo
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)
- Information Technology > Artificial Intelligence > Systems & Languages > Problem-Independent Architectures (0.53)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)
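As a toy illustration of the weight-generation idea in the abstract above, the sketch below has a single hypernetwork emit the parameters of one-hidden-layer policies at several candidate widths, sharing one output head and slicing per architecture. This is a plain MLP hypernetwork, not the graph hypernetwork HyperPPO actually uses, and it omits the PPO training loop; all sizes and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS, ACT, WIDTHS = 4, 2, [8, 16, 32]  # candidate hidden sizes (assumption)

class HyperNet(nn.Module):
    """Maps an architecture embedding to the weights of a one-hidden-layer
    policy MLP of that width; one network produces all candidate policies."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(len(WIDTHS), 16)
        wmax = max(WIDTHS)
        self.head = nn.Linear(16, wmax * OBS + wmax + ACT * wmax + ACT)

    def forward(self, arch_id):
        w, wmax = WIDTHS[arch_id], max(WIDTHS)
        flat = self.head(self.embed(torch.tensor(arch_id)))
        # Slice the flat vector into the target network's tensors, using
        # only the first w hidden units for the smaller architectures.
        i = 0
        W1 = flat[i:i + wmax * OBS].view(wmax, OBS)[:w]; i += wmax * OBS
        b1 = flat[i:i + wmax][:w]; i += wmax
        W2 = flat[i:i + ACT * wmax].view(ACT, wmax)[:, :w]; i += ACT * wmax
        b2 = flat[i:i + ACT]
        return W1, b1, W2, b2

def policy(obs, params):
    W1, b1, W2, b2 = params
    return torch.tanh(obs @ W1.T + b1) @ W2.T + b2

hyper = HyperNet()
obs = torch.randn(5, OBS)
for arch_id, w in enumerate(WIDTHS):
    logits = policy(obs, hyper(arch_id))
    print(f"width {w}: action logits shape {tuple(logits.shape)}")
```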
A Scalable Method for Scheduling Distributed Energy Resources using Parallelized Population-based Metaheuristics
Khalloof, Hatem, Jakob, Wilfried, Shahoud, Shadi, Duepmeier, Clemens, Hagenmeyer, Veit
Recent years have seen an increasing integration of distributed renewable energy resources into existing electric power grids. Due to the uncertain nature of renewable energy resources, network operators are faced with new challenges in balancing load and generation. In order to meet the new requirements, intelligent distributed energy resource plants can be used. However, calculating an adequate schedule for the unit commitment of such distributed energy resources is a complex optimization problem that typically exceeds the capabilities of standard optimization algorithms when large numbers of distributed energy resources are considered. For such complex optimization tasks, population-based metaheuristics, such as evolutionary algorithms, represent powerful alternatives. Admittedly, evolutionary algorithms require considerable computational power to solve such problems in a timely manner. One promising remedy is to parallelize the usually time-consuming evaluation of alternative solutions. In the present paper, a new generic and highly scalable parallel method for unit commitment of distributed energy resources using metaheuristic algorithms is presented. It is based on microservices, container virtualization and the publish/subscribe messaging paradigm. Scalability and applicability of the proposed solution are evaluated by performing parallelized optimizations in a big data environment for three distinct distributed energy resource scheduling scenarios. Unlike other optimization methods in the literature, the new method provides cluster or cloud parallelizability and is able to handle a comparably large number of distributed energy resources. Applying the new method yields very good scaling of optimization speed.
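The performance idea at the core of this abstract, parallelizing the expensive fitness evaluations of a population-based metaheuristic, can be sketched with a simple truncation-selection evolutionary loop and a process pool standing in for the paper's microservice/publish-subscribe workers. The toy objective (match an aggregate load profile under per-unit power limits) is our illustrative assumption, not the paper's scheduling model.

```python
import numpy as np
from multiprocessing import Pool

HOURS, N_DER = 24, 50
rng = np.random.default_rng(0)
LOAD = 100 + 40 * np.sin(np.linspace(0, 2 * np.pi, HOURS))  # toy load profile
P_MAX = rng.uniform(1, 5, N_DER)                            # per-unit limits

def fitness(schedule):
    """Penalize mismatch between aggregate DER generation and the load.
    Stands in for the expensive evaluation the paper parallelizes."""
    gen = schedule.sum(axis=0)          # schedule: (N_DER, HOURS) power levels
    return float(((gen - LOAD) ** 2).sum())

def mutate(schedule):
    child = schedule + rng.normal(0, 0.2, schedule.shape)
    return np.clip(child, 0, P_MAX[:, None])

if __name__ == "__main__":
    pop = [rng.uniform(0, 1, (N_DER, HOURS)) * P_MAX[:, None]
           for _ in range(64)]
    with Pool() as pool:                       # process pool in place of the
        for _ in range(50):                    # paper's distributed workers
            scores = pool.map(fitness, pop)    # evaluations run in parallel
            ranked = [p for _, p in
                      sorted(zip(scores, pop), key=lambda t: t[0])]
            elite = ranked[:16]                # truncation selection
            pop = elite + [mutate(e) for e in elite for _ in range(3)]
        print("best mismatch:", min(pool.map(fitness, pop)))
```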