switchback

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

Xi, Haocheng, Chen, Yuxiang, Zhao, Kang, Zheng, Kaijun, Chen, Jianfei, Zhu, Jun

arXiv.org Artificial Intelligence

Pretraining transformers is generally time-consuming. Fully quantized training (FQT) is a promising approach to speed up pretraining. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal speedup and significant performance degradation when used in transformers due to the high memory access overheads and low-precision computations. In this work, we propose Jetfire, an efficient and accurate INT8 training method specific to transformers. Our method features an INT8 data flow to optimize memory access and a per-block quantization method to maintain the accuracy of pretrained transformers. Extensive experiments demonstrate that our INT8 FQT method achieves comparable accuracy to the FP16 training baseline and outperforms the existing INT8 training works for transformers. Moreover, for a standard transformer block, our method offers an end-to-end training speedup of 1.42x and a 1.49x memory reduction compared to the FP16 baseline.
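The per-block quantization idea in this abstract can be illustrated with a minimal NumPy sketch: each tile of the matrix gets its own INT8 scale, so quantization error stays local to the tile rather than being dominated by the tensor-wide maximum. The block size of 32 and the round-to-nearest scheme are assumptions for illustration, not details from the paper.

```python
import numpy as np

def quantize_per_block(x, block=32):
    """Quantize a 2-D float matrix to INT8 with one scale per (block x block) tile."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0
    q = np.empty_like(x, dtype=np.int8)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = x[i:i + block, j:j + block]
            # Scale so the tile's max magnitude maps to 127; guard all-zero tiles.
            s = max(np.abs(tile).max() / 127.0, 1e-8)
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.clip(
                np.round(tile / s), -127, 127).astype(np.int8)
    return q, scales

def dequantize_per_block(q, scales, block=32):
    """Invert the quantization: rescale each INT8 tile by its own scale."""
    x = q.astype(np.float32)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            x[i * block:(i + 1) * block, j * block:(j + 1) * block] *= scales[i, j]
    return x
```

The round trip bounds the per-element error by half the local tile's scale step, which is what makes the per-block scheme more accurate than a single per-tensor scale when magnitudes vary across the matrix.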


Stable and low-precision training for large-scale vision-language models

Wortsman, Mitchell, Dettmers, Tim, Zettlemoyer, Luke, Morcos, Ari, Farhadi, Ali, Schmidt, Ludwig

arXiv.org Artificial Intelligence

We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.
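The int8 quantized matmul at the heart of a SwitchBack-style linear layer can be simulated in NumPy as below. This is an illustrative sketch with assumed per-tensor scaling, not the paper's implementation; the actual layer's distinguishing feature (keeping the weight-gradient matmul in 16-bit on the backward pass) is not shown here.

```python
import numpy as np

def int8_matmul_sim(a, w):
    """Simulated int8 matmul: quantize both operands per-tensor to int8,
    accumulate in int32, then rescale the result back to float."""
    sa = max(np.abs(a).max() / 127.0, 1e-8)  # per-tensor scale for activations
    sw = max(np.abs(w).max() / 127.0, 1e-8)  # per-tensor scale for weights
    qa = np.clip(np.round(a / sa), -127, 127).astype(np.int8)
    qw = np.clip(np.round(w / sw), -127, 127).astype(np.int8)
    acc = qa.astype(np.int32) @ qw.astype(np.int32)  # int32 accumulation
    return acc.astype(np.float32) * (sa * sw)
```

On hardware with int8 tensor cores the quantized matmul is what buys the 13-25% speed-up; the simulation above only reproduces the numerics, which is also how the abstract's float8 analysis-by-simulation works in spirit.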


ESA's Solar Orbiter records a mysterious magnetic switchback

Daily Mail - Science & tech

The European Space Agency's Solar Orbiter spacecraft has captured the reversal of the Sun's magnetic field on camera for the first time. These reversals, known as magnetic switchbacks, had previously been hypothesised but never observed directly. The new observation provides a full view of the structure and confirms that magnetic switchbacks have an S-shaped character. ESA hopes the footage will help unravel how switchbacks form and whether their formation mechanism helps accelerate the solar wind.


An Oral History of the 2004 Darpa Grand Challenge

WIRED

On March 13, 2004, a gaggle of engineers and a few thousand spectators congregated outside a California dive bar to watch 15 self-driving cars speed across the Mojave Desert in the first-ever Darpa Grand Challenge. Before the start of the race, which marked the first big push toward a fully autonomous vehicle, the grounds surrounding the bar teemed with sweaty, stressed, sleep-deprived geeks, desperately tinkering with their motley assortment of driverless Frankencars: SUVs, dune buggies, monster trucks, even a motorcycle. After the race, they left behind a vehicular graveyard littered with smashed fence posts, messes of barbed wire, and at least one empty fire extinguisher. What happened in between--the rush out of the starting gate, the switchbacks across the rocky terrain, the many, many crashes--didn't just hint at the possibilities and potential limitations of autonomous vehicles that auto and tech companies are facing and that consumers will experience in the coming years as driverless vehicles swarm the roads. It created the self-driving community as we know it today, the men and women in too-big polo shirts who would go on to dominate an automotive revolution. In 2001, eager to keep soldiers away from harm in combat zones, the US Congress demanded that a third of the military's ground combat vehicles be uncrewed by 2015. But defense industry stalwarts weren't innovating quickly enough on the sensor and computing technologies that would enable autonomous driving.


Faster Optimal and Suboptimal Hierarchical Search

Leighton, Michael J. (University of New Hampshire) | Ruml, Wheeler (University of New Hampshire) | Holte, Robert C. (University of Alberta)

AAAI Conferences

In problem domains for which an informed admissible heuristic function is not available, one attractive approach is hierarchical search. Hierarchical search uses search in an abstracted version of the problem to dynamically generate heuristic values. This paper makes two contributions to hierarchical search. First, we propose a simple modification to the state-of-the-art algorithm Switchback that reduces the number of expansions (and hence the running time) by approximately half, while maintaining its guarantee of optimality. Second, we propose a new algorithm for suboptimal hierarchical search, called Switch. Empirical results suggest that Switch yields faster search than straightforward modifications of Switchback, such as weighting the heuristic or greedy search. The success of Switch illustrates the potential for further research on specifically suboptimal hierarchical search.


Searching Without a Heuristic: Efficient Use of Abstraction

Larsen, Bradford John (University of New Hampshire) | Burns, Ethan (University of New Hampshire) | Ruml, Wheeler (University of New Hampshire) | Holte, Robert (University of Alberta)

AAAI Conferences

In problem domains where an informative heuristic evaluation function is not known or not easily computed, abstraction can be used to derive admissible heuristic values. Optimal path lengths in the abstracted problem are consistent heuristic estimates for the original problem. Pattern databases are the traditional method of creating such heuristics, but they exhaustively compute costs for all abstract states and are thus usually appropriate only when all instances share the same single goal state. Hierarchical heuristic search algorithms address these shortcomings by searching for paths in the abstract space on an as-needed basis. However, existing hierarchical algorithms search less efficiently than pattern database constructors: abstract nodes may be expanded many times during the course of a base-level search. We present a novel hierarchical heuristic search algorithm, called Switchback, that uses an alternating direction of search to avoid abstract node re-expansions. This algorithm is simple to implement and demonstrates superior performance to existing hierarchical heuristic search algorithms on several standard benchmarks.
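The core idea above, deriving admissible heuristic values from shortest paths in an abstract space, can be illustrated with a small sketch. This is not Switchback itself: it precomputes all abstract distances up front (closer in spirit to a pattern database) rather than searching the abstraction on demand with alternating direction, and the grid domain and coarsening factor are assumptions for illustration.

```python
from collections import deque

FACTOR = 4  # assumed coarsening factor: abstraction merges FACTOR x FACTOR cells

def abstract(state):
    """Map a concrete grid cell to its abstract cell."""
    x, y = state
    return (x // FACTOR, y // FACTOR)

def bfs_dists(start, size):
    """Breadth-first distances from `start` on a 4-connected size x size grid."""
    dist = {start: 0}
    q = deque([start])
    while q:
        x, y = q.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                q.append(nxt)
    return dist

def make_hierarchical_heuristic(goal, size):
    """Return h(s) = shortest-path length from abstract(s) to abstract(goal).

    Each concrete move maps to an abstract move or a self-loop, so the
    abstract distance never exceeds the concrete one: h is admissible
    and, as the abstract says, consistent.
    """
    abs_size = (size + FACTOR - 1) // FACTOR
    abs_dist = bfs_dists(abstract(goal), abs_size)
    return lambda s: abs_dist.get(abstract(s), float("inf"))
```

Switchback's contribution is to avoid exactly this exhaustive precomputation: it searches the abstract space lazily, alternating search direction between levels so that abstract nodes are not re-expanded across base-level queries.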