Tang, Hanlin
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Tang, Hanlin, Sun, Yifu, Wu, Decheng, Liu, Kai, Zhu, Jianchen, Kang, Zhanhui
Large language models (LLMs) have proven to be far superior to conventional methods in various tasks. However, their expensive computation and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using a few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence we explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? In this work, we propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors, outliers in the weights and the quantization range, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained in a few minutes even for LLMs over 100B parameters. To the best of our knowledge, this is the first work to achieve almost lossless quantization performance for LLMs in a data-independent setting, and our algorithm runs over 10 times faster than data-dependent methods.
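To make the two ingredients named above concrete, here is a minimal NumPy sketch of weight-only quantization that keeps a small fraction of outlier weights in full precision and grid-searches the quantization range to minimize reconstruction error. The function names, the outlier handling, and the grid-search strategy are illustrative assumptions, not EasyQuant's actual implementation.

```python
# Hypothetical sketch: outlier isolation + quantization-range search for one weight row.
import numpy as np

def quantize_row(w, n_bits=4, outlier_frac=0.01, n_grid=50):
    """Data-free, weight-only quantization of one weight row (illustrative only)."""
    # Keep the largest-magnitude weights (< 1%) in full precision.
    k = max(1, int(outlier_frac * w.size))
    mask = np.zeros_like(w, dtype=bool)
    mask[np.argsort(np.abs(w))[-k:]] = True
    inliers = w[~mask]

    # Search the clipping range that minimizes reconstruction error on the inliers.
    qmax = 2 ** (n_bits - 1) - 1
    best_err, best_scale = np.inf, None
    for frac in np.linspace(0.3, 1.0, n_grid):
        scale = frac * np.abs(inliers).max() / qmax
        q = np.clip(np.round(inliers / scale), -qmax - 1, qmax)
        err = np.sum((q * scale - inliers) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale

    # Reconstruct: quantized inliers plus untouched outliers.
    w_hat = w.copy()
    q = np.clip(np.round(inliers / best_scale), -qmax - 1, qmax)
    w_hat[~mask] = q * best_scale
    return w_hat, best_scale, mask

w = np.random.randn(4096) * 0.02
w[:5] *= 50.0  # inject a few outliers
w_hat, scale, outliers = quantize_row(w)
print("reconstruction MSE:", np.mean((w - w_hat) ** 2))
```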
On the geometry of generalization and memorization in deep neural networks
Stephenson, Cory, Padhy, Suchismita, Ganesh, Abhinav, Hui, Yue, Tang, Hanlin, Chung, SueYeon
This part of the gradient behaves similarly for permuted and unpermuted examples. In Eq. 25 we see that the contribution to the label-dependent part of the gradient from permuted examples vanishes for large datasets, while the contribution from unpermuted examples does not, provided the cross-correlation between input features and labels is nonzero. This suggests that with small weight initialization, the gradient descent dynamics initially ignores the labels of permuted examples. Figure A.1 shows a breakdown of how the two components of the gradient computed on both unpermuted and permuted examples evolve over the course of training for the different layers of the VGG16 model trained on CIFAR-100. We see that the label-dependent part behaves qualitatively differently for the unpermuted examples than for the permuted examples, as the permuted examples contribute close to zero early in training, in agreement with Eq. 25. The label-independent part of the gradient shows similar trends between unpermuted and permuted examples, though in the final epochs the unpermuted examples have a slightly larger label-independent gradient, indicating slightly greater model confidence on these examples. As the label-dependent and label-independent parts of the gradient have differing signs, they compete with each other and cancel when the loss is minimized, but are not independently zero and in fact grow during training. The slightly larger label-independent gradient for unpermuted examples is balanced by a correspondingly slightly larger label-dependent gradient at the end of training.
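The vanishing label-dependent gradient for permuted examples can be illustrated with a small NumPy experiment (a toy stand-in, not the paper's Eq. 25): for a model near zero initialization, the label-dependent part of the gradient is proportional to the input-label cross-correlation, which averages toward zero under label permutation but stays finite for the true labels.

```python
# Toy illustration (assumed setup): compare the input-label cross-correlation
# (1/N) * sum_i y_i x_i for true labels versus randomly permuted labels.
import numpy as np

rng = np.random.default_rng(0)
N, d = 50_000, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = np.sign(X @ w_true)          # labels correlated with inputs
y_perm = rng.permutation(y)      # permuted (random) labels

cc_true = (y[:, None] * X).mean(axis=0)
cc_perm = (y_perm[:, None] * X).mean(axis=0)
print("||label-dependent grad||, true labels:    ", np.linalg.norm(cc_true))
print("||label-dependent grad||, permuted labels:", np.linalg.norm(cc_perm))
```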
APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm
Tang, Hanlin, Gan, Shaoduo, Rajbhandari, Samyam, Lian, Xiangru, Liu, Ji, He, Yuxiong, Zhang, Ce
Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training many important tasks such as BERT and ImageNet. However, Adam is generally not compatible with information (gradient) compression technology, so communication usually becomes the bottleneck when parallelizing Adam. In this paper, we propose a communication-efficient {\bf A}dam-{\bf P}reconditioned {\bf M}omentum SGD algorithm -- named APMSqueeze -- which compresses gradients through an error-compensated method. The proposed algorithm achieves a convergence efficiency similar to Adam in terms of epochs, but significantly reduces the running time per epoch. In terms of end-to-end performance (including the full-precision preconditioning step), APMSqueeze provides a speed-up of up to $2$-$10\times$, depending on the network bandwidth. We also provide a theoretical analysis of the convergence and efficiency.
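The error-compensation pattern referred to in the abstract can be sketched as follows: the residual left over from compressing the previous message is folded back in before compressing the next gradient, so the compression error does not accumulate. The 1-bit sign compressor, the fixed diagonal preconditioner, and the hyperparameters below are illustrative assumptions rather than APMSqueeze's exact algorithm.

```python
# Hypothetical sketch: error-compensated compressed momentum SGD with a fixed preconditioner.
import numpy as np

def compress(v):
    """1-bit style compressor: keep only signs, rescale to preserve the l1 norm."""
    return np.abs(v).mean() * np.sign(v)

def apmsqueeze_step(w, grad, state, lr=1e-3, beta=0.9):
    # Error compensation: fold the previous compression residual into the gradient.
    corrected = grad + state["err"]
    compressed = compress(corrected)          # this is what would be communicated
    state["err"] = corrected - compressed     # residual kept locally for the next step
    # Momentum SGD preconditioned by a fixed Adam-style diagonal (precomputed elsewhere).
    state["m"] = beta * state["m"] + (1 - beta) * compressed
    return w - lr * state["m"] / (np.sqrt(state["precond"]) + 1e-8)

d = 1000
state = {"m": np.zeros(d), "err": np.zeros(d), "precond": np.ones(d)}
w = np.random.randn(d)
for _ in range(10):
    grad = w  # gradient of the toy objective 0.5 * ||w||^2
    w = apmsqueeze_step(w, grad, state)
print("||w|| after 10 steps:", np.linalg.norm(w))
```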
Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning
Khadka, Shauharda, Aflalo, Estelle, Marder, Mattias, Ben-David, Avrech, Miret, Santiago, Tang, Hanlin, Mannor, Shie, Hazan, Tamir, Majumdar, Somdeb
As modern neural networks have grown to billions of parameters, meeting tight latency budgets has become increasingly challenging. Approaches like compression, sparsification and network pruning have proven effective at tackling this problem - but they rely on modifications of the underlying network. In this paper, we look at a complementary approach of optimizing how tensors are mapped to on-chip memory in an inference accelerator while leaving the network parameters untouched. Since different memory components trade off capacity for bandwidth differently, a sub-optimal mapping can result in high latency. We introduce evolutionary graph reinforcement learning (EGRL) - a method combining graph neural networks, reinforcement learning (RL) and evolutionary search - that aims to find the optimal mapping to minimize latency. Furthermore, a set of fast, stateless policies guides the evolutionary search to improve sample efficiency. We train and validate our approach directly on the Intel NNP-I chip for inference using a batch size of 1. EGRL outperforms policy-gradient, evolutionary search and dynamic programming baselines on BERT, ResNet-101 and ResNet-50. We achieve a 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
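A toy sketch of the outer evolutionary loop over memory mappings is given below: candidate assignments of tensors to memory levels are mutated and selected by a latency estimate, with the population seeded by simple heuristics. The latency model, mutation scheme, and heuristic seeds are illustrative assumptions; the actual system evaluates candidates on the NNP-I chip and guides the search with a graph-neural-network policy.

```python
# Hypothetical sketch: evolutionary search over tensor-to-memory mappings with a toy cost model.
import random

random.seed(0)
MEM_LEVELS = ["dram", "llc", "sram"]                                # decreasing capacity, increasing bandwidth
TENSORS = [("act0", 8.0), ("act1", 4.0), ("w0", 2.0), ("w1", 1.0)]  # (name, size in MB)

def latency(mapping):
    """Toy cost: fast memories are cheap to access, but overflowing their capacity is penalized."""
    speed = {"dram": 1.0, "llc": 0.4, "sram": 0.1}
    cap = {"dram": float("inf"), "llc": 8.0, "sram": 2.0}
    cost, used = 0.0, {m: 0.0 for m in MEM_LEVELS}
    for (_, size), mem in zip(TENSORS, mapping):
        used[mem] += size
        cost += size * speed[mem]
    return cost + 100.0 * sum(max(0.0, used[m] - cap[m]) for m in MEM_LEVELS)

def mutate(mapping):
    child = list(mapping)
    child[random.randrange(len(child))] = random.choice(MEM_LEVELS)
    return child

# Seed the population with simple heuristic mappings plus random ones.
population = [["dram"] * len(TENSORS), ["sram"] * len(TENSORS)]
population += [[random.choice(MEM_LEVELS) for _ in TENSORS] for _ in range(14)]

for generation in range(200):
    population.sort(key=latency)
    parents = population[:4]                          # elitist selection
    population = parents + [mutate(random.choice(parents)) for _ in range(12)]

best = min(population, key=latency)
print("best mapping:", best, "estimated latency:", latency(best))
```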
Central Server Free Federated Learning over Single-sided Trust Social Networks
He, Chaoyang, Tan, Conghui, Tang, Hanlin, Qiu, Shuang, Liu, Ji
State-of-the-art federated learning adopts a centralized network architecture in which a central node collects the gradients sent from child agents to update the global model. Despite its simplicity, the centralized method suffers from communication and computational bottlenecks at the central node, especially for federated learning, where a large number of clients are usually involved. Moreover, to prevent reverse engineering of the user's identity, a certain amount of noise must be added to the gradient to protect user privacy, which partially sacrifices efficiency and accuracy (Shokri and Shmatikov, 2015). To further protect data privacy and avoid the communication bottleneck, the decentralized architecture has recently been proposed (Vanhaesebrouck et al., 2017; Bellet et al., 2018), where the central node is removed and each node communicates only with its neighbors (with mutual trust) by exchanging local models. Exchanging local models is usually preferred to sending private gradients for data privacy protection, because a local model is an aggregation or mixture of quite a large amount of data, while a local gradient directly reflects only one or a batch of private data samples. Although the advantages of the decentralized architecture over its state-of-the-art centralized counterpart are well recognized, it can usually only be run on networks with mutual trust. That is, two nodes (or users) can exchange their local models only if they trust each other reciprocally (e.g.
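The server-free update pattern described above can be sketched in a few lines: each node takes a local gradient step on its private objective and then mixes the models it receives from the neighbors it trusts, with no central aggregator. The directed trust graph, uniform mixing weights, and toy objective below are illustrative assumptions.

```python
# Hypothetical sketch: decentralized model mixing over a directed (single-sided) trust graph.
import numpy as np

d, n_nodes, lr = 10, 4, 0.05
rng = np.random.default_rng(0)
models = [rng.normal(size=d) for _ in range(n_nodes)]
targets = [rng.normal(size=d) for _ in range(n_nodes)]   # toy per-node objective data

# trusts[i] = nodes whose models node i is willing to receive; single-sided trust
# means the relation need not be symmetric.
trusts = {0: [0, 1], 1: [1, 2], 2: [2, 3, 0], 3: [3, 2]}

for step in range(200):
    # Local gradient step on each node's private objective 0.5 * ||x - target_i||^2.
    local = [x - lr * (x - targets[i]) for i, x in enumerate(models)]
    # Mix the models received from trusted in-neighbors (uniform weights here).
    models = [np.mean([local[j] for j in trusts[i]], axis=0) for i in range(n_nodes)]

spread = max(np.linalg.norm(models[i] - models[j])
             for i in range(n_nodes) for j in range(n_nodes))
print("spread across nodes after mixing:", spread)
```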
MLPerf Training Benchmark
Mattson, Peter, Cheng, Christine, Coleman, Cody, Diamos, Greg, Micikevicius, Paulius, Patterson, David, Tang, Hanlin, Wei, Gu-Yeon, Bailis, Peter, Bittorf, Victor, Brooks, David, Chen, Dehao, Dutta, Debojyoti, Gupta, Udit, Hazelwood, Kim, Hock, Andrew, Huang, Xinyuan, Jia, Bill, Kang, Daniel, Kanter, David, Kumar, Naveen, Liao, Jeffery, Narayanan, Deepak, Oguntebi, Tayo, Pekhimenko, Gennady, Pentecost, Lillian, Reddi, Vijay Janapa, Robie, Taylor, John, Tom St., Wu, Carole-Jean, Xu, Lingjie, Young, Cliff, Zaharia, Matei
Machine learning is experiencing an explosion of software and hardware solutions, and needs industry-standard performance benchmarks to drive design and enable competitive evaluation. However, machine learning training presents a number of unique challenges to benchmarking that do not exist in other domains: (1) some optimizations that improve training throughput actually increase time to solution, (2) training is stochastic and time to solution has high variance, and (3) the software and hardware systems are so diverse that they cannot be fairly benchmarked with the same binary, code, or even hyperparameters. We present MLPerf, a machine learning benchmark that overcomes these challenges. We quantitatively evaluate the efficacy of MLPerf in driving community progress on performance and scalability across two rounds of results from multiple vendors.
$\texttt{DeepSqueeze}$: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
Tang, Hanlin, Lian, Xiangru, Qiu, Shuang, Yuan, Lei, Zhang, Ce, Zhang, Tong, Liu, Ji
Communication is a key bottleneck in distributed training. Recently, an \emph{error-compensated} compression technology designed for \emph{centralized} learning has achieved great success, showing significant advantages over state-of-the-art compression-based methods in saving communication cost. Since \emph{decentralized} training has been shown to be superior to traditional \emph{centralized} training in communication-restricted scenarios, a natural question to ask is "how to apply the error-compensated technology to decentralized learning to further reduce the communication cost." However, a trivial extension of compression-based centralized training algorithms does not exist for the decentralized scenario: a key difference between centralized and decentralized training makes this extension extremely non-trivial. In this paper, we propose an elegant algorithmic design to employ error-compensated stochastic gradient descent in the decentralized scenario, named $\texttt{DeepSqueeze}$. Both theoretical analysis and an empirical study are provided to show that the proposed $\texttt{DeepSqueeze}$ algorithm outperforms existing compression-based decentralized learning algorithms. To the best of our knowledge, this is the first work to apply error-compensated compression to decentralized learning.
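A hedged sketch of combining error compensation with decentralized averaging, in the spirit of the abstract: each worker compresses the message it sends to its ring neighbors and keeps the compression residual locally, re-injecting it at the next round. The top-k compressor, ring topology, and mixing rule are illustrative assumptions, not the paper's exact double-pass scheme.

```python
# Hypothetical sketch: error-compensated compression inside a decentralized SGD loop.
import numpy as np

def topk(v, k):
    """Keep only the k largest-magnitude entries (a simple biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

d, n, lr, k = 20, 5, 0.05, 4
rng = np.random.default_rng(1)
x = [rng.normal(size=d) for _ in range(n)]            # local models
targets = [rng.normal(size=d) for _ in range(n)]      # toy local objectives
err = [np.zeros(d) for _ in range(n)]                 # per-worker error memory

for step in range(300):
    sent = []
    for i in range(n):
        # Local SGD step on 0.5 * ||x_i - target_i||^2.
        x[i] = x[i] - lr * (x[i] - targets[i])
        # Error compensation: fold the previous residual into the message, compress,
        # and keep the new residual locally for the next round.
        v = x[i] + err[i]
        c = topk(v, k)
        err[i] = v - c
        sent.append(c)
    # Each worker mixes its own model with the compressed messages from its ring neighbors.
    x = [(x[i] + sent[(i - 1) % n] + sent[(i + 1) % n]) / 3.0 for i in range(n)]

spread = max(np.linalg.norm(x[i] - x[j]) for i in range(n) for j in range(n))
print("spread across workers:", spread)
```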
Generalization to Novel Objects using Prior Relational Knowledge
Vijay, Varun Kumar, Ganesh, Abhinav, Tang, Hanlin, Bansal, Arjun
To solve tasks in new environments involving objects unseen during training, agents must reason over prior information about those objects and their relations. We introduce the Prior Knowledge Graph network, an architecture for combining prior information, structured as a knowledge graph, with a symbolic parsing of the visual scene, and demonstrate that this approach is able to apply learned relations to novel objects whereas the baseline algorithms fail. Ablation experiments show that the agents ground the knowledge graph relations to semantically-relevant behaviors. In both a Sokoban game and the more complex Pacman environment, our network is also more sample efficient than the baselines, reaching the same performance in 5-10x fewer episodes. Once the agents are trained with our approach, we can manipulate agent behavior by modifying the knowledge graph in semantically meaningful ways. These results suggest that our network provides a framework for agents to reason over structured knowledge graphs while still leveraging gradient based learning approaches.
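As a toy illustration of the idea above, prior knowledge can be written as a small graph of (subject, relation, object) triples and combined with a symbolic parse of the scene, so that relations learned for one object type transfer to a novel object sharing that type. The entities, relations, and lookup scheme are illustrative assumptions, not the paper's learned architecture.

```python
# Hypothetical sketch: applying knowledge-graph relations to novel objects via type edges.
knowledge_graph = [
    ("box", "pushable", "agent"),
    ("wall", "blocks", "agent"),
    ("crate", "is_a", "box"),          # a novel object inherits "box" behaviour
]

def relations_for(entity, kg):
    """Collect relations that apply to an entity, following 'is_a' edges to parent types."""
    rels = [(r, o) for s, r, o in kg if s == entity and r != "is_a"]
    for s, r, o in kg:
        if s == entity and r == "is_a":
            rels += relations_for(o, kg)
    return rels

scene = ["agent", "crate", "wall"]     # symbolic parse of the visual scene
for obj in scene:
    print(obj, "->", relations_for(obj, knowledge_graph))
```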
Communication Compression for Decentralized Training
Tang, Hanlin, Gan, Shaoduo, Zhang, Ce, Zhang, Tong, Liu, Ji
Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: {\em communication compression} for low-bandwidth networks, and {\em decentralization} for high-latency networks. In this paper, we explore a natural question: {\em can the combination of both techniques lead to a system that is robust to both bandwidth and latency?} Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design is challenging: unlike in centralized algorithms, simply compressing the exchanged information, even in an unbiased stochastic way, within the decentralized network would accumulate the error and cause divergence. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call {\em extrapolation compression} and {\em difference compression}. We analyze both algorithms and prove that both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate for full-precision, centralized training. We validate our algorithms and find that our proposed algorithm significantly outperforms the best of the merely decentralized and merely quantized algorithms for networks with {\em both} high latency and low bandwidth.
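A minimal sketch of the difference-compression idea named above (hypothetical details, not the paper's algorithm): instead of exchanging quantized models directly, each worker sends an unbiased stochastic quantization of the difference between its current model and the copy its neighbors already track, so neighbors can follow the model accurately with low-bandwidth messages.

```python
# Hypothetical sketch: quantized difference exchange in a decentralized training loop.
import numpy as np

rng = np.random.default_rng(2)

def stochastic_quantize(v, levels=16):
    """Unbiased stochastic quantization onto a uniform grid."""
    scale = np.abs(v).max() + 1e-12
    y = v / scale * levels
    low = np.floor(y)
    q = low + (rng.random(v.shape) < (y - low))
    return q / levels * scale

d, n, lr = 20, 4, 0.05
x = [rng.normal(size=d) for _ in range(n)]         # true local models
x_hat = [np.zeros(d) for _ in range(n)]            # copies tracked by neighbors
targets = [rng.normal(size=d) for _ in range(n)]   # toy local objectives

for step in range(200):
    # Everyone broadcasts a quantized *difference* and all tracked copies are updated.
    diffs = [stochastic_quantize(x[i] - x_hat[i]) for i in range(n)]
    x_hat = [x_hat[i] + diffs[i] for i in range(n)]
    # Local step: gradient of a toy objective plus gossip toward ring-neighbor copies.
    x = [x[i] - lr * (x[i] - targets[i])
         + 0.5 * (0.5 * (x_hat[(i - 1) % n] + x_hat[(i + 1) % n]) - x_hat[i])
         for i in range(n)]

drift = max(np.linalg.norm(x[i] - x_hat[i]) for i in range(n))
print("max drift between a model and its tracked copy:", drift)
```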