Goto

Collaborating Authors

 Krishna, Tushar


RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU

arXiv.org Artificial Intelligence

As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice as matrix engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls, since the limited register storage cannot amortize the fill and drain times of the array. To address this, we propose RASA, a Register-Aware Systolic Array. We develop techniques to divide an execution stage into several sub-stages and overlap instructions, hiding the fill and drain overheads by running them concurrently. RASA-based designs improve performance significantly with negligible area and power overhead.
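
A minimal timing sketch of the fill/drain amortization argument (an illustrative first-order model, not the paper's microarchitecture; function names and the ~n-cycle fill/drain assumption are mine):

```python
# Cycles for k back-to-back matrix instructions on an n x n systolic array,
# assuming fill and drain each take ~n cycles (a common first-order model).
def serial_cycles(n, k, compute=None):
    """Every instruction pays the full fill and drain latency."""
    compute = n if compute is None else compute
    return k * (n + compute + n)              # fill + compute + drain, per instruction

def overlapped_cycles(n, k, compute=None):
    """RASA-style overlap (sketch): the next instruction fills while the
    previous one drains, so the fill/drain cost is paid only once."""
    compute = n if compute is None else compute
    return (n + compute + n) + (k - 1) * compute

if __name__ == "__main__":
    print(serial_cycles(16, 32), "vs", overlapped_cycles(16, 32))  # 1536 vs 544
```

Under these assumptions, overlapping sub-stages hides almost two thirds of the cycles for a 16x16 array, which is the under-utilization the abstract refers to.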


AIRCHITECT: Learning Custom Architecture Design and Mapping Space

arXiv.org Artificial Intelligence

Design space exploration is an important but costly step in the design/deployment of custom architectures to squeeze out maximum possible performance and energy efficiency. Conventionally, optimization requires iterative sampling of the design space using simulation or heuristic tools. In this paper we investigate the possibility of learning the optimization task with machine learning, and then using the learnt model to predict optimal parameters for the design and mapping space of custom architectures, bypassing any exploration step. We use three case studies involving optimal array design, SRAM buffer sizing, mapping, and schedule determination for systolic-array-based custom architectures. Within the purview of these case studies, we show that it is possible to capture the design space and train a model to "generalize", i.e., to predict the optimal design and mapping parameters when queried with workload and design constraints. We perform systematic design-aware and statistical analysis of the optimization space for our case studies and highlight the patterns in the design space. We formulate the architecture design and mapping as a machine learning problem, which allows us to leverage existing ML models for training and inference. We design and train a custom network architecture called AIRCHITECT, which is capable of learning the architecture design space with up to 94.3% test accuracy and predicting optimal configurations that achieve, on average (GeoMean), 99.9% of the best possible performance on a test dataset of $10^5$ GEMM workloads.
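
A toy sketch of the learn-to-predict idea (the features, the stand-in label oracle, and the tiny MLP are all illustrative, not the actual AIRCHITECT network or dataset): a classifier maps a GEMM workload directly to the index of a best configuration, replacing iterative search at query time.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(1, 4096, size=(5000, 3)).astype(float)      # (M, N, K) workloads
# Stand-in oracle: pretend the optimal one of 4 configs depends on the shape.
y = (X[:, 0] > X[:, 1]).astype(int) * 2 + (X[:, 2] > 1024).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
clf.fit(np.log2(X), y)                                       # log-scaled shape features
print("predicted config id:", clf.predict(np.log2([[1024.0, 256.0, 512.0]]))[0])
```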


Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

arXiv.org Artificial Intelligence

There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators.
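
For concreteness, here is a minimal tiled GEMM in NumPy, the kernel shape such a cost model reasons about: the tile sizes (Tm, Tn, Tk) and loop order determine data reuse, and a mapper would pick them per accelerator (the values below are arbitrary).

```python
import numpy as np

def tiled_gemm(A, B, Tm=32, Tn=32, Tk=32):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, Tm):                # each (m, n) output tile...
        for n in range(0, N, Tn):
            for k in range(0, K, Tk):        # ...accumulates partial sums over k
                C[m:m+Tm, n:n+Tn] += A[m:m+Tm, k:k+Tk] @ B[k:k+Tk, n:n+Tn]
    return C

A, B = np.random.rand(128, 96), np.random.rand(96, 64)
assert np.allclose(tiled_gemm(A, B), A @ B)  # matches a plain GEMM
```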


Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator Scheduling

arXiv.org Artificial Intelligence

As Deep Learning continues to drive a variety of applications in datacenters and HPC, there is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets. This work looks at the problem of supporting multi-tenancy on such accelerators. In particular, we focus on the problem of mapping layers from several DNNs simultaneously on an accelerator. Given the extremely large search space, we formulate the search as an optimization problem and develop a specialized genetic algorithm called G# with custom operators to enable structured, sample-efficient exploration. We quantitatively compare G# with several common heuristics, state-of-the-art optimization methods, and reinforcement learning methods across different accelerator settings (large/small accelerators) and different sub-accelerator configurations (homogeneous/heterogeneous), and observe that G# can consistently find better solutions. Further, to enable real-time scheduling, we also demonstrate a method to generalize the learnt schedules and transfer them to the next batch of jobs, reducing schedule compute time to near zero.
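
A generic genetic-algorithm sketch of the mapping problem (the selection, crossover, and mutation operators below are standard placeholders, not G#'s custom domain-specific operators, and `cost` is a toy latency table): a schedule assigns each of L layers to one of S sub-accelerators, and fitness is the negated makespan.

```python
import random
random.seed(0)

L, S, POP, GENS = 12, 4, 40, 100
cost = [[random.uniform(1, 10) for _ in range(S)] for _ in range(L)]  # toy latencies

def fitness(sched):                        # makespan: max load over sub-accelerators
    load = [0.0] * S
    for layer, acc in enumerate(sched):
        load[acc] += cost[layer][acc]
    return -max(load)                      # higher is better

def crossover(a, b):
    cut = random.randrange(1, L)
    return a[:cut] + b[cut:]

def mutate(sched, rate=0.1):
    return [random.randrange(S) if random.random() < rate else g for g in sched]

pop = [[random.randrange(S) for _ in range(L)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:POP // 4]                 # keep the best quarter each generation
    pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                   for _ in range(POP - len(elite))]
print("best makespan:", -fitness(max(pop, key=fitness)))
```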


Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

arXiv.org Machine Learning

Using multiple nodes and parallel computing algorithms has become a principal tool to improve training and execution times of deep neural networks as well as effective collective intelligence in sensor networks. In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers) where the deep model is divided into several parallel sub-models, each of which is executed by a worker. Since latency due to synchronization and data transfer among workers negatively impacts the performance of the parallel implementation, it is desirable to have minimal interdependency among parallel sub-models. To achieve this goal, we propose to rearrange the neurons in the neural network and partition them (without changing the general topology of the neural network), such that the interdependency among sub-models is minimized under the computation and communication constraints of the workers. We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model. To efficiently apply RePurpose, we propose an approach based on $\ell_0$ optimization and the Munkres assignment algorithm. We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation, both in terms of communication and computational complexity.
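
A sketch of the assignment step only (a toy stand-in for how the Munkres algorithm can be used here, with a hypothetical `cost` matrix; the paper's $\ell_0$ formulation is not reproduced): place each neuron of a layer into a partition slot so that cross-worker dependency cost is minimized, then permute the layer's weight rows accordingly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Munkres / Hungarian algorithm

rng = np.random.default_rng(0)
n = 8                                            # neurons == slots (toy size)
# cost[i, j]: cross-partition edges incurred if neuron i lands in slot j
cost = rng.integers(0, 10, size=(n, n))

rows, cols = linear_sum_assignment(cost)         # optimal one-to-one placement
order = np.argsort(cols)                         # neuron occupying each slot
W = rng.standard_normal((n, 16))                 # layer weights, one row per neuron
W_restructured = W[order]                        # next layer's columns follow suit
print("total assignment cost:", cost[rows, cols].sum())
```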


Generative Design of Hardware-aware DNNs

arXiv.org Machine Learning

To efficiently run DNNs on the edge/cloud, many new DNN inference accelerators are being designed and deployed frequently. Model quantization is a widely-used approach to enhance the resource efficiency of DNNs. However, different accelerators/HW platforms have different resources, requiring a specialized quantization strategy for each. Moreover, using the same quantization for every layer may be sub-optimal, which further increases the design space of possible quantization choices and makes manual tuning infeasible. Recent work on automatically determining per-layer quantization is driven by optimization methods such as reinforcement learning. However, these approaches need to re-train the RL agent for every new HW platform. We propose a new way of performing autonomous quantization and HW-aware tuning. We propose a generative model, AQGAN, which takes a target accuracy as the condition and generates a suite of quantization configurations. With the conditional generative model, the user can autonomously generate different configurations for different targets at inference time. Moreover, we propose a simplified HW-tuning flow that uses the generative model to generate proposals and performs a simple selection based on the HW resource budget, a process that is fast and interactive. We evaluate our model on five widely-used efficient models on the ImageNet dataset. We compare with existing uniform quantization and state-of-the-art autonomous quantization methods. Our generative model achieves competitive accuracy at roughly two orders of magnitude lower search cost per design point, and the generated quantization configurations lead to less than 3.5% error across all experiments.
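
A structural sketch of the conditional-generation flow (the network sizes, bitwidth palette, and budget are guesses, and no GAN training loop is shown; this is not AQGAN itself): a generator conditioned on a target accuracy emits per-layer bitwidth configurations, and a simple filter keeps proposals that fit the HW budget.

```python
import torch
import torch.nn as nn

N_LAYERS, BITS = 8, [2, 4, 6, 8]

gen = nn.Sequential(                      # maps (noise, target_acc) -> layer logits
    nn.Linear(16 + 1, 64), nn.ReLU(),
    nn.Linear(64, N_LAYERS * len(BITS)),
)

def propose(target_acc, n=32, budget_bits=48):
    z = torch.randn(n, 16)
    cond = torch.full((n, 1), target_acc)
    logits = gen(torch.cat([z, cond], dim=1)).view(n, N_LAYERS, len(BITS))
    choice = logits.argmax(dim=-1)                       # one bitwidth per layer
    configs = [[BITS[i] for i in row] for row in choice.tolist()]
    return [c for c in configs if sum(c) <= budget_bits]  # HW-budget selection

print(propose(0.75)[:2])                  # two sampled per-layer bitwidth configs
```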


Conditional Neural Architecture Search

arXiv.org Machine Learning

Designing resource-efficient Deep Neural Networks (DNNs) is critical for deploying deep learning solutions on edge platforms with diverse performance, power, and memory budgets. Unfortunately, it is often the case that a well-trained ML model does not fit the constraints of the target edge platform, causing a long iteration of model reduction and retraining. Moreover, an ML model optimized for platform A may not be suitable when deployed on platform B, causing yet another round of retraining. We propose a conditional neural architecture search method using a GAN, which produces feasible ML models for different platforms. We present a new workflow to generate constraint-optimized DNN models. To our knowledge, this is the first work to bring conditional and adversarial techniques into the Neural Architecture Search domain. We verify the method on regression problems and on CIFAR-10 classification. The proposed workflow can successfully generate resource-optimized MLP- or CNN-based networks.
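
A toy illustration of conditioning architecture generation on platform budgets (this is not the paper's GAN; the condition vector, width palette, and the untrained generator are all invented for the sketch): a condition of (latency, memory) budgets steers the choice of hyperparameters, from which a concrete model is instantiated.

```python
import torch
import torch.nn as nn

WIDTHS = [32, 64, 128, 256]
cond_gen = nn.Sequential(nn.Linear(2 + 8, 32), nn.ReLU(),
                         nn.Linear(32, len(WIDTHS)))  # (budgets, noise) -> width logits

def build_model(latency_budget, mem_budget, in_dim=10, out_dim=1):
    cond = torch.tensor([[latency_budget, mem_budget]], dtype=torch.float32)
    z = torch.randn(1, 8)
    width = WIDTHS[cond_gen(torch.cat([cond, z], dim=1)).argmax().item()]
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, out_dim))

print(build_model(5.0, 2.0))              # a concrete constraint-conditioned MLP
```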


Reinforcement Learning based Interconnection Routing for Adaptive Traffic Optimization

arXiv.org Artificial Intelligence

Applying Machine Learning (ML) techniques to design and optimize computer architectures is a promising research direction. Optimizing the runtime performance of a Network-on-Chip (NoC) necessitates a continuous learning framework. In this work, we demonstrate the promise of applying reinforcement learning (RL) to optimize NoC runtime performance. We present three RL-based methods for learning optimal routing algorithms. The experimental results show the algorithms can successfully learn a near-optimal solution across different environment states. Reproducible Code: github.com/huckiyang/interconnect-routing-gym
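
A tabular Q-learning sketch of adaptive routing (a toy 4x4 mesh with a fixed destination and hop-count reward, not the exact environment in the linked repo): the state is the current router, actions are output ports, and penalizing each hop makes near-shortest paths emerge.

```python
import random
random.seed(0)

W = H = 4
DST = (3, 3)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]          # E, W, N, S ports
Q = {(x, y): [0.0] * 4 for x in range(W) for y in range(H)}

def step(s, a):
    nx, ny = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if not (0 <= nx < W and 0 <= ny < H):
        return s, -5.0                                # bounced off the mesh edge
    return (nx, ny), (0.0 if (nx, ny) == DST else -1.0)

for _ in range(2000):                                 # episodes of packet delivery
    s = (0, 0)
    while s != DST:
        a = (random.randrange(4) if random.random() < 0.1
             else max(range(4), key=lambda i: Q[s][i]))   # epsilon-greedy
        s2, r = step(s, a)
        target = r if s2 == DST else r + 0.9 * max(Q[s2])
        Q[s][a] += 0.5 * (target - Q[s][a])           # TD update
        s = s2

best = max(range(4), key=lambda i: Q[(0, 0)][i])
print("greedy port at (0,0):", ["E", "W", "N", "S"][best])
```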