contraction path
Supplementary Material: Einsum Benchmark
Blacher, Mark
For what purpose was the dataset created? The dataset was created for two primary purposes. First, it serves as a benchmark for einsum libraries, enabling the assessment of both the efficiency in determining contraction paths and the performance in executing einsum expressions. The dataset instances were created by the authors. Who funded the creation of the dataset?
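The two stages the benchmark is meant to assess, finding a contraction path and then executing the einsum expression along it, can be illustrated with a minimal sketch. It uses NumPy's `einsum_path` and `einsum` rather than the benchmark's own tooling, and the expression, tensor shapes, and the `greedy` strategy are illustrative assumptions, not instances from the dataset.

```python
import numpy as np

# Illustrative operands (assumed shapes, not taken from the benchmark).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 64))
B = rng.standard_normal((64, 64))
C = rng.standard_normal((64, 4))
expr = "ij,jk,kl->il"

# Stage 1: determine a contraction path (an order of pairwise contractions).
path, report = np.einsum_path(expr, A, B, C, optimize="greedy")

# Stage 2: execute the einsum expression along the precomputed path.
result = np.einsum(expr, A, B, C, optimize=path)

# Sanity check against a plain matrix-chain product.
reference = A @ B @ C

print(path)    # list starting with 'einsum_path', then index pairs
print(report)  # human-readable summary of the chosen path and its cost
```

Separating the two stages makes it possible to time path finding and execution independently, which is the distinction the benchmark description draws.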
Comprehensive Design Space Exploration for Tensorized Neural Network Hardware Accelerators
Zhang, Jinsong, Li, Minghe, Tian, Jiayi, Lu, Jinming, Zhang, Zheng
High-order tensor decomposition has been widely adopted to obtain compact deep neural networks for edge deployment. However, existing studies focus primarily on its algorithmic advantages, such as accuracy and compression ratio, while overlooking hardware deployment efficiency. Such hardware-unaware designs often obscure the potential latency and energy benefits of tensorized models. Although several works attempt to reduce computational cost by optimizing the contraction sequence based on the number of multiply-accumulate operations, they typically neglect the underlying hardware characteristics, resulting in suboptimal real-world performance. We observe that the contraction path, hardware architecture, and dataflow mapping are tightly coupled and must be optimized jointly within a unified design space to maximize deployment efficiency on real devices. To this end, we propose a co-exploration framework that brings these dimensions into a single design space for efficient training and inference of tensorized neural networks on edge platforms. The framework formulates a latency-oriented search objective and solves it via a global latency-driven exploration across the unified design space to achieve end-to-end model efficiency. The optimized configurations are implemented on a configurable FPGA kernel, achieving up to 4x and 3.85x lower inference and training latency, respectively, compared with the dense baseline.
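To make the multiply-accumulate (MAC) criterion mentioned in the abstract concrete, here is a minimal sketch that counts MACs for the two possible contraction orders of a three-matrix chain. The function names and the shapes are hypothetical, and the count is a purely hardware-agnostic proxy: it is the kind of metric the abstract argues is insufficient on its own, not the paper's latency model.

```python
def pairwise_macs(m, k, n):
    """MACs needed to multiply an (m x k) matrix by a (k x n) matrix."""
    return m * k * n


def chain_macs(m, k, n, p):
    """MAC counts for both orders of contracting A(m,k) B(k,n) C(n,p)."""
    left_first = pairwise_macs(m, k, n) + pairwise_macs(m, n, p)   # (A B) C
    right_first = pairwise_macs(k, n, p) + pairwise_macs(m, k, p)  # A (B C)
    return {"(AB)C": left_first, "A(BC)": right_first}


# Assumed illustrative shapes: A is 8x64, B is 64x64, C is 64x4.
costs = chain_macs(m=8, k=64, n=64, p=4)
best = min(costs, key=costs.get)

print(costs)  # {'(AB)C': 34816, 'A(BC)': 18432}
print(best)   # A(BC)
```

Both orders produce the same result, but the MAC count differs by nearly 2x here. On real hardware the ranking can still change once memory traffic and dataflow mapping are considered, which is the coupling the abstract describes.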
TENNs-PLEIADES: Building Temporal Kernels with Orthogonal Polynomials
We introduce a neural network named PLEIADES (PoLynomial Expansion In Adaptive Distributed Event-based Systems), belonging to the TENNs (Temporal Neural Networks) architecture. We focus on interfacing these networks with event-based data to perform online spatiotemporal classification and detection with low latency. By virtue of using structured temporal kernels and event-based data, we have the freedom to vary the sample rate of the data along with the discretization step-size of the network without additional finetuning. We experimented with three event-based benchmarks and obtained state-of-the-art results on all three by large margins with significantly smaller memory and compute costs. We achieved: 1) 99.59% accuracy with 192K parameters on the DVS128 hand gesture recognition dataset and 100% with a small additional output filter; 2) 99.58% test accuracy with 277K parameters on the AIS 2024 eye tracking challenge; and 3) 0.556 mAP with 576K parameters on the PROPHESEE 1 Megapixel Automotive Detection Dataset.
CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization
Yang, Zi, Choudhary, Samridhi, Xie, Xinfeng, Gao, Cao, Kunzmann, Siegfried, Zhang, Zheng
Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models requires massive GPU resources and computing time. The high training cost has become affordable only to big tech companies, while also raising concerns about the environmental impact. This paper presents CoMERA, a Computing- and Memory-Efficient training method via Rank-Adaptive tensor optimization. CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation, and improves the training to provide both a high compression ratio and excellent accuracy in the training process. Our optimized numerical computation (e.g., optimized tensorized embedding and tensor-vector contractions) and GPU implementation eliminate part of the run-time overhead in the tensorized training on GPU. This leads to, for the first time, a $2-3\times$ speedup per training epoch compared with standard training. CoMERA also outperforms the recent GaLore in terms of both memory and computing efficiency. Specifically, CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training. With further HPC optimization, CoMERA may significantly reduce the training cost of large language models.