TinyLUT: Tiny Look-Up Table for Efficient Image Restoration at the Edge

Li, Huanan

Neural Information Processing Systems

Look-up table (LUT)-based methods have recently shown enormous potential in image restoration tasks and can significantly accelerate inference. However, LUT size grows exponentially with the convolution kernel size, creating a storage bottleneck that limits broader application on edge devices.
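
A rough back-of-the-envelope sketch (not taken from the paper) of the storage growth the abstract refers to: a LUT that caches every possible output of a convolution over 8-bit inputs needs one entry per input combination, i.e. 256^(k*k) entries for a k x k kernel.

```python
# Back-of-the-envelope LUT sizing: one entry per possible 8-bit input pattern
# in a k x k receptive field. Numbers are illustrative, not from the paper.

BITS_PER_PIXEL = 8

def full_lut_entries(kernel_size: int) -> int:
    """Entries needed to tabulate every input combination of a k x k window."""
    num_inputs = kernel_size * kernel_size
    return (2 ** BITS_PER_PIXEL) ** num_inputs

for k in (1, 2, 3):
    print(f"{k}x{k} kernel -> {full_lut_entries(k):.3e} entries")
# 1x1 -> 2.560e+02, 2x2 -> 4.295e+09, 3x3 -> 4.722e+21: an exponential blow-up,
# which is why practical LUT methods restrict the receptive field or split tables.
```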



Differentiable Weightless Controllers: Learning Logic Circuits for Continuous Control

Kresse, Fabian, Lampert, Christoph H.

arXiv.org Artificial Intelligence

We investigate whether continuous-control policies can be represented and learned as discrete logic circuits instead of continuous neural networks. We introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that maps real-valued observations to actions using thermometer-encoded inputs, sparsely connected boolean lookup-table layers, and lightweight action heads. DWCs can be trained end-to-end by gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with weight-based policies (full precision or quantized neural networks), matching performance on four tasks and isolating network capacity as the key limiting factor on HalfCheetah. Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling a direct inspection of which input thresholds influence control decisions.
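
For readers unfamiliar with the input representation named in the abstract, here is a minimal generic sketch of thermometer encoding; the value range, bit count, and evenly spaced thresholds are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def thermometer_encode(x: np.ndarray, low: float, high: float, n_bits: int) -> np.ndarray:
    """Encode each scalar in x as n_bits monotone bits: bit i is 1 iff x >= threshold_i.

    Thresholds are evenly spaced in [low, high]; values outside the range saturate.
    This is a generic sketch of thermometer encoding, not the DWC paper's exact scheme.
    """
    thresholds = np.linspace(low, high, n_bits)            # (n_bits,)
    return (x[..., None] >= thresholds).astype(np.uint8)   # (..., n_bits)

obs = np.array([-1.2, 0.0, 0.7])                 # a toy 3-dimensional observation
bits = thermometer_encode(obs, -2.0, 2.0, n_bits=8)
print(bits)
# Each row is monotone, a run of 1s followed by 0s, e.g. 0.7 -> [1 1 1 1 1 0 0 0];
# these bit vectors are what the boolean lookup-table layers consume.
```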


T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Oh, Hyunwoo, Nam, KyungIn, Bhattacharjya, Rajat, Chen, Hanning, Das, Tamoghno, Yun, Sanggeon, Jang, Suyeon, Ding, Andrew, Dutt, Nikil, Imani, Mohsen

arXiv.org Artificial Intelligence

Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
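
As a software-only illustration of the LUT idea behind ternary GEMM/GEMV (the in-register SIMD mechanism that is T-SAR's actual contribution is not modeled here), the sketch below groups ternary weights and replaces per-element multiply-adds with one table lookup per group; the group size and layout are assumptions for illustration.

```python
import itertools
import numpy as np

def ternary_gemv_lut(W: np.ndarray, x: np.ndarray, group: int = 4) -> np.ndarray:
    """Sketch of LUT-based GEMV with ternary weights in {-1, 0, +1}.

    For each group of `group` activations, precompute the dot product with every
    possible ternary weight pattern (3**group entries); each weight row then needs
    only one table lookup per group instead of multiply-adds.
    """
    rows, cols = W.shape
    assert cols % group == 0
    patterns = np.array(list(itertools.product((-1, 0, 1), repeat=group)))  # (3^g, g)

    y = np.zeros(rows)
    for g0 in range(0, cols, group):
        x_grp = x[g0:g0 + group]
        lut = patterns @ x_grp                    # (3^g,) partial sums, built once per group
        # map each row's ternary group to its pattern index (base-3 digits of w+1)
        idx = ((W[:, g0:g0 + group] + 1) * (3 ** np.arange(group - 1, -1, -1))).sum(axis=1)
        y += lut[idx]
    return y

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))   # ternary weight matrix
x = rng.standard_normal(16)
assert np.allclose(ternary_gemv_lut(W, x), W @ x)
```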


LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons

Nag, Shashank, Bacellar, Alan T. L., Susskind, Zachary, Jha, Anshul, Liberty, Logan, Sivakumar, Aishwarya, John, Eugene B., Kailas, Krishnan, Lima, Priscila M. V., Yadwadkar, Neeraja J., Franca, Felipe M. G., John, Lizy K.

arXiv.org Artificial Intelligence

Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands are a challenge for edge inference on FPGAs -- a field that has seen a recent surge in demand. We recognize the benefits of recent works on logic- and Look Up Table (LUT)-based networks, such as LogicNets, NeuraLUT, and DWN, in offering models that simultaneously reduce both the memory and compute footprints. However, these models natively do not perform well on common vision tasks such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge-optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization, which reveals that a majority of model weights and computations are in the channel mixer (MLP layer), we design an alternate LUT-based channel mixer and simultaneously develop an FPGA-based accelerator for LL-ViT. Contrary to some attempts to replace each multiplication with a table lookup, our architecture uses a neural learning approach that natively learns the LUT functions. This approach allows for reduced model sizes and a computationally and energy-efficient inference solution for vision transformer models. Evaluating on edge-suitable workloads, we achieve accuracies of 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet, comparable to the baseline transformer. LL-ViT eliminates over 60% of the model weights and 50% of the multiplications in the model, and achieves 1.9x higher energy efficiency and 1.3x lower latency than an integer-quantized ViT accelerator, while also offering superior throughput against prior works at a 10.9W power budget.
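
To make the notion of a LUT neuron concrete, here is a minimal inference-time sketch: each neuron reads a small, sparse subset of binarized channels and maps the bit pattern through its own truth table, with no multiplications. The fan-in, wiring, and table contents below are placeholder assumptions, not LL-ViT's learned values.

```python
import numpy as np

class LUTNeuronLayer:
    """Inference-time sketch of a layer of LUT neurons, in the spirit of the
    LUT-based channel mixer the abstract describes (sizes, binarization, and
    table contents are illustrative, not LL-ViT's design)."""

    def __init__(self, in_channels: int, out_channels: int, fan_in: int = 6, seed: int = 0):
        rng = np.random.default_rng(seed)
        # which input bits each neuron reads (learned/structured in the real model)
        self.wiring = rng.integers(0, in_channels, size=(out_channels, fan_in))
        # one truth table of 2**fan_in entries per neuron (learned during training)
        self.tables = rng.integers(0, 2, size=(out_channels, 2 ** fan_in)).astype(np.float32)
        self.weights = 2 ** np.arange(fan_in)[::-1]        # bit pattern -> table index

    def __call__(self, x_bits: np.ndarray) -> np.ndarray:
        # x_bits: (batch, in_channels) array of 0/1 activations
        selected = x_bits[:, self.wiring]                   # (batch, out_channels, fan_in)
        idx = (selected * self.weights).sum(axis=-1).astype(int)
        neuron = np.arange(self.tables.shape[0])
        return self.tables[neuron, idx]                     # pure table lookups

layer = LUTNeuronLayer(in_channels=64, out_channels=32)
x_bits = (np.random.default_rng(1).random((4, 64)) > 0.5).astype(np.float32)
print(layer(x_bits).shape)   # (4, 32)
```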


Gradient Estimation Methods of Approximate Multipliers for High-Accuracy Retraining of Deep Learning Models

Meng, Chang, Burleson, Wayne, De Micheli, Giovanni

arXiv.org Artificial Intelligence

Approximate multipliers (AppMults) are widely used in deep learning accelerators to reduce their area, delay, and power consumption. However, AppMults introduce arithmetic errors into deep learning models, necessitating a retraining process to recover accuracy. A key step in retraining is computing the gradient of the AppMult, i.e., the partial derivative of the approximate product with respect to each input operand. Existing approaches typically estimate this gradient using that of the accurate multiplier (AccMult), which can lead to suboptimal retraining results. To address this, we propose two methods to obtain more precise gradients of AppMults. The first, called LUT-2D, characterizes the AppMult gradient with 2-dimensional lookup tables (LUTs), providing fine-grained estimation and achieving the highest retraining accuracy. The second, called LUT-1D, is a compact and more efficient variant that stores gradient values in 1-dimensional LUTs, achieving comparable retraining accuracy with shorter runtime. Experimental results show that on CIFAR-10 with convolutional neural networks, our LUT-2D and LUT-1D methods improve retraining accuracy by 3.83% and 3.72% on average, respectively. On ImageNet with vision transformer models, our LUT-1D method improves retraining accuracy by 23.69% on average, compared to a state-of-the-art retraining framework. Modern artificial intelligence (AI) technologies excel in a wide range of areas such as natural language processing and computer vision. However, this rapid growth raises serious concerns about power consumption [1]. To achieve energy-efficient deep learning accelerators, researchers have adopted an emerging design paradigm called approximate computing, which reduces power consumption at the cost of errors [2], [3]. Approximate computing is particularly suitable for deep learning accelerators, since they are inherently resilient to errors and noise.
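
A hypothetical sketch of how such gradient tables might be assembled; the placeholder approximate multiplier, the finite-difference construction of the 2-D table, and the averaging used for the 1-D variant are our own assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def approx_mult(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Placeholder approximate multiplier: truncate the 2 LSBs of each operand.
    Stands in for a real AppMult design; it is NOT one from the paper."""
    return (a & ~0x3) * (b & ~0x3)

# Tabulate the approximate product over all 8-bit operand pairs.
a = np.arange(256)
A, B = np.meshgrid(a, a, indexing="ij")
P = approx_mult(A, B)

# "LUT-2D"-style table: a finite-difference gradient d(appmult)/da for every (a, b) pair.
grad2d_a = np.zeros_like(P, dtype=np.float64)
grad2d_a[:-1, :] = P[1:, :] - P[:-1, :]          # forward difference along a
grad2d_a[-1, :] = grad2d_a[-2, :]

# "LUT-1D"-style table: compress by averaging over the other operand, so the
# gradient w.r.t. a is looked up from b alone.
grad1d_a = grad2d_a.mean(axis=0)                  # indexed by b

# During retraining, the backward pass would read these tables instead of using
# the exact-multiplier gradient d(a*b)/da = b.
print(grad2d_a.shape, grad1d_a.shape)             # (256, 256) and (256,)
```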



Mixture of Lookup Experts

Jie, Shibo, Tang, Yehui, Han, Kai, Li, Yitong, Tang, Duyu, Deng, Zhi-Hong, Wang, Yunhe

arXiv.org Artificial Intelligence

Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE selects experts dynamically, all of them need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the experts' precomputed outputs based on input ids and load them into VRAM, so the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.
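
A minimal numpy sketch of the re-parameterization the abstract describes: because each expert only ever sees the embedding of a token id, its output can be tabulated per id offline, and inference reduces to row lookups. The sizes, FFN form, and routing below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, n_experts = 1000, 64, 256, 4

# Toy stand-ins for the trained embedding table and expert FFNs (random here).
embedding = rng.standard_normal((vocab_size, d_model))
experts = [
    {"w1": rng.standard_normal((d_model, d_ff)), "w2": rng.standard_normal((d_ff, d_model))}
    for _ in range(n_experts)
]

def expert_ffn(e, h):
    return np.maximum(h @ e["w1"], 0.0) @ e["w2"]     # simple ReLU FFN

# Re-parameterization: tabulate each expert's output for every token id, offline.
expert_luts = np.stack([expert_ffn(e, embedding) for e in experts])  # (n_experts, vocab, d_model)

# Inference: no expert computation, just row lookups for whichever expert was routed.
token_ids = np.array([3, 17, 256])
routed_expert = np.array([0, 2, 1])                   # whatever the router picked
outputs = expert_luts[routed_expert, token_ids]       # (3, d_model)
assert np.allclose(outputs[0], expert_ffn(experts[0], embedding[3]))
```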


nanoML for Human Activity Recognition

Bacellar, Alan T. L., Jadhao, Mugdha P., Nag, Shashank, Lima, Priscila M. V., Franca, Felipe M. G., John, Lizy K.

arXiv.org Artificial Intelligence

Human Activity Recognition (HAR) is critical for applications in healthcare, fitness, and IoT, but deploying accurate models on resource-constrained devices remains challenging due to high energy and memory demands. This paper demonstrates the application of Differentiable Weightless Neural Networks (DWNs) to HAR, achieving competitive accuracies of 96.34% and 96.67% while consuming only 56nJ and 104nJ per sample, with an inference time of just 5ns per sample. The DWNs were implemented and evaluated on an FPGA, showcasing their practical feasibility for energy-efficient hardware deployment. DWNs achieve up to 926,000x energy savings and 260x memory reduction compared to state-of-the-art deep learning methods. These results position DWNs as a nano-machine-learning (nanoML) model for HAR, setting a new benchmark in energy efficiency and compactness for edge and wearable devices and paving the way for ultra-efficient edge AI.