RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI
Muhammed Yildirim, Ozcan Ozturk
–arXiv.org Artificial Intelligence
Abstract -- The increasing demand for on-device intelligence in Edge AI and TinyML applications requires the efficient execution of modern Convolutional Neural Networks (CNNs). While lightweight architectures like MobileNetV2 employ Depthwise Separable Convolutions (DSC) to reduce computational complexity, their multi-stage design introduces a critical performance bottleneck inherent to layer-by-layer execution: the high energy and latency cost of transferring intermediate feature maps to either large on-chip buffers or off-chip DRAM. To address this memory wall, this paper introduces a novel hardware accelerator architecture that utilizes a fused pixel-wise dataflow. Implemented as a Custom Function Unit (CFU) for a RISC-V processor, our architecture eliminates the need for intermediate buffers entirely, reducing data movement by up to 87% compared to conventional layer-by-layer execution. It computes a single output pixel to completion across all DSC stages (expansion, depthwise convolution, and projection) by streaming data through a tightly coupled pipeline without writing to memory. Evaluated on a Xilinx Artix-7 FPGA, our design achieves a speedup of up to 59.3x over baseline software execution on the RISC-V core. Furthermore, ASIC synthesis projects a compact 0.284 mm² footprint. This work confirms the feasibility of a zero-buffer dataflow within a TinyML resource envelope, offering a novel and effective strategy for overcoming the memory wall in edge AI accelerators.

Edge AI [1] involves running artificial intelligence algorithms directly on local hardware, such as sensors and Internet of Things (IoT) units, bringing computation to the source of data creation.
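The fused pixel-wise dataflow summarized above can be sketched in plain Python. This is a behavioral model only, not the paper's RTL: the tensor shapes, the ReLU6 placements (after expansion and depthwise, none after projection, as in MobileNetV2-style blocks), and the zero-padded stride-1 depthwise window are assumptions for illustration. The key property it demonstrates is that one output pixel is computed to completion while buffering only a single vector of expanded channels, never a full intermediate feature map.

```python
import numpy as np

def relu6(v):
    """Clamp activations to [0, 6], as in MobileNetV2-style blocks."""
    return np.minimum(np.maximum(v, 0.0), 6.0)

def fused_dsc_pixel(x, w_exp, w_dw, w_proj, oy, ox):
    """Compute one output pixel of an expansion -> 3x3 depthwise ->
    projection block to completion.  Only a per-channel depthwise
    accumulator is kept; no intermediate feature map is written.

    x:      (H, W, C_in) input feature map
    w_exp:  (C_in, C_exp) 1x1 expansion weights
    w_dw:   (3, 3, C_exp) depthwise weights
    w_proj: (C_exp, C_out) 1x1 projection weights
    """
    c_exp = w_exp.shape[1]
    dw_acc = np.zeros(c_exp)                    # per-channel accumulator
    for ky in range(3):                         # stream the 3x3 receptive field
        for kx in range(3):
            iy, ix = oy + ky - 1, ox + kx - 1   # stride 1, zero padding 1
            if 0 <= iy < x.shape[0] and 0 <= ix < x.shape[1]:
                expanded = relu6(x[iy, ix] @ w_exp)   # 1x1 expansion per tap
                dw_acc += expanded * w_dw[ky, kx]     # depthwise MAC
    return relu6(dw_acc) @ w_proj               # linear 1x1 projection
```

Because the projection is linear and applied only after the depthwise sum is complete, the result is identical to a layer-by-layer execution that materializes the full expanded feature map, which is what makes the zero-buffer fusion lossless.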
This allows for real-time processing without constant reliance on the cloud, an approach that offers several key benefits: low latency due to local processing, enhanced privacy by keeping sensitive data on the device, and reduced network bandwidth consumption, which enables reliable offline operation [2]. A critical subfield of this domain is Tiny Machine Learning (TinyML) [3], which specifically focuses on deploying machine learning models directly onto low-cost, ultra-low-power microcontrollers (MCUs) and embedded systems. These devices operate under severe constraints, often with power budgets in the milliwatt range and only a few hundred kilobytes of memory, making on-device intelligence a significant technical challenge. The typical TinyML workflow takes a fully trained model and optimizes it for on-device inference by applying techniques such as quantization and pruning, producing a smaller, more efficient model in a compact format.
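As a concrete illustration of the quantization step in that workflow, the following is a minimal sketch of symmetric per-tensor int8 post-training quantization, a common choice in TinyML deployment flows. The function names and the single-scale, zero-point-free scheme are illustrative assumptions, not a specific toolchain's API.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: a single float scale
    maps weights into [-127, 127], shrinking storage by ~4x vs float32."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 tensor."""
    return q.astype(np.float32) * scale
```

With this scheme the per-element reconstruction error is bounded by half the scale, which is what keeps accuracy loss small for well-conditioned weight tensors.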
Nov-27-2025