ram usage
TinyDéjàVu: Smaller Memory Footprint & Faster Inference on Sensor Data Streams with Always-On Microcontrollers
Huang, Zhaolan, Baccelli, Emmanuel
Always-on sensors are increasingly expected to embark a variety of tiny neural networks and to continuously perform inference on time-series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware uses microcontrollers (MCUs) with tiny memory budget e.g., 128kB of RAM. In this context, optimizing data flows across neural network layers becomes crucial. In this paper, we introduce TinyDéjàVu, a new framework and novel algorithms we designed to drastically reduce the RAM footprint required by inference using various tiny ML models for sensor data time-series on typical microcontroller hardware. We publish the implementation of TinyDéjàVu as open source, and we perform reproducible benchmarks on hardware. We show that TinyDéjàVu can save more than 60% of RAM usage and eliminate up to 90% of redundant compute on overlapping sliding window inputs.
Unsupervised Anomaly Detection for Smart IoT Devices: Performance and Resource Comparison
Sami, Md. Sad Abdullah, Abid, Mushfiquzzaman
The rapid expansion of Internet of Things (IoT) deployments across diverse sectors has significantly enhanced operational efficiency, yet concurrently elevated cybersecurity vulnerabilities due to increased exposure to cyber threats. Given the limitations of traditional signature-based Anomaly Detection Systems (ADS) in identifying emerging and zero-day threats, this study investigates the effectiveness of two unsupervised anomaly detection techniques, Isolation Forest (IF) and One-Class Support Vector Machine (OC-SVM), using the TON_IoT thermostat dataset. A comprehensive evaluation was performed based on standard metrics (accuracy, precision, recall, and F1-score) alongside critical resource utilization metrics such as inference time, model size, and peak RAM usage. Experimental results revealed that IF consistently outperformed OC-SVM, achieving higher detection accuracy, superior precision, and recall, along with a significantly better F1-score. Furthermore, Isolation Forest demonstrated a markedly superior computational footprint, making it more suitable for deployment on resource-constrained IoT edge devices. These findings underscore Isolation Forest's robustness in high-dimensional and imbalanced IoT environments and highlight its practical viability for real-time anomaly detection.
Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows
Montserrat, Daniel Mas, Verma, Ray, Barrabés, Míriam, de la Vega, Francisco M., Bustamante, Carlos D., Ioannidis, Alexander G.
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient par-allelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML
Huang, Zhaolan, Baccelli, Emmanuel
AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget e.g., 128kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.
A2Perf: Real-World Autonomous Agents Benchmark
Uchendu, Ikechukwu, Jabbour, Jason, Berghe, Korneel Van den, Runevic, Joel, Stewart, Matthew, Ma, Jeffrey, Krishnan, Srivatsan, Gur, Izzeddin, Huang, Austin, Bishop, Colton, Bailey, Paige, Jiang, Wenjie, Songhori, Ebrahim M., Guadarrama, Sergio, Tan, Jie, Terry, Jordan K., Faust, Aleksandra, Reddi, Vijay Janapa
Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and inference, among other requirements. Several methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is a lack of benchmarking suites that define the environments, datasets, and metrics which can be used to provide a meaningful way for the community to compare progress on applying these methods to real-world problems. We introduce A2Perf--a benchmark with three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. Using A2Perf, we demonstrate that web navigation agents can achieve latencies comparable to human reaction times on consumer hardware, reveal reliability trade-offs between algorithms for quadruped locomotion, and quantify the energy costs of different learning approaches for computer chip-design. In addition, we propose a data cost metric to account for the cost incurred acquiring offline data for imitation learning and hybrid algorithms, which allows us to better compare these approaches. A2Perf also contains several standard baselines, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy. As an open-source benchmark, A2Perf is designed to remain accessible, up-to-date, and useful to the research community over the long term.
EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools
Hasanpour, Mohammad Amin, Kirkegaard, Mikkel, Fafoutis, Xenofon
The integration of artificial intelligence (AI) into embedded devices, a paradigm known as embedded artificial intelligence (eAI) or tiny machine learning (TinyML), is transforming industries by enabling intelligent data processing at the edge. However, the many tools available in this domain leave researchers and developers wondering which one is best suited to their needs. This paper provides a review of existing eAI tools, highlighting their features, trade-offs, and limitations. Additionally, we introduce EdgeMark, an open-source automation system designed to streamline the workflow for deploying and benchmarking machine learning (ML) models on embedded platforms. EdgeMark simplifies model generation, optimization, conversion, and deployment while promoting modularity, reproducibility, and scalability. Experimental benchmarking results showcase the performance of widely used eAI tools, including TensorFlow Lite Micro (TFLM), Edge Impulse, Ekkono, and Renesas eAI Translator, across a wide range of models, revealing insights into their relative strengths and weaknesses. The findings provide guidance for researchers and developers in selecting the most suitable tools for specific application requirements, while EdgeMark lowers the barriers to adoption of eAI technologies.
Comparing neural network training performance between Elixir and Python
Tavano, Lucas C., Amin, Lucas K., Serra-Seca-Neto, Adolfo Gustavo
With a wide range of libraries focused on the machine learning market, such as TensorFlow, NumPy, Pandas, Keras, and others, Python has made a name for itself as one of the main programming languages. In February 2021, Jos\'e Valim and Sean Moriarity published the first version of the Numerical Elixir (Nx) library, a library for tensor operations written in Elixir. Nx aims to allow the language be a good choice for GPU-intensive operations. This work aims to compare the results of Python and Elixir on training convolutional neural networks (CNN) using MNIST and CIFAR-10 datasets, concluding that Python achieved overall better results, and that Elixir is already a viable alternative.
iRobot's Experience in Running ROS2 on Linux-Based Embedded Platforms
During ROSCon 2019 Alberto Soragna, Juan Oxoby, and Dhiraj Goel from iRobot presented their experience in using Robot Operating System 2 (ROS 2) on a low-cost embedded platform. By experimenting with different Data Distribution Service (DDS) implementations they reduced the CPU and memory usage of their application, which improved performance. As iRobot is creating consumer robots with low-cost embedded platforms in them, they investigated the use of ROS 2 on their embedded hardware. ROS 2 works well on desktop computers and on microcontrollers, but it is more difficult to use ROS on small Linux computers. A core challenge that iRobot faced during development is that they have approximately 1000 robots connected to the same network during their prototyping phase.