
Collaborating Authors

 Bartolini, Andrea


Experimenting with Emerging RISC-V Systems for Decentralised Machine Learning

arXiv.org Artificial Intelligence

Decentralised Machine Learning (DML) enables collaborative machine learning without centralised input data. Federated Learning (FL) and Edge Inference are examples of DML. While tools for DML (especially FL) are starting to flourish, many are not flexible and portable enough to experiment with novel processors (e.g., RISC-V), non-fully-connected network topologies, and asynchronous collaboration schemes. We overcome these limitations via a domain-specific language that allows us to map DML schemes onto an underlying middleware, i.e., the FastFlow parallel programming library. We experiment with it by generating different working DML schemes on x86-64 and ARM platforms and on an emerging RISC-V one. We characterise the performance and energy efficiency of the presented schemes and systems. As a byproduct, we introduce a RISC-V port of the PyTorch framework, to our knowledge the first publicly available one.
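
To make the FL flavour of DML concrete, the sketch below shows a federated-averaging (FedAvg) aggregation round in PyTorch. This is illustrative only: it is not the paper's DSL or its FastFlow mapping, and the function names (fedavg, client_weights) are assumptions, not the authors' API.

import copy
import torch

def fedavg(global_model, client_models, client_weights):
    """Average client parameters, weighted by local dataset sizes."""
    total = float(sum(client_weights))
    avg_state = copy.deepcopy(global_model.state_dict())
    for key in avg_state:
        # Weighted sum of each client's copy of this parameter/buffer.
        stacked = torch.stack([
            cm.state_dict()[key].float() * (w / total)
            for cm, w in zip(client_models, client_weights)
        ])
        avg_state[key] = stacked.sum(dim=0).to(avg_state[key].dtype)
    global_model.load_state_dict(avg_state)
    return global_model

In a decentralised or asynchronous scheme of the kind the paper targets, this aggregation step would run on whichever node the topology designates, rather than on a fixed central server.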


DECICE: Device-Edge-Cloud Intelligent Collaboration Framework

arXiv.org Artificial Intelligence

DECICE is a Horizon Europe project that is developing an AI-enabled, open, and portable management framework for the automatic and adaptive optimization and deployment of applications in the computing continuum, spanning from IoT sensors at the Edge to large-scale Cloud/HPC computing infrastructures. In this paper, we describe the DECICE framework and its architecture. Furthermore, we highlight use cases for framework evaluation: intelligent traffic intersection, magnetic resonance imaging, and emergency response.


Anomaly Detection using Autoencoders in High Performance Computing Systems

arXiv.org Artificial Intelligence

Anomaly detection in supercomputers is a very difficult problem due to the large scale of the systems and the large number of components. The current state of the art for automated anomaly detection employs Machine Learning methods or statistical regression models in a supervised fashion, meaning that the detection tool is trained to distinguish among a fixed set of behaviour classes (healthy and unhealthy states). We propose a novel approach for anomaly detection in High Performance Computing systems based on a Deep Learning technique, namely a type of neural network called an autoencoder. The key idea is to train a set of autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes and, after training, use them to identify abnormal conditions. This differs from previous approaches, which were based on learning the abnormal condition, for which there are much smaller datasets (since abnormal conditions are very hard to identify to begin with). We test our approach on a real supercomputer equipped with a fine-grained, scalable monitoring infrastructure that can provide a large amount of data to characterize the system behaviour. The results are extremely promising: after the training phase to learn the normal system behaviour, our method is capable of detecting anomalies that have never been seen before with very good accuracy (values ranging between 88% and 96%).
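
The core of the approach is to train on healthy data only and flag samples whose reconstruction error is high. A minimal PyTorch sketch of that idea follows; the architecture, hyperparameters, and threshold calibration are assumptions for illustration, not the paper's exact setup.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features, n_hidden=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def fit_on_healthy(model, healthy_loader, epochs=20, lr=1e-3):
    # Train only on healthy node telemetry, so the model learns to
    # reconstruct normal behaviour well and abnormal behaviour poorly.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for batch in healthy_loader:
            opt.zero_grad()
            loss = loss_fn(model(batch), batch)
            loss.backward()
            opt.step()

def is_anomalous(model, sample, threshold):
    # Threshold would be calibrated on held-out healthy data,
    # e.g., a high percentile of reconstruction errors.
    with torch.no_grad():
        err = torch.mean((model(sample) - sample) ** 2)
    return err.item() > threshold

One such autoencoder per node (as the abstract describes) lets each model specialise in that node's normal behaviour.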


Robust identification of thermal models for in-production High-Performance-Computing clusters with machine learning-based data selection

arXiv.org Machine Learning

Power and thermal management are critical components of High-Performance-Computing (HPC) systems, due to their high power density and large total power consumption. The assessment of thermal dissipation by means of compact models learned directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large-scale systems "in production", the accuracy of learned thermal models depends on the dynamics of the power excitation, which in turn depends on the executed workload, and on measurement non-idealities such as quantization. In this paper we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors' quantization step of 1 °C) for a large-scale HPC system on real workloads over very long time periods. However, we also show that: 1) not all real workloads allow for the identification of a good model; 2) starting from the theory of system identification, it is very difficult to evaluate whether a data trace leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for choosing the data traces to be used for model identification. We also show that deep learning techniques are necessary to correctly choose such traces in up to 96% of cases.
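
To see why the excitation in a trace matters, consider fitting the simplest compact thermal model, a first-order difference equation T[k+1] ≈ a·T[k] + b·P[k] + c, by least squares. The sketch below (plain NumPy, not the paper's advanced identification algorithm) illustrates the fit and the one-step error that would be compared against the 1 °C quantization step.

import numpy as np

def fit_thermal_model(T, P):
    """T, P: 1-D arrays of temperature [°C] and power [W] samples."""
    # Regressor matrix: previous temperature, previous power, constant.
    X = np.column_stack([T[:-1], P[:-1], np.ones(len(T) - 1)])
    coeffs, *_ = np.linalg.lstsq(X, T[1:], rcond=None)
    return coeffs  # (a, b, c)

def one_step_error(T, P, coeffs):
    a, b, c = coeffs
    pred = a * T[:-1] + b * P[:-1] + c
    return np.mean(np.abs(pred - T[1:]))

If the workload barely varies the power, the regressor matrix is poorly conditioned and the fit degrades, which is exactly the kind of uninformative trace the paper's learned trace-selection step is meant to reject.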


Optimization and Controlled Systems: A Case Study on Thermal Aware Workload Dispatching

AAAI Conferences

Although successfully employed on many industrial problems, Combinatorial Optimization still has limited applicability in several real-world domains, often due to modeling difficulties. This is typically the case for systems under the control of an on-line policy: even when the policy itself is well known, capturing its effect on the system in a declarative model is often impossible by conventional means. Such a difficulty is at the root of the classical, sharp separation between off-line and on-line approaches. In this paper, we investigate a general method to model controlled systems, based on the integration of Machine Learning and Constraint Programming (CP). Specifically, we use an Artificial Neural Network (ANN) to learn the behavior of a controlled system (a multicore CPU with thermal controllers) and plug it into a CP model by means of Neuron Constraints. The method obtains significantly better results compared to an approach with no ANN guidance. Neuron Constraints were first introduced in [Bartolini et al., 2011b] as a means to model complex systems: providing evidence of their applicability to controlled systems is a significant step forward, broadening the application field of combinatorial methods and opening up opportunities for hybrid off-line/on-line optimization.
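
At the heart of embedding a trained ANN in a CP model is bound propagation through each neuron, so the solver can prune assignments inconsistent with the network's output range. The toy sketch below shows that filtering idea for a single neuron y = tanh(w·x + b) in plain Python; it is an assumed illustration of the mechanism, not the actual Neuron Constraint propagator from [Bartolini et al., 2011b].

import math

def neuron_bounds(w, b, x_bounds):
    """x_bounds: list of (lo, hi) per input; returns (lo, hi) for the output.

    tanh is monotonically increasing, so the output bounds follow
    directly from the bounds of the weighted sum.
    """
    lo = hi = b
    for wi, (xlo, xhi) in zip(w, x_bounds):
        lo += wi * (xlo if wi >= 0 else xhi)
        hi += wi * (xhi if wi >= 0 else xlo)
    return math.tanh(lo), math.tanh(hi)

# Example: a two-input neuron. Tightening the input bounds tightens
# the admissible output range, which a CP solver uses for pruning.
print(neuron_bounds([0.8, -0.5], 0.1, [(0.0, 1.0), (0.0, 1.0)]))

In the paper's setting, the inputs would be decision variables of the workload-dispatching model (e.g., per-core loads) and the neuron outputs would predict the thermal response under the on-line controllers.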