ML Model Training


Towards Cooperative Federated Learning over Heterogeneous Edge/Fog Networks

Wang, Su, Hosseinalipour, Seyyedali, Aggarwal, Vaneet, Brinton, Christopher G., Love, David J., Su, Weifeng, Chiang, Mung

arXiv.org Artificial Intelligence

Federated learning (FL) has been promoted as a popular technique for training machine learning (ML) models over edge/fog networks. Traditional implementations of FL have largely neglected the potential for inter-network cooperation, treating edge/fog devices and other infrastructure participating in ML as separate processing elements. Consequently, FL has been vulnerable to several dimensions of network heterogeneity, such as varying computation capabilities, communication resources, data qualities, and privacy demands. We advocate for cooperative federated learning (CFL), a cooperative edge/fog ML paradigm built on device-to-device (D2D) and device-to-server (D2S) interactions. Through D2D and D2S cooperation, CFL counteracts network heterogeneity in edge/fog networks by enabling a model/data/resource pooling mechanism, which yields substantial improvements in ML model training quality and network resource consumption. We propose a set of core methodologies that form the foundation of D2D and D2S cooperation and present preliminary experiments that demonstrate their benefits. We also discuss new FL functionalities enabled by this cooperative framework, such as the integration of unlabeled data and heterogeneous device privacy into ML model training. Finally, we describe some open research directions at the intersection of cooperative edge/fog networking and FL.
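The "model pooling" idea underlying D2S cooperation can be pictured with the standard federated-averaging step: each device trains locally, and a server aggregates the resulting parameters weighted by local dataset size. This is a minimal generic sketch, not the CFL paper's actual methodology or API.

```python
def federated_average(local_weights, local_sizes):
    """Aggregate per-device parameter vectors into a global model,
    weighting each device by the size of its local dataset (FedAvg-style)."""
    total = sum(local_sizes)
    dim = len(local_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(local_weights, local_sizes):
        for i in range(dim):
            global_w[i] += (n / total) * w[i]
    return global_w
```

A device holding three times as much data pulls the global model three times as hard toward its local parameters, which is exactly where heterogeneity in data quality starts to matter.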


Multi-Edge Server-Assisted Dynamic Federated Learning with an Optimized Floating Aggregation Point

Ganguly, Bhargav, Hosseinalipour, Seyyedali, Kim, Kwang Taik, Brinton, Christopher G., Aggarwal, Vaneet, Love, David J., Chiang, Mung

arXiv.org Artificial Intelligence

We propose cooperative edge-assisted dynamic federated learning (CE-FL). CE-FL introduces a distributed machine learning (ML) architecture in which data collection is carried out at the end devices, while model training is conducted cooperatively at the end devices and the edge servers, enabled via data offloading from the end devices to the edge servers through base stations. CE-FL also introduces a floating aggregation point: the local models generated at the devices and the servers are aggregated at an edge server that varies from one training round to another, to cope with the network's evolution in terms of data distribution and user mobility. CE-FL accounts for the heterogeneity of network elements in terms of their communication/computation models and their proximity to one another. It further presumes a dynamic environment, where online variation of the data at the network devices causes drift in ML model performance. We model the processes carried out in CE-FL and conduct an analytical convergence analysis of its ML model training. We then formulate network-aware CE-FL, which aims to adaptively optimize all the network elements by tuning their contributions to the learning process; this turns out to be a non-convex mixed-integer problem. Motivated by the large scale of the system, we propose a distributed optimization solver that breaks the computation of the solution down across the network elements. We finally demonstrate the effectiveness of our framework with data collected from a real-world testbed.
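At its simplest, the floating aggregation point amounts to re-selecting, each round, the edge server that is cheapest to reach from the current participants. The cost matrix and selection rule below are a hypothetical stand-in for the paper's full network-aware optimization, which jointly tunes many more variables.

```python
def pick_aggregation_server(costs):
    """costs[s][d]: communication cost from device d to server s this round.
    Return the index of the edge server with the lowest total cost."""
    totals = [sum(per_device) for per_device in costs]
    return min(range(len(totals)), key=totals.__getitem__)
```

Because device mobility and data distributions change the cost matrix between rounds, the chosen server "floats" from round to round.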


Best MLOps workflow to upscale ML lifecycles

#artificialintelligence

The machine learning life cycle is a cyclical process that data science initiatives go through. Machine learning encompasses a wide range of disciplines, from business roles to data scientists and DevOps. The life cycle specifies each step an organization or individual should take to extract tangible commercial value from machine learning. A detailed grasp of the ML model development life cycle will let you properly manage resources and get a better idea of where you stand in the process. MLOps, short for Machine Learning Operations, is a key stage in the design of a data science project.


MLOps & Machine Learning Pipeline Explained - Medi-AI

#artificialintelligence

MLOps is a compound term that combines "machine learning" and "operations." The role of MLOps, then, is to provide a communication conduit between the data scientists who work with machine learning data and the operations team that manages the project. To do so, MLOps applies the type of cloud-native practices used in DevOps to machine learning (ML) services, specifically continuous integration/continuous deployment (CI/CD). Although both ML services and normal cloud-native apps are written in (ok, result in) software, there is more to ML services than just code. While cloud-native apps require source version control, automated unit/load testing, A/B testing, and final deployment, MLOps adds a data pipeline, ML model training, and more complex deployment with special-purpose logging and monitoring capabilities.
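The contrast above — code-only CI/CD versus a data-plus-model pipeline — can be sketched as a minimal staged pipeline. The stage names and the toy least-squares model are illustrative, not any particular MLOps product's API.

```python
def validate_data(rows):
    # Data-pipeline stage: drop records with missing fields --
    # a step plain CI/CD for application code never needs.
    return [r for r in rows if r.get("x") is not None and r.get("y") is not None]

def train(rows):
    # Model-training stage: fit y = w * x by least squares (toy model).
    sx2 = sum(r["x"] ** 2 for r in rows)
    sxy = sum(r["x"] * r["y"] for r in rows)
    return sxy / sx2

def evaluate(w, rows):
    # Monitoring stage: compute a metric a deployment gate could check.
    return sum((r["y"] - w * r["x"]) ** 2 for r in rows) / len(rows)

def run_pipeline(rows):
    clean = validate_data(rows)
    w = train(clean)
    return w, evaluate(w, clean)
```

A CI/CD system would run something like `run_pipeline` on every data or code change and promote the model only if the evaluation metric clears a threshold.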


AWS Sagemaker Workflow Management with Airflow

#artificialintelligence

In this article, I will talk about my experience scheduling a data science project's notebooks on AWS SageMaker instances using Airflow. We have been using Netflix's papermill library to run Jupyter notebooks in production for more than two years now, and every day tens of SageMaker notebook instances are orchestrated by Airflow, working like a charm. You will read about the general architectural design of this system, how it works day to day, how roles and responsibilities are split between teams, and how you can implement it yourself. It all started with me reading an article on the Netflix blog about running Jupyter notebook files with external parameters to productionize data science workloads. This could be the solution to a common problem I had faced at my previous company, where we were running Apache Spark applications using PySpark and other Python code for data science and reporting projects on AWS EMR.
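The "external parameters" trick boils down to papermill's CLI shape, `papermill input.ipynb output.ipynb -p name value`. The helper below is a hypothetical convenience (not part of papermill or Airflow) that builds that command line the way an Airflow task might before shelling out.

```python
def papermill_command(input_nb, output_nb, parameters):
    """Build a papermill CLI invocation, passing each parameter
    via papermill's -p flag so the notebook runs with injected values."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        cmd += ["-p", name, str(value)]
    return cmd
```

An orchestrator can then hand the resulting list to a subprocess call, so the same notebook runs with different dates, datasets, or hyperparameters per scheduled task.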


Reducing bias in ML model training

#artificialintelligence

Stratified random sampling is a method of sampling that involves the division of a population into smaller sub-groups known as strata. Simple random sampling with replacement is then applied within each stratum. The members of each sub-group should be distinct, so that every member of every group gets an equal opportunity to be selected with simple probability. This sampling method is also called proportional random sampling.
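The procedure above — partition into strata, then sample within each stratum with replacement, allocating draws proportionally — can be sketched as follows. The function name and the proportional-allocation rounding are my own choices for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(population, strata_key, n, seed=0):
    """Draw a proportional stratified sample of size ~n.
    strata_key maps each item to its stratum; sampling within
    each stratum is simple random sampling with replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[strata_key(item)].append(item)
    total = len(population)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / total)  # proportional allocation
        sample += [rng.choice(members) for _ in range(k)]  # with replacement
    return sample
```

With a population that is 80% class A and 20% class B, a sample of 10 will contain 8 A items and 2 B items, which is how stratification reduces class-imbalance bias relative to plain random sampling.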


10 Machine Learning Model Training Mistakes - AI Summary

#artificialintelligence

By Sandeep Uttamchandani, Ph.D., both a product/software builder (VP of Engineering) and a leader of enterprise-wide Data/AI initiatives (CDO). In this article, I share the ten deadly sins of ML model training -- these are the most common as well as the easiest to overlook. During model training, there are scenarios where the loss-epoch graph keeps bouncing around and does not seem to converge, irrespective of the number of epochs. There is no silver bullet, as there are multiple root causes to investigate -- bad training examples, missing ground truths, changing data distributions, too high a learning rate. The most common one I have seen is bad training examples, stemming from a combination of anomalous data and incorrect labels. The more the same data is used for parameter and hyperparameter tuning, the lower the confidence that the results will actually generalize.
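The last point — reusing the same data for hyperparameter tuning erodes confidence in generalization — is usually addressed with a three-way split: tune on a validation set and touch the test set only once. A minimal sketch, with fraction values and function name chosen for illustration:

```python
import random

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out disjoint validation and test sets.
    Hyperparameters are tuned against val; test is evaluated only once."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test
```

Every hyperparameter choice scored on `val` "uses up" a little of its statistical freshness, which is exactly why the held-out `test` set must stay untouched until the end.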


10 Deadly Sins of ML Model Training

#artificialintelligence

During model training, there are scenarios where the loss-epoch graph keeps bouncing around and does not seem to converge, irrespective of the number of epochs. There is no silver bullet, as there are multiple root causes to investigate -- bad training examples, missing ground truths, changing data distributions, too high a learning rate. The most common one I have seen is bad training examples, stemming from a combination of anomalous data and incorrect labels. Sometimes the model seems to be converging, but then the loss suddenly increases significantly, i.e., the loss value decreases and then spikes as the epochs progress. There are multiple reasons for this kind of exploding loss.
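One standard mitigation for this kind of exploding loss — a general technique, not one the article prescribes — is gradient clipping: when the gradient's L2 norm exceeds a threshold, rescale it before the update so a single bad batch cannot blow up the parameters.

```python
import math

def clip_gradient(grad, max_norm):
    """Scale the gradient vector down so its L2 norm is at most max_norm;
    leave it unchanged if it is already within the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad
```

The clipped gradient keeps its direction; only its magnitude is capped, so normal training steps are unaffected while outlier steps are tamed.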


Improve Your ML Models Training

#artificialintelligence

Deep learning has found its way into all kinds of research areas and has also become an integral part of our lives. "Artificial Intelligence is the new electricity." However, with any great technical breakthrough comes a large number of challenges too. This article will help you clear one of these hurdles: optimization. We know that if we set the learning rate too small, the algorithm will take too long to converge fully, and if it is too large, the algorithm will diverge instead of converging.
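The learning-rate tradeoff is easy to see on the toy objective f(x) = x², whose gradient is 2x: gradient descent converges when the step size is small enough and diverges otherwise. The function below is an illustrative sketch, not from the article.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x^2 with gradient 2x; return the final iterate.
    Each step multiplies x by (1 - 2*lr), so |1 - 2*lr| < 1 converges."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x
```

With lr = 0.1 the iterate shrinks toward the minimum at 0; with lr = 1.1 each step overshoots and the iterate grows without bound, which is the divergence the article warns about.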