Impact of Comprehensive Data Preprocessing on Predictive Modelling of COVID-19 Mortality

Das, Sangita, Maji, Subhrajyoti

arXiv.org Artificial Intelligence

Accurate predictive models are crucial for analysing COVID-19 mortality trends. This study evaluates the impact of a custom data preprocessing pipeline on ten machine learning models predicting COVID-19 mortality using data from Our World in Data (OWID). Our pipeline differs from a standard preprocessing pipeline in four key steps. Firstly, it transforms weekly reported totals into daily updates, correcting reporting biases and providing more accurate estimates. Secondly, it uses localised outlier detection and processing to preserve data variance and enhance accuracy. Thirdly, it utilises computational dependencies among columns to ensure data consistency. Finally, it incorporates an iterative feature selection process to optimise the feature set and improve model performance. Results show a significant improvement with the custom pipeline: the MLP Regressor achieved a test RMSE of 66.556 and a test R-squared of 0.991, surpassing the Decision Tree Regressor from the standard pipeline, which had a test RMSE of 222.858 and a test R-squared of 0.817. These findings highlight the importance of tailored preprocessing techniques in enhancing predictive modelling accuracy for COVID-19 mortality. Although specific to this study, these methodologies offer insights applicable to diverse datasets and domains, improving predictive performance across various contexts.
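The first two preprocessing steps can be sketched in a few lines. The abstract does not specify the exact redistribution or outlier-handling methods, so the code below is a minimal illustration under two assumptions: weekly totals are split evenly across their seven days, and "localised" outlier processing means replacing points that deviate from a rolling-window median by more than a few local MADs (median absolute deviations).

```python
import statistics


def weekly_to_daily(weekly_totals):
    """Spread each weekly reported total evenly over its 7 days.

    A naive even split (assumption); the paper's actual
    redistribution method is not specified in the abstract.
    """
    daily = []
    for total in weekly_totals:
        base, rem = divmod(total, 7)
        # distribute any integer remainder over the first `rem` days
        daily.extend(base + (1 if i < rem else 0) for i in range(7))
    return daily


def clip_local_outliers(series, window=7, k=3.0):
    """Replace points deviating from the local median by more than
    k times the local MAD; points elsewhere are left untouched,
    which preserves the variance of the surrounding data.
    """
    cleaned = list(series)
    half = window // 2
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        neighbourhood = series[lo:hi]
        med = statistics.median(neighbourhood)
        mad = statistics.median(abs(x - med) for x in neighbourhood)
        if mad and abs(series[i] - med) > k * mad:
            cleaned[i] = med  # localised correction only at the flagged point
    return cleaned
```

Because only flagged points are replaced with a local median, the rest of the series keeps its original variance, in line with the second step's goal.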


Two Towers Model: A Custom Pipeline in Vertex AI Using Kubeflow

#artificialintelligence

MLOps is composed of Continuous Integration (CI -- code, unit testing, merge code), Continuous Delivery (CD -- build, test, release) and Continuous Training (CT -- train, monitor, measure, retrain, serve). Consider the following situation: you develop a solution that offers product search to users. New users arrive every minute and new products every day. In this situation we maintain an index of embeddings containing all the products, and user queries are submitted to this index as numerical vectors to retrieve the best results. This index is deployed in a container behind a Vertex AI endpoint.
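Conceptually, the deployed index maps a numerical query vector to the closest product embeddings. The sketch below shows the idea with a brute-force cosine-similarity search over a toy in-memory index; it is not the Vertex AI API (the managed service uses approximate nearest-neighbour search at scale), and the product ids and vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def search(index, query_vector, k=2):
    """Return ids of the k products whose embeddings best match the query."""
    ranked = sorted(
        index,
        key=lambda pid: cosine_similarity(index[pid], query_vector),
        reverse=True,
    )
    return ranked[:k]


# Toy index: product id -> embedding. In production this would be the
# product tower's output, served from the container behind the endpoint.
product_index = {
    "shoe": [1.0, 0.1],
    "boot": [0.9, 0.2],
    "hat":  [0.1, 1.0],
}
```

A query vector produced by the user tower, e.g. `search(product_index, [1.0, 0.0])`, then returns the ids of the nearest products.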


spaCy Version 3.0 Released: All Features & Specifications

#artificialintelligence

The 3.0 version has state-of-the-art transformer-based pipelines and pre-trained models in seventeen languages. The first version of spaCy was a preliminary release with little support for deep-learning workflows. The second version, however, introduced convolutional neural network models in seven different languages. The third version is a massive improvement over both of these versions. The 3.0 version has completely dropped support for Python 2 and only works on Python 3.6 or later.