Data Cleaning

Machine Learning with Scikit-learn


This blog provides an overview of how to build a machine learning model, covering data pre-processing, splitting the data into training and testing sets, regression/classification, and finally model evaluation. Machine Learning (ML) is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions. ML systems are trained rather than explicitly programmed. In this blog we will implement various ML models with the help of scikit-learn (sklearn), a simple open-source machine learning library that provides efficient tools for data analysis, data pre-processing, model building, model evaluation, and much more.
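The workflow this snippet describes (pre-processing, train/test split, classification, evaluation) can be sketched in a few lines of scikit-learn. This is a minimal illustration rather than the blog's own code: the Iris dataset, the logistic regression model, and the helper name `train_and_evaluate` are all assumptions chosen for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_and_evaluate(test_size=0.25, seed=0):
    """Hypothetical helper: fit a classifier and return held-out accuracy."""
    # Load a small built-in tabular dataset (an assumption for this sketch).
    X, y = load_iris(return_X_y=True)

    # Split into training and testing data, keeping class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Pre-processing (scaling) and the model are chained so the scaler
    # is fitted on the training data only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Model evaluation on the unseen test split.
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions)


if __name__ == "__main__":
    print(f"Test accuracy: {train_and_evaluate():.3f}")
```

Putting the scaler inside the pipeline is a deliberate choice: it keeps statistics computed on the test split from leaking into training.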

What Is Data Preparation in a Machine Learning Project


Data preparation may be one of the most difficult steps in any machine learning project. The reason is that each dataset is different and highly specific to the project. Nevertheless, there are enough commonalities across predictive modeling projects that we can define a loose sequence of steps and subtasks that you are likely to perform. This process provides a context in which we can consider the data preparation required for the project, informed both by the definition of the project performed before data preparation and the evaluation of machine learning algorithms performed after. In this tutorial, you will discover how to consider data preparation as a step in a broader predictive modeling machine learning project.

Why Data Preparation Is So Important in Machine Learning


On a predictive modeling project, machine learning algorithms learn a mapping from input variables to a target variable. The most common form of predictive modeling project involves so-called structured data or tabular data. This is data as it looks in a spreadsheet or a matrix, with rows of examples and columns of features for each example. We cannot fit and evaluate machine learning algorithms on raw data; instead, we must transform the data to meet the requirements of individual machine learning algorithms. More than that, we must choose a representation for the data that best exposes the unknown underlying structure of the prediction problem to the learning algorithms in order to get the best performance given our available resources on a predictive modeling project. Given that we have standard implementations of highly parameterized machine learning algorithms in open source libraries, fitting models has become routine.
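As one possible illustration of transforming raw tabular data to meet an algorithm's requirements, the sketch below scales numeric columns and one-hot encodes a categorical column. The column names, the toy values, and the helper `build_preprocessor` are hypothetical; the source does not prescribe any particular transformation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Hypothetical helper: one transformer per column type."""
    return ColumnTransformer(
        [
            # Numeric features are rescaled to zero mean and unit variance.
            ("num", StandardScaler(), numeric_cols),
            # Categorical features become one indicator column per category.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )


# Toy "spreadsheet-like" data: rows are examples, columns are features.
raw = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "income": [40000.0, 52000.0, 88000.0, 61000.0],
        "city": ["Paris", "Lyon", "Paris", "Nice"],
    }
)

preprocessor = build_preprocessor(["age", "income"], ["city"])
features = preprocessor.fit_transform(raw)

# Two scaled numeric columns plus three one-hot city columns.
print(features.shape)
```

Which representation works best depends on the model and the problem, so a sketch like this is a starting point to iterate on, not a fixed recipe.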

Managing Data through the Lens of an Ontology

AI Magazine

While the amount of data stored in current information systems continuously grows, and the processes making use of such data become more and more complex, extracting knowledge and getting insights from these data, as well as governing both data and the associated processes, are still challenging tasks. The problem is complicated by the proliferation of data sources and services both within a single organization and in cooperating environments. Effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by the information technology industry today. Indeed, it is not surprising that data scientists spend a comparatively large amount of time in the data preparation phase of a project, compared with the data mining and knowledge discovery phase. Whether you call it data wrangling, data munging, or data integration, it is estimated that 50%-80% of a data scientist's time is spent on collecting and organizing data for analysis.

Basic Data Cleaning for Machine Learning (That You Must Perform)


Data cleaning is a critically important step in any machine learning project. In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform. Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results. In this tutorial, you will discover basic data cleaning you should always perform on your dataset.
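The snippet does not list which basic operations it has in mind. As an assumption for illustration, the sketch below shows two common candidates: dropping columns that hold a single value and dropping duplicate rows. The helper name `basic_clean` and the toy data are hypothetical.

```python
import pandas as pd


def basic_clean(df):
    """Hypothetical helper: drop constant columns, then duplicate rows."""
    # A column with a single unique value carries no information for a model.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    cleaned = df.drop(columns=constant_cols)

    # Exact duplicate rows can end up in both the train and test splits,
    # which is one way performance estimates become overly optimistic.
    cleaned = cleaned.drop_duplicates().reset_index(drop=True)
    return cleaned, constant_cols


data = pd.DataFrame(
    {
        "sensor_id": [1, 2, 2, 3],
        "reading": [0.5, 0.7, 0.7, 0.9],
        "unit": ["C", "C", "C", "C"],  # same value in every row
    }
)

cleaned, dropped = basic_clean(data)
print(dropped)
print(cleaned)
```

Returning the list of dropped columns alongside the cleaned frame makes it easy to apply the same column selection to new data later.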

Data Cleansing for Models Trained with SGD

Neural Information Processing Systems

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can identify influential instances without using any domain knowledge. The proposed algorithm automatically cleans the data, which does not require any of the users' knowledge. Hence, even non-experts can improve the models. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning.

Nonconvex Low-Rank Tensor Completion from Noisy Data

Neural Information Processing Systems

We study a completion problem of broad practical interest: the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on "incoherent" and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm, (vanilla) gradient descent following a rough initialization, that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all low-rank tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e. ...). The insights conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.

Data Wrangling in Pandas for Machine Learning Engineers


"The course is really impressive. Tons of information, and I learned a great deal. I had no Python background, and now I feel a lot more confident about working with Python than ever." "Honestly, Mike, your classes speak for themselves. They're informative, concise and just really well put together."

How to Clean Machine Learning Datasets Using Pandas ActiveState


The first step in any machine learning project is typically to clean your data by removing unnecessary data points, inconsistencies and other issues that could prevent accurate analytics results. Data cleansing can comprise up to 80% of the effort in your project, which may seem intimidating (and it certainly is if you attempt to do it by hand), but it can be automated. In this post, we'll walk through how to clean a dataset using Pandas, a Python open source data analysis library included in ActiveState's Python. All the code in this post can be found in my GitHub repository. If you already have Python installed, you can skip this step.
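To give a flavor of the automated cleaning the post refers to, here is a minimal Pandas sketch that normalizes inconsistent text values and removes rows whose numeric field cannot be parsed. It is not the post's own code: the column names `country` and `price`, the toy records, and the helper `clean_records` are hypothetical.

```python
import pandas as pd


def clean_records(df):
    """Hypothetical helper: normalize text and drop unusable rows."""
    cleaned = df.copy()

    # Resolve inconsistencies such as stray whitespace and mixed case.
    cleaned["country"] = cleaned["country"].str.strip().str.lower()

    # Convert to numbers; values that cannot be parsed become NaN.
    cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")

    # Remove rows that have no usable price.
    cleaned = cleaned.dropna(subset=["price"])
    return cleaned.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "country": [" France", "france ", "GERMANY", "Germany"],
        "price": ["10.5", "n/a", "7", None],
    }
)

cleaned = clean_records(raw)
print(cleaned)
```

Whether to drop or impute rows with missing values depends on the dataset; dropping is used here only because it is the simplest option to show.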

Polly: A Tool for Rapid Data Integration and Analysis in Support of Agricultural Research and Education


Data analysis and modeling is a complex and demanding task. While a variety of software and tools exist to cope with this problem and tame big data operations, most of these tools are either not free or, when they are, require a large amount of configuration and have a steep learning curve. Moreover, they provide limited functionality. In this paper we propose Polly, an online open-source tool for data analysis and modeling that is intuitive to use and can be used with minimal or no configuration. Users can use Polly to rapidly integrate and analyze their data, and to prototype and test their novel methodologies. Polly can also be used as an educational tool. Users can use Polly to upload or connect to their structured data sources, load the required data into our system and perform various data processing tasks. Examples of such operations include data cleaning, data pre-processing, attribute encoding, regression and classification analysis.