We are interested in finding what features play a significant role in determining the sale price of the house. We are going to set a threshold value and include all the numerical features whose correlation coefficient is greater than that threshold. Since LotFrontage falls below our threshold, we choose to drop it, and ignore the NA values present in that column. MasVnrArea and GarageYrBlt have 8 and 81 missing values, respectively. We can replace the missing values in the MasVnrArea column with the median of this column.
Exploratory Data Analysis, or EDA, is essentially a type of storytelling for statisticians. It allows us to uncover patterns and insights, often with visual methods, within data. EDA is often the first step of the data modelling process. In this phase, data engineers have some questions in hand and try to validate those questions by performing EDA. EDA may sound exotic if you are new to the world of statistics.
Founded in 2010, Kaggle is a Data Science platform where users can share, collaborate, and compete. One key feature of Kaggle is "Competitions", which offers users the ability to practice on real world data and to test their skills with, and against, an international community. This guide will teach you how to approach and enter a Kaggle competition, including exploring the data, creating and engineering features, building models, and submitting predictions. We'll follow these steps to a successful Kaggle Competition submission: We need to acquire the data for the competition. Download the data and save it into a folder where you'll keep everything you need for the competition. We will first look at the train.csv After we've trained a model, we'll make predictions using the test.csv
First of all, I need to import the following libraries. Then I will read the data into a pandas Dataframe. The original dataset contains 81 columns, but for the purposes of this tutorial, I will work with a subset of 12 columns. Details about the columns can be found in the provided link to the dataset. Please note that each row of the table represents a specific house (or observation).