Just as the name implies, data science is a branch of science that applies the scientific method to data, with the goal of studying the relationships between different features and drawing meaningful conclusions from those relationships. Data is therefore the key component in data science. A dataset is a particular instance of data used for analysis or model building at a given time. Datasets come in different flavors, such as numerical, categorical, text, image, audio, and video data. A dataset can be static (unchanging) or dynamic (changing over time, for example, stock prices).
This problem statement comes from a Kaggle recruitment challenge by Allstate, an insurance company that serves over 16 million households in the USA. Allstate wants to reduce the complexity of the insurance claims process and make it a worry-free experience for customers by automating the prediction of claim severity. To that end, the company released a dataset and asked participants to apply machine learning algorithms to predict the costs, and hence the severity, of claims accurately.
Kaggle is a great place to learn from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. I recently had my first shot at Kaggle and ranked 98th (top 5%) among 2,125 teams. Since this was my Kaggle debut, I feel quite satisfied. Many Kaggle beginners set the top 10% as their first goal, so here I want to share my experience of achieving that goal. Most Kagglers use Python or R. I prefer Python, but R users should have no difficulty understanding the ideas behind the tools and languages. First, let's go through some facts about Kaggle competitions in case you are not very familiar with them. Different competitions involve different tasks: classification, regression, recommendation, ordering, and so on. The training and test sets become available for download once a competition launches.
A value count on the target label shows that only 3.5% of the transactions are labeled fraudulent; typically, fraudulent transactions make up only a small percentage of all transactions. Correlation helps you understand the linear relationship between pairs of features, and between each feature and the target. A correlation coefficient ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship. Visualizing the data also helps with feature selection by revealing trends in the data.
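The two checks described above can be sketched in pandas. This is a minimal example on synthetic data; the column names (`amount`, `n_prior_txns`, `is_fraud`) are illustrative placeholders, not taken from the original dataset.

```python
import numpy as np
import pandas as pd

# Synthetic transaction table; columns are hypothetical stand-ins.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "amount": rng.exponential(scale=100, size=n),
    "n_prior_txns": rng.integers(0, 50, size=n),
})
# Label ~3.5% of rows as fraudulent to mimic the class imbalance.
df["is_fraud"] = (rng.random(n) < 0.035).astype(int)

# Value counts on the target reveal the imbalance as proportions.
print(df["is_fraud"].value_counts(normalize=True))

# Pearson correlation of each feature with the target, each in [-1, 1].
print(df.corr()["is_fraud"].sort_values())
```

With `normalize=True`, `value_counts` returns proportions rather than raw counts, which makes the imbalance immediately readable.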
Welcome to the third article in a five-part series about machine learning. In this article, we'll continue our machine learning discussion and focus on problems associated with overfitting, as well as controlling model complexity, an introduction to model evaluation and errors, model validation and tuning, and improving model performance. Overfitting is one of the greatest concerns in predictive analytics and machine learning. It refers to a situation where the model fits the training data too well, essentially capturing all of its noise, outliers, and so on. The consequence is that the model fits the training data very well but does not accurately predict cases not represented by that data, and therefore does not generalize well to unseen data.
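Overfitting can be demonstrated in a few lines of NumPy. The sketch below (synthetic data, hypothetical setup) fits polynomials of increasing degree to noisy samples of a sine curve and compares training error against held-out error: the high-degree model drives training error near zero while its held-out error stays worse, which is exactly the failure to generalize described above.

```python
import numpy as np

# Noisy samples of a smooth underlying function.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + rng.normal(scale=0.3, size=40)

# Simple hold-out split: every other point goes to validation.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

results = {}
for degree in (1, 3, 15):
    # Fit a polynomial of the given degree by least squares.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (mse_train, mse_test)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

The degree-15 fit has enough flexibility to chase the noise in the 20 training points, so its training error is far lower than the linear fit's, but its error on the held-out points is markedly larger than its own training error, the signature of an overfit model.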