Dealing with categorical features in machine learning
Categorical data are commonplace in many Data Science and Machine Learning problems, but they are usually more challenging to deal with than numerical data. In particular, many machine learning algorithms require numerical input, so categorical features must be transformed into numerical features before any of these algorithms can be used. One of the most common ways to make this transformation is to one-hot encode the categorical features, especially when there is no natural ordering between the categories (e.g. a feature 'City' with names of cities such as 'London', 'Lisbon', 'Berlin', etc.). For each unique value of a feature (say, 'London'), one column is created (say, 'City_London') whose value is 1 if the original feature takes that value for a given instance and 0 otherwise. Even though this type of encoding is used very frequently, it can be frustrating to implement with scikit-learn in Python, as there isn't currently a simple transformer to apply, especially if you want to use it as a step of your machine learning pipeline.
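To make the encoding described above concrete, here is a minimal sketch in plain Python. It is not the scikit-learn approach the article goes on to discuss; the function name `one_hot_encode` and the sample city list are hypothetical, chosen only to mirror the 'City' example, and the columns are sorted alphabetically as an assumption to keep the output deterministic.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) for a list of categorical values.

    Illustrative helper only; not part of any library.
    """
    # Sort the unique categories so the column order is deterministic.
    categories = sorted(set(values))
    # One column per unique category, named '<prefix>_<category>'.
    columns = [f"{prefix}_{category}" for category in categories]
    # Each row has a 1 in the column matching its category and 0 elsewhere.
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return columns, rows


# Hypothetical sample data mirroring the 'City' feature.
cities = ["London", "Lisbon", "Berlin", "London"]
columns, rows = one_hot_encode(cities, "City")

print(columns)
for city, row in zip(cities, rows):
    print(city, row)
```

Each row contains exactly one 1, in the column for that instance's city. In practice the set of categories would need to be learned from the training data and reused at prediction time, which is what a pipeline-friendly transformer is meant to handle.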
Jul-17-2019, 13:03:35 GMT