A blog post by Fatih Ozturk.
At UrbanStat, before we start modeling, we review the initial data extensively. We believe preprocessing is one of the most important steps of building a machine learning model. The data sets we receive from our clients usually come from multiple systems, and before we can start modeling we typically need to join all of these different data sets together.
There is no silver bullet in this process; however, we will go over the most important steps in our risk scoring workflow:
1 – Check Operations:
- Drop Duplicates: When we receive training data sets, we usually see duplicate records. For example, when an insurance company issues an endorsement or an update for a policy, it may write the policy information to the database both before and after the change, and over time this results in multiple identical rows.
For Python users, it is really easy to drop duplicated rows in a pandas DataFrame.
Example code:
dataframe = dataframe.drop_duplicates()
- Drop Irrelevant Parts: Depending on your modeling purpose, you should check whether the data you receive contains parts that are completely irrelevant. For example, if your goal is to predict the claim probability of a 1-year policy and you got your data sets from an insurance company, you might find some policies with a 3-year lifetime. To keep the modeling project self-consistent, you have to discard the policies whose lifetime is not one year.
Example code:
dataframe = dataframe[dataframe['Policy_Duration'] == 1]
- Have Consistent Columns: For numerical columns, make sure the values are on a consistent scale. Look at summary statistics to catch unexpected trends, or check a histogram for strange spikes, outliers, or bins. For example, one client's data contained a 'Year_Built' column recording when each building was constructed. The column was numeric and most values looked like 1979, 1992, and so on, but the statistics revealed a considerable number of entries like '83' or '75'. We had to fix that column, since a predictive model would treat a value of 83 as lower than any properly entered year (1965, 1972, ...).
When it comes to categorical columns, especially those with few unique values, it is worth checking their unique values for strange words or typos. For example, a column might contain both 'Burglary' and 'burglary', and both represent the same threat, 'Burglary'. If you don't map both values to the same one, this will probably lower your modeling performance. To avoid such simple problems, you can lower-case all values of your categorical columns; you can also strip spaces or replace them with '_'. A short sketch of both fixes is shown below.
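As a rough sketch of both fixes, assuming the 'Year_Built' column from the example above and a hypothetical categorical column called 'Peril_Type', the cleanup could look like this. The rule that two-digit values belong to the 1900s is an assumption you would verify against your own data.
Example code:
# dataframe is the pandas DataFrame from the previous steps
# Fix two-digit years: values like 83 or 75 are assumed to mean 1983, 1975
dataframe['Year_Built'] = dataframe['Year_Built'].apply(
    lambda year: year + 1900 if year < 100 else year
)
# Normalize a categorical column (the name 'Peril_Type' is hypothetical):
# lower-case everything, strip whitespace, and replace spaces with '_'
dataframe['Peril_Type'] = (
    dataframe['Peril_Type'].str.lower().str.strip().str.replace(' ', '_')
)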
2 – Merging Multiple Datasets:
In real life, it is uncommon to have one single data set for any objective. For example, most insurance companies record their customers' policy, coverage, and claims information separately. So, depending on the objective of your analysis or machine learning model, you need to be able to merge these files on a unique key column that exists in each of them. Make sure you have this key column and can perform a proper join.
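For instance, with pandas you could join separate policy, coverage, and claims files on a shared key; the file names and the key column 'Policy_Number' below are hypothetical placeholders for whatever your data actually uses.
Example code:
import pandas as pd
# Load the separately recorded files (file names are hypothetical)
policies = pd.read_csv('policies.csv')
coverages = pd.read_csv('coverages.csv')
claims = pd.read_csv('claims.csv')
# Join on the shared key column; left joins keep every policy
# even if it has no matching coverage or claim record
dataframe = policies.merge(coverages, on='Policy_Number', how='left')
dataframe = dataframe.merge(claims, on='Policy_Number', how='left')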
3 – Handling Missing Values:
A data set completely free of missing values is not always needed for machine learning models. For example, if you use decision trees, they can handle missing values, and leaving them as null can give better performance from time to time. However, for neural networks and logistic regression, filling missing values is a must. When it comes to filling, you can treat the two column types in the following general ways (a short sketch follows the list below).
- Categorical columns:
- If you believe that missingness carries meaningful information for a feature, you can simply fill the missing values with a 'missing' string or anything fancier you want. Otherwise, you can fill with the most frequent value in that column.
- Numerical columns:
- Fill with the mean or median value of that column.
- Fill with a value that is outside the column's range (e.g. 999999, -999999, 0).
- Try to predict the missing values by treating that column as the target of a regression model.
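As a brief illustration of a few of these options in pandas (the column names 'Peril_Type' and 'Year_Built' are carried over from the earlier examples; which strategy fits a given column is a modeling decision):
Example code:
# Categorical column: keep "missingness" as its own category...
dataframe['Peril_Type'] = dataframe['Peril_Type'].fillna('missing')
# ...or fill with the most frequent value instead
# dataframe['Peril_Type'] = dataframe['Peril_Type'].fillna(dataframe['Peril_Type'].mode()[0])
# Numerical column: fill with the median (or mean) of the column...
dataframe['Year_Built'] = dataframe['Year_Built'].fillna(dataframe['Year_Built'].median())
# ...or with an out-of-range sentinel value
# dataframe['Year_Built'] = dataframe['Year_Built'].fillna(-999999)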
Some important columns require more attention than just filling in zeros or means, so you need to be more careful when handling them. To make this clear, let me give you another example. At UrbanStat, we come across missing values in the Latitude and Longitude columns of our clients' policy data sets. These features are quite important for us because we join external risk data sets on them. When we find missing values in those columns, our data team geocodes the records using their Address information and fills in the correct Longitude and Latitude values.
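Our geocoding is handled internally, but as a rough sketch of the idea, a publicly available geocoder such as geopy's Nominatim could fill missing coordinates from an Address column. The column names and the choice of service here are assumptions for illustration, not our production setup.
Example code:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='urbanstat_blog_example')
# Geocode only the rows whose coordinates are missing
# (column names 'Latitude', 'Longitude', 'Address' are hypothetical;
# in practice you would also rate-limit these requests)
missing = dataframe['Latitude'].isna()
for idx, address in dataframe.loc[missing, 'Address'].items():
    location = geolocator.geocode(address)
    if location is not None:
        dataframe.loc[idx, 'Latitude'] = location.latitude
        dataframe.loc[idx, 'Longitude'] = location.longitude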
The underlying reason for a validation failure or an under-fitting model could be that your data set was not cleaned thoroughly enough. These simple data cleaning steps could help you outperform your competitors.
To learn more, and to quickly leverage what we've already successfully deployed for our carrier partners, contact us at [email protected].