Python - Working with predictive analytics

We will look at how companies predict sales, real estate agents predict housing prices, and insurance companies predict healthcare costs.

Predictive analytics roadmap

Let’s look at the Cross-Industry Standard Process for Data Mining, or CRISP-DM for short. Below are its logical steps, in order.

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Business Understanding

First we need to define the goal. What is our business and what are we trying to predict? For example: how many products will we sell?

Data Understanding

Once you have defined a concrete problem statement, you need a thorough understanding of the data you are working with. In this stage you have to make sure you have all the data required to answer the question. If not, acquiring the data or analyzing alternatives must be done in this stage.

Data Preparation

Once you have collected all the data needed to build the model, you will need to clean it and prepare it for the model to run. This includes accounting for missing values, removing outliers, converting categorical data with one-hot encoding, feature scaling, extracting the necessary columns, etc.

Modeling

Once the data is prepared, you need to split it into a training set and a test set. Then you can feed the training data into the algorithms and use the test data to test your trained model.

Evaluation

Once your model is trained, you have to evaluate it using the test data. There are many evaluation metrics. Often the model will not be perfectly trained on the first attempt, so you may end up going back and forth between the Modeling and Evaluation stages.

Deployment

Once we have the right trained model, we find a way to deploy it and start using it in a real-world scenario.
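One simple way to do that with scikit-learn models is to persist the trained estimator with joblib and load it inside the serving application. This is a minimal sketch, not a full deployment pipeline; trained_model, new_data, and the file name are placeholders.

import joblib

# Save the fitted model to disk (`trained_model` stands for any fitted scikit-learn estimator)
joblib.dump(trained_model, 'insurance_model.joblib')

# Later, inside the application that serves predictions
loaded_model = joblib.load('insurance_model.joblib')
prediction = loaded_model.predict(new_data)  # new_data must be preprocessed exactly like the training data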

Note: Many times there is also a human factor involved in analyzing model accuracy, such as subject matter knowledge. So the evaluation metrics alone can't be everything.

Data can be of 2 major types, which can be further broken down:

  1. Numerical

    1. Interval
    2. Ratio
  2. Categorical

    1. Nominal
    2. Ordinal

We can categorize data depending upon the operations we can perform on it. For example, suppose a landscaping business needs to predict how many bulldozers are required to remove trees in an area, what paint color is needed, and what other equipment is required to decorate the area. Then we would require data such as the color of the house, the door number of the house, the temperature of the house, and the area of the house.

The color of the house can be blue, red, or green, which is Categorical: Nominal data, as it cannot be added, multiplied, compared (< or >), or ordered.

The door number of a house can be #1201, #1202, #1203, which comes under Categorical: Ordinal. It cannot be added/subtracted or multiplied/divided because that wouldn't make any sense, but it can be ordered in ascending or descending order.

The temperature of the house can be 23°, 24°, or -3°, which can be added/subtracted, compared, and ordered, but cannot be meaningfully multiplied/divided because the scale has no true zero: 0° is just another valid temperature, not the absence of temperature, so ratios do not make sense. Hence it comes under Numerical: Interval.

The area of a house can be 800 sq.ft, 1200 sq.ft, or 2400 sq.ft, which can be multiplied and divided as well, because area has a true zero point and cannot be negative, so ratios are meaningful: a 2400 sq.ft house really is twice as large as a 1200 sq.ft one. Hence it comes under Numerical: Ratio.

Below is a summary of the data types and the operations that can be performed on them.

Data       == or !=   < or >   + or -   * or /
Nominal    Yes        No       No       No
Ordinal    Yes        Yes      No       No
Interval   Yes        Yes      Yes      No
Ratio      Yes        Yes      Yes      Yes

Prediction models need numbers, hence we need to convert categorical data into numbers.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt # plotting library

data = pd.read_csv('./insurance.csv') # Load the insurance dataset
print(data.head(15)) # Print the top 15 rows of data from the loaded dataset

# Check how many missing values (NaN) are in each column
count_nan = data.isnull().sum()
# Print the column names which contains NaN
print(count_nan[count_nan > 0])

# Fill in the missing values using the mean of the column
data['bmi'].fillna(data['bmi'].mean(), inplace=True) # inplace=True modifies the DataFrame itself

# Check if there are any NaN still present
count_nan = data.isnull().sum()
# Print the column names which contains NaN
print(count_nan[count_nan > 0])

To convert categorical data into numbers, we can do one of 2 things:

  1. Label Encoding - Two distinct values (Binomial)
  2. One Hot Encoding - Three or more distinct values

Label Encoding

We will be using NumPy ndarrays to do the label encoding, but it can also be done using pandas Series.

from sklearn.preprocessing import LabelEncoder

# create nd array for LabelEncoding
sex = data.iloc[:, 1:2].values
smoker = data.iloc[:, 4:5].values

# perform LabelEncoding for sex
le = LabelEncoder()
sex[:, 0] = le.fit_transform(sex[:, 0])
sex = pd.DataFrame(sex)
sex.columns = ['sex']
le_sex_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("sklearn label encoder results for sex")
print(le_sex_mapping)
print(sex[:10])

# perform LabelEncoding for smoker
le = LabelEncoder()
smoker[:, 0] = le.fit_transform(smoker[:, 0])
smoker = pd.DataFrame(smoker)
smoker.columns = ['smoker']
le_smoker_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("sklearn label encoder results for smoker")
print(le_smoker_mapping)
print(smoker[:10])

One Hot Encoding

We will be using NumPy ndarrays to do the one-hot encoding, but it can also be done using pandas Series.

from sklearn.preprocessing import OneHotEncoder


# create nd array for OneHotEncoding
region = data.iloc[:, 5:6].values

# perform OneHotEncoding for region
ohe = OneHotEncoder()
region = ohe.fit_transform(region).toarray()
region = pd.DataFrame(region)
region.columns = ['northeast', 'northwest', 'southeast', 'southwest']
print("sklearn one hot encoding results for region")
print(region[:10])

Train test split

Once we have built the individual DataFrames the way we want, we need to combine them into a single DataFrame and then split it into a train set and a test set. The train set will be used to train the model and the test set will be used to evaluate the trained model. We also need to define the independent (features) and dependent (output) variables.

# Take the numerical data from the original data
X_num = data[['age', 'bmi', 'children']]
# Take the encoded data and add it to the numerical data
X_final = pd.concat([X_num, sex, smoker, region], axis=1)
# Take the charges column from the original data and assign it as the y_final
y_final = data[['charges']].copy()
# Do the train test split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=0)

Feature Scaling

The values in each column might be in different ranges, which means the model will treat some columns as more important than others simply because their values are larger, which may not be the case. So we need to bring the numerical values in all the columns onto the same scale. This process is called feature scaling, and there are 2 common types: Normalization and Standardization.

By doing feature scaling we primarily achieve 2 things:

  1. We stop features with larger values from dominating the model, so features with smaller values also play a role.
  2. We dilute the bias coming from outliers.

Normalization

This method is more sensitive to outliers.

z = \frac{x - \min(x)}{\max(x) - \min(x)}
from sklearn.preprocessing import MinMaxScaler

n_scaler = MinMaxScaler()
X_train = n_scaler.fit_transform(X_train.astype(float))
X_test = n_scaler.transform(X_test.astype(float))

Standardization

This method is less sensitive to outliers.

z = \frac{x - \mu}{\sigma}

where \mu = \text{mean} and \sigma = \text{standard deviation}.

from sklearn.preprocessing import StandardScaler

s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(float))
X_test = s_scaler.transform(X_test.astype(float))

Modeling

There are 3 kinds of modeling:

  1. Supervised Learning

    1. Regression
    2. Classification
  2. Unsupervised Learning
  3. Reinforcement Learning

Supervised Learning

Regression

Regression predicts a numerical value. Example: a house sales website predicts the price of a house. Use regression if the predicted output is numerical.

Classification

Classification predicts a categorical variable. Example: a bank categorizes whether an expense is fraudulent or not. Use classification if the predicted output is a category.

Unsupervised Learning

Clustering

Discover the inherent groupings in the data. Example: analyze customer behavior by grouping customers by their purchasing patterns without prior knowledge of the groups.
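As a purely illustrative sketch (the purchase numbers below are made up and are not from the insurance dataset), k-means clustering from scikit-learn can group such customers without any labels:

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customer data: [purchases per year, average basket value]
customers = np.array([[5, 20], [6, 22], [50, 200], [55, 210], [20, 80], [22, 85]])

# Group the customers into 3 clusters; no labels are given, the groupings are discovered from the data
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(customers)
print(labels)                    # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # the center (mean) of each discovered group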

Association

Discover rules that describe portions of the data. Example: analyze customer behavior, as in "customers who buy product A also buy product B".

Commonly used Regression Models

The most commonly used regression models are:

  1. Linear Regression
  2. Polynomial Regression
  3. Support Vector Regression
  4. Decision Tree
  5. Random Forest Regression

Linear Regression

from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

# Printing score
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("lr train score: %.3f, lr test score: %.3f" % (lr.score(X_train, y_train), lr.score(X_test, y_test)))

Polynomial Regression

The degree of a Polynomial Regression model controls the balance between under-fitting and over-fitting: a degree that is too low under-fits, while a degree that is too high over-fits.

from sklearn.preprocessing import PolynomialFeatures
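The import alone does not build a model. Below is a minimal sketch (assuming the X_final and y_final DataFrames created earlier, and an arbitrary degree of 2) that expands the features into polynomial terms and fits a linear regression on top of them:

# test train split
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final, test_size=0.33, random_state=0)

# expand the features into polynomial terms, then fit an ordinary linear regression on them
pf = PolynomialFeatures(degree=2)  # higher degrees over-fit more easily
X_train_poly = pf.fit_transform(X_train.astype(float))
X_test_poly = pf.transform(X_test.astype(float))

pr = LinearRegression().fit(X_train_poly, y_train)
y_train_pred = pr.predict(X_train_poly)
y_test_pred = pr.predict(X_test_poly)

# print score
print("pr train score: %.3f, pr test score: %.3f" % (pr.score(X_train_poly, y_train), pr.score(X_test_poly, y_test)))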

Support Vector Regression (SVR)

Support vector machines were first used for classification problems as SVC (Support Vector Classification). Later the approach was also applied to regression problems as SVR (Support Vector Regression) to predict numerical data.

In an SVM classifier the plane that separates the 2 classes is called the hyperplane. The data points sitting closest to the hyperplane are called the support vectors, and the planes passing through those closest points define the margins. If we do not allow any points inside the margin area it is called a hard margin; if we do allow some, it is called a soft margin.

Kernel functions help to separate the 2 classes by projecting the data into a higher dimension, so that it is easier to separate them with a hyperplane rather than a simple 2D line.

  • SVR is very similar to SVC.
  • Output is a continuous number rather than a category.
  • The goal is to minimize the error and find the narrowest margin interval that contains the maximum number of data points.

Commonly used kernel functions:

  • Linear
  • RBF (Radial basis function)
  • Polynomial
  • Exponential
from sklearn.svm import SVR

svr = SVR(kernel="linear", C=300)

# test train split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=0)

# standard scaler (fit_transform on train, transform only on test)
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(float))
X_test = sc.transform(X_test.astype(float))

# fit model
svr = svr.fit(X_train, y_train.values.ravel())
y_train_pred = svr.predict(X_train)
y_test_pred = svr.predict(X_test)

# print score
print('svr train score %.3f, svr test score: %.3f' % (
    svr.score(X_train, y_train),
    svr.score(X_test, y_test)))

Decision tree

A decision tree model creates a tree-structured decision model, but it tends to overfit. Decision trees do not require data scaling.

from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=0)

# test train split
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final, test_size=0.33, random_state=0)

# standard scaler (fit_transform on train, transform only on test)
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(float))
X_test = sc.transform(X_test.astype(float))

# fit model
dt = dt.fit(X_train, y_train.values.ravel())
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

# print score
print("dt train score: %.3f, dt test score: %.3f" % (dt.score(X_train, y_train), dt.score(X_test, y_test)))

Random forest regression

Random forest regression is a type of ensemble learning, which means multiple learning methods are used simultaneously. It is much like the previous decision tree model, but instead of using a single tree we use multiple trees and take the opinion of all of them into consideration. Random forest does not require data scaling.

The main difference between a regression tree and a classification tree is that a regression tree outputs a number, with the predicted value calculated from the mean (average value), while a classification tree outputs a category, with the predicted value calculated from the mode (most frequent value).
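A tiny illustration of the two rules, with made-up leaf values:

import numpy as np
from statistics import mode

# A regression tree leaf predicts the mean of the target values that fell into it
leaf_values = [200, 210, 215, 210, 400]
print(np.mean(leaf_values))   # 247.0

# A classification tree leaf predicts the mode (most frequent class) of the labels that fell into it
leaf_classes = ['fraud', 'ok', 'ok', 'ok', 'fraud']
print(mode(leaf_classes))     # 'ok'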

Bagging

Bagging, or bootstrap aggregating, draws multiple random samples (with replacement) from the dataset: we take the large dataset, create smaller bootstrap samples from it, apply a machine learning model to each one, and finally aggregate their predictions.
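scikit-learn exposes this idea directly through BaggingRegressor. Below is a minimal sketch, assuming the X_train, X_test, y_train, y_test variables from the sections above and using the default decision tree as the base model:

from sklearn.ensemble import BaggingRegressor

# Train 10 models, each on a bootstrap sample of the training data, and average their predictions
bag = BaggingRegressor(n_estimators=10, random_state=0)
bag = bag.fit(X_train, y_train.values.ravel())
print("bagging train score: %.3f, bagging test score: %.3f" % (bag.score(X_train, y_train), bag.score(X_test, y_test)))

The random forest below applies the same bootstrap idea to decision trees, and additionally considers a random subset of features at each split.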

from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of trees and criterion is the measure used to evaluate the quality of a split
# n_jobs is the number of processors to run in parallel for both training and prediction; -1 means use all available processors
forest = RandomForestRegressor(n_estimators = 100, criterion = "squared_error", random_state = 1, n_jobs = -1)  # criterion was called "mse" in older scikit-learn versions

# test train split
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final, test_size=0.33, random_state=0)

# standard scaler (fit_transform on train, transform only on test)
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(float))
X_test = sc.transform(X_test.astype(float))

# fit model
forest.fit(X_train, y_train.values.ravel())
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)

# print score
print("forest train score: %.3f, forest test score: %.3f" % (forest.score(X_train, y_train), forest.score(X_test, y_test)))

Hyperparameter optimization

Hyperparameter optimization is the process of finding the set of parameters that gives a prediction algorithm its optimum performance.

Hyperparameter optimization methods

  • Grid search
  • Random search
  • Bayesian optimization
  • Gradient based optimization
  • Evolutionary optimization
  • Population based
from sklearn.model_selection import train_test_split, GridSearchCV

def print_best_params(gd_model):
  param_dict = gd_model.best_estimator_.get_params()
  model_str = str(gd_model.estimator).split('(')[0]
  print("\n*** {} Best Parameters ***".format(model_str))
  for k in param_dict:
    print("{} {}".format(k, param_dict[k]))
  print()

# test train split
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final, test_size=0.33, random_state=0)

# standard scaler (fit_transform on train, transform only on test)
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(float))
X_test = sc.transform(X_test.astype(float))

# SVR parameter grid
param_grid_svr = dict(kernel=['linear', 'poly'],
                      degree=[2],
                      C=[600, 700, 800, 900],
                      epsilon=[0.0001, 0.00001, 0.000001])
svr = GridSearchCV(SVR(), param_grid=param_grid_svr, cv=5, verbose=3)


# fit model
svr = svr.fit(X_train, y_train.values.ravel())

# print score
print('\n\n svr train score %.3f, svr test score: %.3f' % (
    svr.score(X_train, y_train),
    svr.score(X_test, y_test)))
print_best_params(svr)
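Random search is often cheaper than grid search when the parameter space is large: instead of trying every combination, it samples a fixed number of them. Below is a minimal sketch using RandomizedSearchCV on the random forest from earlier; the parameter ranges are illustrative, not tuned values.

from sklearn.model_selection import RandomizedSearchCV

# Try 10 random combinations from the parameter space instead of the full grid
param_dist_forest = dict(n_estimators=[50, 100, 200, 400],
                         max_depth=[None, 4, 6, 8],
                         min_samples_leaf=[1, 2, 5])
forest_search = RandomizedSearchCV(RandomForestRegressor(random_state=1),
                                   param_distributions=param_dist_forest,
                                   n_iter=10, cv=5, random_state=0)

# fit model
forest_search = forest_search.fit(X_train, y_train.values.ravel())

# print score
print('forest random search train score %.3f, test score: %.3f' % (
    forest_search.score(X_train, y_train),
    forest_search.score(X_test, y_test)))
print_best_params(forest_search)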
