- Facebook image/face tagging.
- Kinect motion detection.
- VR headset movement.
- Speech to text, text to speech.
- Swipe keyboard prediction.
- IoT robot dogs.
- Facebook ads.
- Amazon and Netflix use ML for recommender systems.
- Used in the field of medicine for detection.
- Used to explore new areas through space satellites.
- Explore new areas like Mars.
- From the dawn of time until 2005, humans created 130 exabytes of data.
- By 2010, that had grown to 1,200 exabytes.
- By 2015, it was 7,900 exabytes.
- By 2020, 40,900 exabytes of data will have been created.
Install Anaconda 4.2.0 if you are facing some compatibility issues.
The first data set contains 4 columns: country, age, salary and purchased.
So it’s a data set of a company’s customers, with each customer’s information and whether or not they purchased the product. The first 3 columns are the independent variables/features, and the last column is the dependent variable/label, which is what we need to predict.
Let’s begin by importing necessary libraries.
import numpy as np  # n-dimensional array math library
import matplotlib.pyplot as plt  # plotting/charting library
import pandas as pd  # importing and managing data sets
In Spyder’s file explorer, click the button that sets a folder as the working directory. Alternatively, save the .py file in the same folder as the CSV file and run it (F5); that folder then becomes the working directory.
dataset = pd.read_csv('data.csv')  # import the CSV file
You can verify the above import statement by looking at the variable explorer in Spyder. Now let’s create our matrix of features.
# Left of the comma selects the rows, right of it selects the columns.
# The last column is excluded because it is the label Y, not a feature.
X = dataset.iloc[:, :-1].values
# Indexes in Python start at 0, so the 4th column is index 3.
Y = dataset.iloc[:, 3].values
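To make the slicing concrete, here is a minimal sketch on a tiny stand-in table. The three rows are hypothetical values made up for illustration, not the contents of data.csv; only the four column names come from the description above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the tutorial's data.csv.
dataset = pd.DataFrame({
    "country": ["France", "Spain", "Germany"],
    "age": [44.0, 27.0, 30.0],
    "salary": [72000.0, 48000.0, 54000.0],
    "purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
Y = dataset.iloc[:, 3].values    # all rows, column index 3 (the 4th column)

print(X.shape)  # (3, 3): three rows, three feature columns
print(Y.shape)  # (3,): one label per row
```

Because the features mix text and numbers, X comes back as an object array, which is why the country column has to be encoded later.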
We can’t simply remove the rows that contain missing data, because they may hold other crucial values that we need. A safer approach is to fill in the missing entries using one of 3 methods: the mean, the median or the mode of the column. If you can’t see the full array in the variable explorer, add the code below.
import sys
np.set_printoptions(threshold=sys.maxsize)  # print arrays in full; recent NumPy versions reject np.nan here
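The cost of dropping rows can be checked directly. The sketch below counts missing entries per column and shows how many rows would survive a drop. The table is a hypothetical stand-in with made-up values, assumed only to share the column names of the tutorial's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; two of the four rows have a gap.
dataset = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "Spain"],
    "age": [44.0, np.nan, 30.0, 38.0],
    "salary": [72000.0, 48000.0, np.nan, 61000.0],
    "purchased": ["No", "Yes", "No", "Yes"],
})

missing_per_column = dataset.isna().sum()  # number of NaN entries in each column
rows_after_drop = len(dataset.dropna())    # rows left if every incomplete row is removed

print(missing_per_column)
print(rows_after_drop)  # 2: half of this toy table would be discarded
```

Here two missing cells would cost two whole rows, including their valid country and purchased values, which is the argument for filling the gaps instead.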
# Imputer belongs to scikit-learn releases before 0.22; later releases replace it with sklearn.impute.SimpleImputer.
from sklearn.preprocessing import Imputer
# Create an object of the above class.
# missing_values is 'NaN' because that is what shows up for the missing values in the variable explorer; axis = 0 means along the columns.
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])  # it's 3 because the upper bound is excluded
X[:, 1:3] = imputer.transform(X[:, 1:3])  # take the transformed values and write them back into the actual data
To inspect any imported method or class, place the cursor on it and press cmd + i to open the docs for that method/class.
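There are 3 strategies available to fill the missing values: mean, median and most_frequent (the mode).
axis = 0 means fill along the columns, so each gap is filled from the other values in its own column;
axis = 1 means fill along the rows.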
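The axis convention is the same one NumPy uses, so it can be illustrated without scikit-learn. This is a small sketch on a made-up array; `np.nanmean` is used here only to show which values each axis groups together, not as the imputer's actual implementation.

```python
import numpy as np

# Hypothetical 3x2 array with one missing value in the first column.
data = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

# axis=0 collapses the rows: one mean per column, ignoring NaN.
col_means = np.nanmean(data, axis=0)  # 2.0 and 20.0
# axis=1 collapses the columns: one mean per row, ignoring NaN.
row_means = np.nanmean(data, axis=1)  # 5.5, 20.0 and 16.5

print(col_means)
print(row_means)
```

For a table where each column is a different feature (age, salary), the column-wise mean is the meaningful one, since a row-wise mean would mix unrelated units.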
We fit the imputer only on the columns of X that contain missing data, hence X[:, 1:3], which means all the rows and columns 1 and 2.
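On scikit-learn 0.22 and later, the `Imputer` class is no longer available. A rough equivalent of the step above, assuming a recent scikit-learn, uses `SimpleImputer`, which always works column by column and therefore takes no `axis` argument. The array below is hypothetical stand-in data for the age and salary columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age and salary columns, each with one missing value.
X_num = np.array([
    [44.0, 72000.0],
    [np.nan, 48000.0],
    [30.0, np.nan],
    [38.0, 61000.0],
])

# strategy can also be "median" or "most_frequent".
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_num)  # learn the column means, then fill the gaps

print(imputer.statistics_)  # the per-column means that were used as fill values
print(X_filled)
```

In the tutorial's flow this would be applied to `X[:, 1:3]` and the result assigned back to the same slice.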
To run just a block of code in Spyder, select the block of new code and hit cmd + enter.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])  # replace the country names with integer codes
We use the fit_transform method because we want to fit and transform at the same time.
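As a quick check of what "fit and transform at the same time" means, the sketch below compares the two-step and one-step forms on a hypothetical country column (the values are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical country column for illustration.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# Two steps: learn the set of labels, then map each label to its integer code.
two_step = LabelEncoder()
two_step.fit(countries)
codes_two_step = two_step.transform(countries)

# One step: fit_transform does both in a single call.
one_step = LabelEncoder()
codes_one_step = one_step.fit_transform(countries)

print(two_step.classes_)  # labels in sorted order: France, Germany, Spain
print(codes_one_step)     # codes 0, 2, 1, 2
```

The codes are positions in the sorted list of labels, which is why France, Germany and Spain end up as 0, 1 and 2.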
Now that we have converted the categorical values France, Germany and Spain into the numbers 0, 1 and 2, we need to convert them to a one-hot encoding.
Otherwise the ML algorithm will treat the codes as ordered and think Germany is greater than France and Spain is greater than Germany, which is not the case.
So let’s import the OneHotEncoder class.
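from sklearn.preprocessing import Imputer, OneHotEncoder
...
# categorical_features lists the columns that need to be transformed: column 0 (country), the one label-encoded above.
# This parameter exists only in scikit-learn releases before 0.22.
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()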
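On recent scikit-learn releases, `OneHotEncoder` no longer accepts `categorical_features`. One way to get the same result, sketched here under the assumption that the country is in column 0, is to wrap the encoder in a `ColumnTransformer`. This is a substitute approach, not the tutorial's original call, and the rows are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: country, age, salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",                           # keep age and salary unchanged
    sparse_threshold=0,                                # always return a dense array
)
X_encoded = ct.fit_transform(X).astype(float)

print(X_encoded.shape)  # (3, 5): three one-hot columns, then age and salary
print(X_encoded)
```

The encoder works on the raw strings, so the separate label-encoding pass over column 0 is not needed with this approach. The one-hot columns follow the sorted category order (France, Germany, Spain), so no country is ranked above another.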
Now let’s apply label encoding to Y.
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
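Since the label has only two values, plain label encoding is enough here and no one-hot step is needed. A minimal sketch, assuming the purchased column holds "Yes"/"No" strings (the values below are made up), also shows how to map predictions back to the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical purchased column for illustration.
Y = np.array(["No", "Yes", "No", "Yes", "Yes"])

labelencoder_Y = LabelEncoder()
Y_encoded = labelencoder_Y.fit_transform(Y)

print(labelencoder_Y.classes_)  # sorted labels: No, Yes
print(Y_encoded)                # codes 0, 1, 0, 1, 1

# inverse_transform turns integer codes back into the original strings.
Y_restored = labelencoder_Y.inverse_transform(Y_encoded)
```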
# Splitting the dataset into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)  # 20% of the rows go to the test set
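To see what `test_size` and `random_state` do, here is a small sketch on placeholder arrays of ten rows (hypothetical data, not the tutorial's). With `test_size = 0.2`, two rows go to the test set, and fixing `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1] * 5)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
# Same random_state, so the second call produces the same split.
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)        # (8, 2) (2, 2)
print(np.array_equal(X_test, X_test2))    # True
```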