The Iris dataset was introduced by the biologist Ronald Fisher and contains 50 samples each of 3 species of Iris (Iris setosa, Iris virginica and Iris versicolor), for a total of 150 data points.
Four features were recorded for each sample: petal length, petal width, sepal length and sepal width. Using these four features, we will differentiate between the species.
import numpy as np
import pandas as pd
#plotting the data
import matplotlib.pyplot as plt
import seaborn as sns
#preprocessing the dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
#dimensionality reduction
from sklearn.decomposition import PCA
#sample selection/training and test set selection
from sklearn.model_selection import train_test_split
#classification algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
#ensemble classification algorithms
from sklearn.ensemble import RandomForestClassifier
#analysis of classification algorithms
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
#remove warning to make the notebook prettier
import warnings
warnings.filterwarnings('ignore')
#plot graphs in the notebook
%matplotlib inline
iris = pd.read_csv("../dataset/IRIS.csv")
iris.shape
iris.columns
#view 5 random samples from the dataset
iris.sample(5)
iris['species'].value_counts()
iris.info()
All the feature columns are numeric, and our target column, the species of the flower, is categorical.
iris.describe().T
describe() gives us the summary statistics for each feature: the count, mean, standard deviation and the five-number summary.
1) Petal length has a range of 1.0-6.9 and the highest standard deviation of the four features. We can treat it as an important feature, since a higher standard deviation means a higher variance, implying it carries more information than the other features.
2) Sepal width has the lowest standard deviation, so we can consider it a less informative feature.
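To check this ranking directly rather than reading it off the table, here is a minimal sketch (using the same DataFrame loaded above) that sorts the per-feature standard deviations; note this is only a rough proxy for informativeness, since it ignores feature scale and correlations:
#rank the features by standard deviation (a rough proxy for spread/information)
print(iris.describe().T['std'].sort_values(ascending=False))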
iris.isna().sum()
There are no null values in the dataset, and no zero values either (we could already see that from the describe output), so we do not need to impute anything.
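If we wanted to confirm the absence of zeroes programmatically instead of by eye, a small sketch on the raw (still unscaled) DataFrame would be:
#count zero values per numeric column; every count should be 0 for this dataset
print((iris.select_dtypes(include='number') == 0).sum())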
Since the number of features is small (4), we can plot a pairplot.
# pairplot creates its own figure, so a preceding plt.figure(figsize=...) has no effect; control the size via height instead
sns.pairplot(data=iris, hue='species', height=2.5)
To interpret the plot, we only need to look at the lower or the upper triangle, since the two halves mirror each other.
1) sepal length vs sepal width
* Iris setosa is almost separable, while Iris versicolor and Iris virginica are highly colocated.
2) sepal length vs petal length
* Iris setosa is completely distinct, and Iris versicolor and Iris virginica are almost linearly separable.
3) sepal length vs petal width
* Iris setosa is again completely separable, while Iris versicolor and Iris virginica are colocated.
4) sepal width vs petal length
* Iris setosa is completely separable, while Iris versicolor and Iris virginica are colocated.
5) sepal width vs petal width
* Iris setosa is completely separable, while Iris versicolor and Iris virginica are colocated.
6) petal length vs petal width
* Iris setosa is completely separable, while Iris versicolor and Iris virginica are colocated.
Looking at the frequency distributions on the diagonal, we see that Iris setosa's petal measurements are clearly set apart from those of the other two species, whose distributions overlap.
Let us group the data by the species of the flowers
iris_species = iris.groupby('species')
iris_species.describe().T
We see that the per-species description complements the inferences we made from the pairplot.
Now that we have explored the data, we can apply machine learning algorithms to differentiate between the three Iris species. Before that, we need all of our data to be numeric, since the models and metrics below work with numeric values, so we will label encode our target variable, 'species'.
Since this is a classification problem, we will use classification algorithms such as KNN, SVM and Decision Tree, an ensemble algorithm (the Random Forest Classifier), and K-fold cross-validation on top of it all.
We will not use logistic regression here because plain logistic regression is designed for binary targets, whereas we have three classes.
If we had null or categorical values in the feature columns, we would have done that preprocessing before the visualization step, depending on the problem statement.
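Since the K-fold step is only mentioned and not shown explicitly later, here is a minimal hedged sketch of what it could look like with sklearn's cross_val_score (which uses stratified K-fold splitting for classifiers by default); the KNN model and the choice of 5 folds are illustrative assumptions, not part of the workflow below:
from sklearn.model_selection import cross_val_score
#illustration only: 5-fold cross-validation of a KNN classifier on the raw (unscaled) features
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                            iris.drop(columns='species'), iris['species'], cv=5)
print(cv_scores, cv_scores.mean())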
#Label encoding the target
label_encoder = LabelEncoder()
iris['species'] = label_encoder.fit_transform(iris['species'])
print(iris['species'].value_counts())
iris.info()  # info() prints its summary directly and returns None, so wrapping it in print() would only add a stray 'None'
#we plot the correlation heatmap after encoding, since 'species' was categorical and correlations need numeric values
plt.figure(figsize=(8,6))
sns.heatmap(iris.corr(), annot=True)
From the heatmap we can judge which features matter most. If some features were highly correlated with the target (>0.7 or <-0.7), we would choose those as our features, since they affect the target the most; we would then check the feature-to-feature correlations and, within each highly correlated pair, keep the feature with the higher variance. For now we will keep all four features and revisit dimensionality reduction with PCA later.
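As a hedged sketch of that selection rule (the 0.7 cut-off comes from the text above and is not a universal threshold), this is one way to list the features whose absolute correlation with the encoded target exceeds it:
#correlation of each feature with the (label-encoded) target
corr_with_target = iris.corr()['species'].drop('species')
#keep features with |correlation| > 0.7 as candidate strong features
strong_features = corr_with_target[corr_with_target.abs() > 0.7].index.tolist()
print(corr_with_target)
print(strong_features)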
Now that all our feature and target columns are numeric, we can scale them and then apply machine learning algorithms to distinguish between the Iris species.
We will MinMax scale the data so that every feature lies in the same 0-1 range. (Unlike StandardScaler, MinMaxScaler is actually quite sensitive to outliers, but the describe output showed there are none here, so that is not a concern.)
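For reference, MinMax scaling just applies x_scaled = (x - min) / (max - min) per column; a quick sketch on the first feature column (run before the in-place scaling below) shows the formula the scaler computes:
#manual MinMax scaling of the first feature column: (x - min) / (max - min)
col = iris.iloc[:, 0]
manual = (col - col.min()) / (col.max() - col.min())
print(manual.min(), manual.max())  # should be 0.0 and 1.0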
#collect the feature column names (everything except the target)
all_features = [c for c in iris.columns if c != 'species']
print(all_features)
minmax_scaler = MinMaxScaler()
iris[all_features] = minmax_scaler.fit_transform(iris[all_features])
iris.sample(8)
Now that our data is scaled and every feature value lies between 0 and 1, it will be easier for the distance-based models (KNN and SVM) to learn from the data; tree-based models are largely insensitive to scaling. We can continue with model building now.
#splitting data into features and target
features = iris.iloc[:,0:4]
target = iris.iloc[:,4]
#random_state is fixed at 42 (the answer to life, the universe and everything in The Hitchhiker's Guide to the Galaxy) so the split is reproducible; any fixed value would do
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state = 42)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
Now that we have split the dataset into training and testing data, we will use all 4 features and apply our algorithms to it and see if they work well. If the result is not satisfactory, we will use feature selection methods.
decision_tree_classifier = DecisionTreeClassifier(min_impurity_decrease = 0.05)
# the smaller min_impurity_decrease is, the less the tree is pruned and the higher the chance of overfitting
decision_tree_classifier.fit(x_train, y_train)
decision_tree_train_score = decision_tree_classifier.score(x_train, y_train)
decision_tree_test_score = decision_tree_classifier.score(x_test, y_test)
print(decision_tree_train_score, decision_tree_test_score)
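To see how this pruning parameter behaves, here is a small illustrative sweep on the same split (the candidate values and the fixed random_state are assumptions, not tuned choices):
#illustrative only: train/test accuracy for a few min_impurity_decrease values
for mid_value in [0.0, 0.01, 0.05, 0.1]:
    tree = DecisionTreeClassifier(min_impurity_decrease=mid_value, random_state=42)
    tree.fit(x_train, y_train)
    print(mid_value, tree.score(x_train, y_train), tree.score(x_test, y_test))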
For classifiers, .score returns the mean accuracy on the given data (it is regressors whose .score returns the R^2 coefficient of determination). We will also compute the accuracy explicitly with sklearn's accuracy_score.
decision_tree_y_pred = decision_tree_classifier.predict(x_test)
decision_tree_accuracy = accuracy_score(y_test, decision_tree_y_pred)  # argument order is (y_true, y_pred)
decision_tree_accuracy
From the accuracy score, we see that our model performs well, at about 96% accuracy; and since the test accuracy is not suspiciously close to 100%, it does not appear to have overfit.
decision_tree_mse = mean_squared_error(y_test, decision_tree_y_pred)
decision_tree_rmse = np.sqrt(decision_tree_mse)
print(decision_tree_mse, decision_tree_rmse)
The low MSE and RMSE values of the model further support our high accuracy.
decison_tree_confusion_matrix = confusion_matrix(y_test, decision_tree_y_pred)
decison_tree_confusion_matrix
decison_tree_classification_report = classification_report(y_test, decision_tree_y_pred)
print(decison_tree_classification_report)
From the confusion matrix, we can read off the true positives, true negatives, false positives and false negatives for each class, and since most of the counts are perfect, we should keep the possibility of overfitting in mind.
The classification report likewise shows perfect precision on two of the classes, which adds to the suspicion of overfitting.
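To make the confusion matrix easier to read, here is a purely presentational sketch that plots it as a heatmap with the original species names recovered from the label encoder (the figure size is an arbitrary choice):
#plot the confusion matrix with human-readable class names on the axes
plt.figure(figsize=(6,4))
sns.heatmap(decison_tree_confusion_matrix, annot=True, fmt='d',
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()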
knn_classifier = KNeighborsClassifier(n_neighbors = 3) #a small odd k; note that k is unrelated to the number of target classes
knn_classifier.fit(x_train, y_train)
knn_train_score = knn_classifier.score(x_train, y_train)
knn_test_score = knn_classifier.score(x_test, y_test)
print(knn_train_score, knn_test_score)
knn_y_pred = knn_classifier.predict(x_test)
knn_accuracy = accuracy_score(y_test, knn_y_pred)
knn_accuracy
knn_mse = mean_squared_error(y_test, knn_y_pred)
knn_rmse = np.sqrt(knn_mse)
print(knn_mse, knn_rmse)
There are zero errors on the test set; a "perfect" result like this usually means the model has overfit or that the test set is too small to expose any mistakes.
knn_confusion_matrix = confusion_matrix(y_test, knn_y_pred)
knn_confusion_matrix
knn_classification_report = classification_report(y_test, knn_y_pred)
print(knn_classification_report)
Though KNN is a great classification algorithm and makes near-perfect predictions here, the 100% test accuracy leads us to believe that the model has overfit, or that we simply do not have enough data.
sv_classifier = SVC(kernel='rbf')
sv_classifier.fit(x_train, y_train)
sv_train_score = sv_classifier.score(x_train, y_train)
sv_test_score = sv_classifier.score(x_test, y_test)
print(sv_train_score, sv_test_score)
sv_y_pred = sv_classifier.predict(x_test)
sv_accuracy = accuracy_score(y_test, sv_y_pred)
sv_accuracy
sv_mse = mean_squared_error(y_test, sv_y_pred)
sv_rmse = np.sqrt(sv_mse)
print(sv_mse, sv_rmse)
Though SVMs are really powerful, the 0 error implies our support vector classifier has overfit.
sv_confusion_matrix = confusion_matrix(y_test, sv_y_pred)
sv_confusion_matrix
sv_classification_report = classification_report(y_test, sv_y_pred)
print(sv_classification_report)
Since the accuracy is 100% again, we must conclude that the model has overfit. We will have to use ensembling techniques to work around this problem.
pca = PCA(0.95) #keep enough principal components to explain 95% of the variance in the data
pca_features = pca.fit_transform(features)
pca_features.shape
We see that PCA has reduced the data to 2 components. Unlike with a Random Forest's feature importances, we cannot point to specific original columns, because PCA does not select features: it constructs new components, each a linear combination of all the features, ordered by how much variance they explain.
pca.explained_variance_
pca.explained_variance_ratio_
We see that the explained variance of the retained components is high; PCA orders components by the variance they explain and keeps the top ones.
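To see how much each original feature contributes to the retained components, here is a small sketch inspecting the loadings (pca.components_ has one row per component and one column per original feature):
#loadings: each row expresses a principal component in terms of the original features
loadings = pd.DataFrame(pca.components_, columns=all_features,
                        index=['PC{}'.format(i + 1) for i in range(pca.n_components_)])
print(loadings)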
x_train_pca, x_test_pca, y_train_pca, y_test_pca = train_test_split(pca_features, target, test_size=0.2, random_state=42)
print(x_train_pca.shape, x_test_pca.shape, y_train_pca.shape, y_test_pca.shape)
dtree_classifier_pca = DecisionTreeClassifier(min_impurity_decrease=0.05)
dtree_classifier_pca.fit(x_train_pca, y_train_pca)
dtree_pca_train_score = dtree_classifier_pca.score(x_train_pca, y_train_pca)
dtree_pca_test_score = dtree_classifier_pca.score(x_test_pca, y_test_pca)
print(dtree_pca_train_score, dtree_pca_test_score)
dtree_pca_y_pred = dtree_classifier_pca.predict(x_test_pca)
dtree_pca_accuracy = accuracy_score(y_test_pca, dtree_pca_y_pred)
dtree_pca_accuracy
dtree_pca_mse = mean_squared_error(y_test_pca, dtree_pca_y_pred)
dtree_pca_rmse = np.sqrt(dtree_pca_mse)
print(dtree_pca_mse, dtree_pca_rmse)
The really low MSE and RMSE values show that our model is pretty good at classifying the Iris flower species.
confusion_matrix_dtree_pca = confusion_matrix(y_test_pca, dtree_pca_y_pred)
confusion_matrix_dtree_pca
classification_report_dtree_pca = classification_report(y_test_pca, dtree_pca_y_pred)
print(classification_report_dtree_pca)
We see that the accuracy is 96% which implies that our model has not overfit.
knn_classifier_pca = KNeighborsClassifier(n_neighbors=3)
knn_classifier_pca.fit(x_train_pca, y_train_pca)
knn_pca_train_score = knn_classifier_pca.score(x_train_pca, y_train_pca)
knn_pca_test_score = knn_classifier_pca.score(x_test_pca, y_test_pca)
print(knn_pca_train_score, knn_pca_test_score)
knn_pca_y_pred = knn_classifier_pca.predict(x_test_pca)
knn_pca_accuracy = accuracy_score(y_test_pca, knn_pca_y_pred)
knn_pca_accuracy
knn_pca_mse = mean_squared_error(y_test_pca, knn_pca_y_pred)
knn_pca_rmse = np.sqrt(knn_pca_mse)
print(knn_pca_mse, knn_pca_rmse)
Our KNN on the PCA features shows a larger error and a lower accuracy than our decision tree model on the PCA features.
confusion_matrix_knn_pca = confusion_matrix(y_test_pca, knn_pca_y_pred)
confusion_matrix_knn_pca
classification_report_knn_pca = classification_report(y_test_pca, knn_pca_y_pred)
print(classification_report_knn_pca)
The perfect scores in the report again suggest overfitting; with a dataset this small, a handful of test samples can easily be classified perfectly without the model truly generalising.
sv_classifier_pca = SVC(kernel='rbf')
sv_classifier_pca.fit(x_train_pca, y_train_pca)
svc_pca_train_score = sv_classifier_pca.score(x_train_pca, y_train_pca)
svc_pca_test_score = sv_classifier_pca.score(x_test_pca, y_test_pca)
print(svc_pca_train_score, svc_pca_test_score)
svc_pca_y_pred = sv_classifier_pca.predict(x_test_pca)
svc_pca_accuracy = accuracy_score(y_test_pca, svc_pca_y_pred)
svc_pca_accuracy
svc_pca_mse = mean_squared_error(y_test_pca, svc_pca_y_pred)
svc_pca_rmse = np.sqrt(svc_pca_mse)
print(svc_pca_mse, svc_pca_rmse)
Again, our support vector classifier has 0 error implying overfitting.
confusion_matrix_svc_pca = confusion_matrix(y_test_pca, svc_pca_y_pred)
confusion_matrix_svc_pca
classification_report_svc_pca = classification_report(y_test_pca, svc_pca_y_pred)
print(classification_report_svc_pca)
SVC is a very powerful algorithm but a 100% accuracy still means that it has overfit.
rfc_pca = RandomForestClassifier(n_estimators=20, min_samples_split=15, min_impurity_decrease=0.05)
rfc_pca.fit(x_train_pca, y_train_pca)
rfc_pca_train_score = rfc_pca.score(x_train_pca, y_train_pca)
rfc_pca_test_score = rfc_pca.score(x_test_pca, y_test_pca)
print(rfc_pca_train_score, rfc_pca_test_score)
rfc_pca_y_pred = rfc_pca.predict(x_test_pca)
rfc_pca_accuracy = accuracy_score(y_test_pca, rfc_pca_y_pred)
rfc_pca_accuracy
rfc_pca_mse = mean_squared_error(y_test_pca, rfc_pca_y_pred)
rfc_pca_rmse = np.sqrt(rfc_pca_mse)
print(rfc_pca_mse, rfc_pca_rmse)
The RFC has really low MSE and RMSE values. Hence, it has not overfit.
confusion_matrix_rfc_pca = confusion_matrix(y_test_pca, rfc_pca_y_pred)
confusion_matrix_rfc_pca
classification_report_rfc_pca = classification_report(y_test_pca, rfc_pca_y_pred)
print(classification_report_rfc_pca)
We got an accuracy of 96% which means that our model has not overfit.
We see that, except for the Decision Tree and the Random Forest, all the other algorithms overfit on the Iris dataset. We also see that using PCA did not make much of a difference in accuracy, but it reduced the computation, since the models used 2 components instead of all 4 features while still retaining 95% of the variance in the data.
Thus, we can conclude that the Decision Tree Classifier and the Random Forest Classifier, applied to the PCA-transformed features, were the best algorithms for this problem statement.