import numpy as np
import pandas as pd
#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
#preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
#Checking Z value for outlier treatment
from scipy import stats
from sklearn.decomposition import PCA
#splitting the dataset into training and testing. 60:20:20
# 60% for training, 20% for validation and 20% for testing.
from sklearn.model_selection import train_test_split
#picking models for prediction.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
#ensemble models for better performance
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
#error evaluation
from sklearn.metrics import mean_squared_error
#ignore warning to make notebook prettier
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
#displays all rows and all columns without cutting anything.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
path = '../dataset/adult-earning-potential/adult_data.csv'
adult_data = pd.read_csv(path)
adult_data.head().T
We see that the data has no header row. The column names are in another CSV file, so we will read that file and use its columns as the header for this one.
adult_header = pd.read_csv('../dataset/adult-earning-potential/adult_names_head.csv')
print(adult_header)
col_names = list(adult_header.columns)
print('col_names: {} \n length: {}'.format(col_names, len(col_names)))
len(adult_data.columns) #checks the number of columns
We see that our header file is missing 4 columns: the header file has 11 columns while the data file has 15. We have to recheck the datasets to see which columns are missing.
After checking, we see that the final column header is:
['Age','Workclass','fnlwgt','Education','Education-num','Marital_Status','Occupation','Relationship','Race','Sex','Capital-gain','Capital-loss','hrs_per_week','Native-Country','Earning_potential']
We will read our data and add this list as our header.
data_header = ['Age','Workclass','fnlwgt','Education','Education-num','Marital_Status','Occupation','Relationship','Race','Sex','Capital-gain','Capital-loss','hrs_per_week','Native-Country','Earning_potential']
adult_data = pd.read_csv(path, names = data_header)
adult_data.head()
adult_data.info()
We see that there are no null values. But sometimes missing values are recorded as special characters instead of NaN, so we need to check for this. We will draw a small random sample to see whether any special characters appear in the data.
adult_data.sample(50)
From the random sample, we see that there are '?' entries in the dataset. These represent missing values, so we will search for the columns containing '?' and then deal with them.
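A quick way to do that search (a small sketch; values in the raw file may carry leading spaces, so we strip before comparing):
# Count how many cells in each column are just a '?' and show only the affected columns.
question_marks = adult_data.apply(lambda col: (col.astype(str).str.strip() == '?').sum())
print(question_marks[question_marks > 0])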
adult_data = adult_data.replace(to_replace=r'^\s*\?\s*$', value=np.nan, regex=True) #replaces every cell that is just a '?' (with or without surrounding spaces) with NaN
adult_data.isna().sum()
We now need to treat the missing data. For this, we will check the affected columns and replace the missing values with the mode if the column is categorical (provided the mode clearly dominates the other values) or with the median if the column is numeric.
Before that, we need to check the type of data in each column (categorical or numeric).
all_columns = list(adult_data.columns)
print('all_columns:\n {}'.format(all_columns))
categorical_columns = list(adult_data.select_dtypes(include=['object']).columns)
print('Categorical columns:\n {}'.format(categorical_columns))
numerical_columns = list(adult_data.select_dtypes(include=['int64', 'float64']).columns)
print('Numerical columns:\n {}'.format(numerical_columns))
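As a quick sanity check on the mode-imputation idea, a minimal sketch that looks at how dominant the most frequent value is in each column that still has missing entries:
# Share of the most frequent value in every column that has missing data;
# a clearly dominant mode makes mode-imputation a reasonable choice.
cols_with_nulls = adult_data.columns[adult_data.isnull().any()]
for col in cols_with_nulls:
    top_share = adult_data[col].value_counts(normalize=True).iloc[0]
    print('{}: mode = {}, share = {:.2%}'.format(col, adult_data[col].mode()[0], top_share))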
Now that we have checked the data and found the numeric and categorical columns, we can proceed with the analysis.
adult_data.info()
As we replaced the '?' entries with null values, the non-null counts now differ between columns. We have already counted the null values in those columns and will deal with them as well.
adult_data.describe().T
From the five-number summary given by describe, we see that the data in Capital-gain and Capital-loss is vastly spread out, with really high variance. This either means these columns carry a large portion of the information, or that they contain a lot of extreme values. In our case they contain extreme values, and they are 0 (zero) in a lot of rows.
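To back that up, a quick check (a small sketch) of what fraction of rows is exactly zero in these two columns:
# Fraction of rows where capital gain / capital loss is exactly zero.
for col in ['Capital-gain', 'Capital-loss']:
    zero_share = (adult_data[col] == 0).mean()
    print('{}: {:.1%} of rows are zero'.format(col, zero_share))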
The mean of the Age column is 38.58 and the median is 37 (50th percentile), suggesting that many of these people have families to take care of. This may be a factor to consider when reasoning about an individual's earning potential.
fnlwgt is the final sampling weight assigned to each person, which depends on the demographic group they belong to, so we should consider it together with the country they come from.
Hours of work per week also depend on the country and the field the person works in. Self-employed people may work more or fewer hours than employees, and the profession also influences the hours worked per week.
We will now explore the categorical columns.
len(categorical_columns)
We will be checking for the count of the categorical data and getting inferences from it.
plt.figure(figsize = (15,10))
sns.countplot(adult_data['Workclass'])
plt.show()
We realize that most of the people in the survey belong to the private sector. The data is skewed, as we have barely any information on the other kinds of workclasses.
adult_data['Workclass'].value_counts()
adult_data.groupby(['Workclass', 'Earning_potential']).size()
Everyone in the survey who has worked without pay earns <=50K because... well... they aren't paid xD
The same goes for people who have never worked. Maybe their families are filthy rich.
We do see that the private sector, which employs most of the people in the survey, also has over 3 times as many employees earning <=50K as earning >50K.
This may be because our data is skewed.
plt.figure(figsize = (15,10))
sns.countplot(adult_data[categorical_columns[1]])
plt.show()
This is an interesting graph, as most of the people in the survey are HS-grads, hold a Bachelor's degree or have attended some college.
adult_data['Education'].value_counts()
adult_data.groupby(['Education', 'Earning_potential']).size()
From this we understand that most people with a lower level of education usually earn less than 50K. There are exceptions, however, which may come down to experience, learning things on their own, or just being really good at their trade.
Only for Bachelors is the gap between people earning >50K and <=50K fairly low. That may be because a Bachelor's degree covers many trades, and talent and hard work usually pay off really well.
Everyone who has not studied beyond preschool earns less than 50K.
On the other hand, people who have pursued higher education such as a Masters or Doctorate are more likely to earn >50K. We need to analyze other factors to narrow this down.
adult_data.groupby(['Education', 'Workclass']).size()
This gives us insights into the relationship between workclass and education.
We see that the only people who work without pay are those with an Assoc-acdm, HS-grads, or those who attended some college.
We also see that, no matter the educational level, the largest number of people in each category work in the private sector.
The mode of each of those columns is the private sector as well.
plt.figure(figsize = (15,10))
sns.countplot(adult_data[categorical_columns[2]])
plt.show()
adult_data['Marital_Status'].value_counts()
adult_data.groupby(['Marital_Status', 'Earning_potential']).size()
We see that when a person is married to a civilian spouse, the difference between the two income-potential classes tends to be lower.
The biggest difference is among people who were never married: almost all of them earn less than 50K. This may be because they are relatively younger and so have less experience, or there can be a range of other factors (like education).
adult_data.groupby(['Marital_Status', 'Workclass']).size()
We again see that, across all marital statuses, the private sector is where most people work, usually by an overwhelming majority.
The exception is people married to a civilian spouse, where Local-gov and Self-emp-not-inc have 1600+ and 1000+ entries while the private sector sits at around 9.7k.
adult_data.groupby(['Marital_Status', 'Education']).size()
Most people who did a Masters, a Bachelor's or attended some college tend to have civilian spouses. Bachelors and some-college attendees are also well represented among those who were never married.
This may simply be because Bachelors, HS-grads and some-college attendees outnumber most of the other groups.
Still, a large share of the people who did a Masters are married to civilian spouses.
plt.figure(figsize=(25,15))
sns.countplot(adult_data['Native-Country'])
plt.show()
adult_data['Native-Country'].value_counts()
This column is heavily skewed towards the United States, so we can simply replace its null values with United-States.
We also notice that there is no inconsistency in the values (such as US, USA and United States appearing in the same dataset), so we do not need to worry about that.
adult_data['Occupation'].value_counts()
Occupation is not dominated by a single value, so we can check its connections with other columns to see whether we can replace its null values with anything sensible.
adult_data.groupby(['Occupation', 'Education']).size()
As expected, Bachelors, HS-grads and some-college dominate here as well, but there are some interesting findings.
In Tech-support, Bachelors and some-college are the most common, with HS-grads coming in third, followed by Assoc-voc.
Bachelors, Masters and Doctorates prefer the Prof-specialty field of work. Most Masters and Doctorates work in this field, and they are also the ones who usually get paid >50K, so this makes a lot of sense.
adult_data.groupby(['Occupation', 'Workclass']).size()
Here again, the private sector is the most common, except for Farming-fishing, where Self-emp-not-inc is present in abundance. Since those are usually not the people who get paid >50K, farming is probably not very profitable. Most of the people who work without pay also stay in this trade.
plt.figure(figsize=(15,10))
sns.countplot(adult_data['Race'])
adult_data.groupby(['Race', 'Earning_potential']).size()
This is a white-dominated dataset. The ratio of people earning <=50K to those earning >50K within each race is roughly: Amer-Indian-Eskimo 7.6, Asian-Pac-Islander 2.7, Black 7.0, Other 9.8, White 2.9.
So Whites and Asian-Pac-Islanders have a lower ratio of <=50K to >50K earners, while the gap is more pronounced for the other races, especially for people tagged 'Other'.
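Ratios like these can be reproduced directly with a crosstab (a small sketch; the first crosstab column is the '<=50K' class since it sorts before '>50K'):
# Count earners per race, then divide the <=50K column by the >50K column.
race_counts = pd.crosstab(adult_data['Race'], adult_data['Earning_potential'])
print(race_counts.iloc[:, 0] / race_counts.iloc[:, 1])  # <=50K earners per one >50K earner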
This may be because of education. So let's check that.
adult_data.groupby(['Race', 'Education']).size()
We see that most White people are HS-grads and a lot of them attended some college. The next largest groups are Bachelors and Masters.
We see that most people in the Other category do not hold high degrees, which might explain the lower income potential.
A lot of people in the Black category are HS-grads. Since HS-grads include a high number of people with income potential <=50K, this may explain why the ratio is so high.
For Asian-Pac-Islanders, a comparatively large share have done a Masters, which may have contributed to their higher income potential.
plt.figure(figsize=(15,10))
sns.countplot(adult_data['Sex'])
adult_data.groupby(['Sex', 'Earning_potential']).size()
We see that the ratio of males earning <=50K to those earning >50K is around 2.3, but for women the same ratio climbs to around 8.1. This suggests women are being paid less. We have to check the same for every occupation to see whether women tend to work in occupations that usually pay less. This can go both ways.
adult_data.groupby(['Education', 'Sex']).size()
We see that women are much more prominent at the lower levels of education, and then among Bachelors, HS-grads and some-college. This suggests a lot of women stopped their education during or right after school.
adult_data.groupby(['Occupation', 'Sex']).size()
Women dominate Adm-clerical, Other-service and Priv-house-serv.
However, in all of these services, the majority of people get paid <=50K.
We may need to check the numeric columns and then plot a heatmap of the correlations of all the columns with the target in order to get a better idea about the dataset.
numerical_columns
# for i in range(len(numerical_columns)):
# plt.figure(figsize=(15,10))
# sns.distplot(adult_data[numerical_columns[i]])
# plt.show()
# Too many graphs
# for i in range(len(numerical_columns)):
# plt.figure(figsize=(15,10))
# sns.boxplot(adult_data[numerical_columns[i]])
# plt.show()
# Too many graphs
adult_data.var(axis=0)
adult_data.loc[:, numerical_columns].var()
Let's make our lives easier and print the variances as rounded float values.
var_in_float = adult_data.loc[:, numerical_columns].var()
for i in range(len(numerical_columns)):
    print('{} \t\t {}'.format(numerical_columns[i], round(float(var_in_float.iloc[i]), 3)))
We see that fnlwgt, Capital-gain and Capital-loss have the highest variance. This can occur either because these columns carry a lot of information or... because they have a few very extreme values. Let's check those out.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['fnlwgt'])
plt.show()
Do not be fooled by the tiny 0.2 steps: at the end of the axis there is a 1e6 multiplier, i.e. 1.0e+06 = 1,000,000 = 1 * 10^6, so that 0.2 is actually 200,000. That would explain the high variance.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['fnlwgt'])
plt.show()
We see that there are a large number of outliers here. The median lies around 0.2 * 10^6, but a lot of points lie well beyond the 75th percentile. We will have to treat this column for outliers.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Capital-gain'])
plt.show()
This graph is quite interesting. Most of the data sits near zero, some of it is in the 5k-20k range, and there is some data around 100,000 as well! An outlier like that throws off our variance by a lot. We need to deal with those outliers eventually, or the models we build later will not make good predictions.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Capital-gain'])
plt.show()
That looks like a lot of outliers, as almost all of the data is centered at 0, implying very few people had any capital gain. Without much capital gain it is difficult to break the <=50K barrier, which helps explain why so many people in the survey had an income potential of <=50K.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Capital-loss'])
plt.show()
We again see the data is centered towards 0 with some outliers near 2000. We will have to clean or scale this data.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Capital-loss'])
plt.show()
Again, a large number of people have no capital loss, and we saw earlier that a large number have no capital gain either. So maybe people in our sample do not invest, have passive income or take risks. This is kind of sad to see, but at least there were no large losses: the highest loss we see is somewhere in the range of 5,000.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Education-num'])
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Education-num'])
plt.show()
We see that most people fall within the 9-12 range, with the data skewed to the left; that range corresponds to the HS-grad part of the graph. A few people are well below that, with education numbers of 4 down to 1.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['hrs_per_week'])
plt.show()
We see that a lot of people work around 40 hours per week. There are also some people towards the 0 side of the graph; they may be the ones who work without pay or do not get paid at all. We will check for those as well.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['hrs_per_week'])
plt.show()
We see that a lot of people work 40-hour weeks, with most falling in the 30-50 range. However, plenty of people work a lot more or a lot less than that; some even work 100-hour weeks! They are either very passionate, or are having a very bad time.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Age'])
plt.show()
We see that Age is right-skewed, as more of the people working in this survey are on the younger side.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Age'])
plt.show()
We see that most working people are between 17 and a little under 80 years old. 80 is... well, hats off to them, and to those who work well beyond that into their early 90s. It is fascinating to see people work to that age. They must be very passionate about what they do. Or there may be something sadder at play.
Correlation between the numeric columns.
adult_data.corr()
We see no strong correlation between any of the numeric columns. This does not mean that none of the data is correlated; we just have not found that correlation yet.
Let's encode and scale our data. That way, our models will have an easier time working with it.
Filling the null columns.
null_columns = adult_data.columns[adult_data.isnull().any()]
adult_data[null_columns].isnull().sum()
Checking the mode of the null columns. We can check the mode directly, instead of checking the type of each column, because we established at the beginning that only 3 categorical columns have null values.
# adult_data.loc[:, null_columns].mode()
#checking for dataset info before replacing columns.
adult_data.info()
for i in list(null_columns):
    adult_data[i].fillna(adult_data[i].mode().values[0], inplace=True)
print(adult_data.isna().sum())
adult_data.info()
Now that we have treated our data and cleaned our null values, we can go ahead and encode our data.
Label encoding our categorical columns. We could one-hot encode them too, but that is a whole different thing in itself.
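For reference, the one-hot route would look roughly like this with pandas (a sketch of the alternative; not used in the rest of this notebook):
# One-hot encode every categorical column except the target; drop_first removes the redundant dummy.
one_hot_adult_data = pd.get_dummies(
    adult_data,
    columns=[c for c in categorical_columns if c != 'Earning_potential'],
    drop_first=True)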
adult_data[categorical_columns].head()
label_encoder = LabelEncoder()
encoded_adult_data = adult_data.copy()  # work on a copy so the original dataframe stays untouched
for i in categorical_columns:
    encoded_adult_data[i] = label_encoder.fit_transform(adult_data[i])
encoded_adult_data[categorical_columns].head()
min_max_scaler = MinMaxScaler()
scaled_encoded_adult_data = pd.DataFrame()
column_values = encoded_adult_data.columns.values
column_values = column_values[:-1]  # drop the last column, the Earning_potential target
print(column_values[-1])  # sanity check: the last feature column should now be Native-Country
scaled_values = min_max_scaler.fit_transform(encoded_adult_data[column_values])
for i in range(len(column_values)):
    scaled_encoded_adult_data[column_values[i]] = scaled_values[:, i]
scaled_encoded_adult_data['Earning_potential'] = encoded_adult_data['Earning_potential']
scaled_encoded_adult_data.sample(10)
# encoded_adult_data.head()
scaled_encoded_adult_data.info()
scaled_encoded_adult_data.describe().T
for i in range(len(numerical_columns)):
    plt.figure(figsize=(15, 10))
    sns.boxplot(scaled_encoded_adult_data[numerical_columns[i]])
    plt.show()
As we can see in the graphs above, Scaling does nothing to the distribution and does not deal with the outliers either. We have to take care of the outliers.
We have already established the numeric and categorical columns. So, it will be easier for us to deal with them now.
def outlier_detector(datacolumn):
    # IQR-based bounds; np.percentile does not need the data to be pre-sorted
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    return lower_bound, upper_bound
# This takes a column of the dataframe (a Series), computes its 25th and 75th
# percentiles, and returns the IQR-based lower and upper bounds for outliers.
lowerbound, upperbound = outlier_detector(scaled_encoded_adult_data['Age'])
lowerbound, upperbound
scaled_encoded_adult_data[(scaled_encoded_adult_data.Age < lowerbound) | (scaled_encoded_adult_data.Age > upperbound)]
new_columns = numerical_columns.copy()
new_columns.remove('Capital-gain') #Sparse column, must not be treated
new_columns.remove('Capital-loss') #Sparse column, must not be treated
new_columns
treated_scaled_encoded_adult_data = scaled_encoded_adult_data.copy()
for i in new_columns:
    lowerbound, upperbound = outlier_detector(treated_scaled_encoded_adult_data[i])
    median = treated_scaled_encoded_adult_data[i].median()
    outlier_mask = ((treated_scaled_encoded_adult_data[i] < lowerbound) |
                    (treated_scaled_encoded_adult_data[i] > upperbound))
    print('{}: number of outliers replaced: {}'.format(i, outlier_mask.sum()))
    # replace the outlying values with the column median
    treated_scaled_encoded_adult_data.loc[outlier_mask, i] = median
Now that we have treated our outliers, we can now go ahead and plot a correlation heatmap.
fig,ax=plt.subplots(figsize=(20,15))
ax=sns.heatmap(treated_scaled_encoded_adult_data.corr(),annot=True)
From the heatmap we see that none of the columns are strongly correlated with each other, i.e. none of them have a correlation value of >0.7 or <-0.7. So we must find another way to choose our features.
Selecting all features and the target column
print(all_columns)
features = all_columns[:-1]
target = treated_scaled_encoded_adult_data['Earning_potential']
print(features)
print(treated_scaled_encoded_adult_data.shape)
We will now make a new dataframe and use it for our train test splitting.
feature_df = treated_scaled_encoded_adult_data[features]
print(target.head())
feature_df.head()
We will not be using PCA for feature extraction because, as we have seen, several columns have very high variance but do not necessarily contribute much information. Using PCA here would therefore be a bad idea: we might end up keeping high-variance directions that have nothing to do with our problem.
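Just to illustrate the point (PCA was imported above but is not used for modelling), a quick look at how the variance is spread across components; this is a sketch only, and a large explained-variance ratio says nothing about relevance to Earning_potential:
# Fit PCA purely to inspect the explained-variance ratios of the components.
pca = PCA()
pca.fit(feature_df)
print(np.round(pca.explained_variance_ratio_, 3))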
x_train, x_test, y_train, y_test = train_test_split(feature_df, target, test_size=0.2)
print(x_train.shape,y_train.shape, x_test.shape, y_test.shape)
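The import comments mention a 60:20:20 split, but only an 80:20 train/test split is used here. If a separate validation set were wanted, it could be carved out with a second call (a sketch; the *_v names are placeholders and are not used later):
# 20% test first, then 25% of the remaining 80% for validation -> 60:20:20 overall.
x_rest, x_test_v, y_rest, y_test_v = train_test_split(feature_df, target, test_size=0.2)
x_train_v, x_val, y_train_v, y_val = train_test_split(x_rest, y_rest, test_size=0.25)
print(x_train_v.shape, x_val.shape, x_test_v.shape)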
We shall build models to check how they perform after our data preprocessing.
We will start with logistic regression. Since the target column is binary, LogisticRegression can be used.
logistic_regressor = LogisticRegression()
logistic_regressor.fit(x_train, y_train)
logistic_train_score = logistic_regressor.score(x_train, y_train)
logistic_test_score = logistic_regressor.score(x_test, y_test)
logistic_prediction = logistic_regressor.predict(x_test)
print('Train Score: {0}\nTest Score: {1}'.format(logistic_train_score, logistic_test_score))
logistic_mse = mean_squared_error(y_test, logistic_prediction)
logistic_rmse = np.sqrt(logistic_mse)
print(logistic_mse, logistic_rmse)
We see that our logistic regression does not perform particularly well. Note that for a 0/1 target, the MSE is simply the fraction of misclassified samples (1 - accuracy) and the RMSE is its square root. Part of the problem may be that the categorical features are label encoded, which imposes an arbitrary ordering that a linear model struggles with. That is why we will try other algorithms and compare them.
Before we build our KNN model, we need to check for what value of K the model has the least error. That will help us build a more optimal model.
error_rate = []
# This may take a while to run
k_values = list(filter(lambda x: x % 2 == 1, range(0, 50)))  # odd values of K: 1, 3, ..., 49
for i in k_values:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
best_index = error_rate.index(np.min(error_rate))
print(best_index, k_values[best_index])
Since we only tried odd values of K, an index of 12 corresponds to K = k_values[12] = 2 * 12 + 1 = 25.
Thus, the optimal value of K is 25.
plt.figure(figsize=(10,10))
plt.plot(k_values,error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
From the plot, the error rate is lowest around K = 25, so we will take n_neighbors to be 25 for our model.
knn_classifier = KNeighborsClassifier(n_neighbors=25)
knn_classifier.fit(x_train, y_train)
knn_train_score = knn_classifier.score(x_train, y_train)
knn_test_score = knn_classifier.score(x_test, y_test)
print('Train score: {}\nTest score: {}'.format(knn_train_score, knn_test_score))
knn_prediction = knn_classifier.predict(x_test)
knn_classifier_mse = mean_squared_error(y_test, knn_prediction)
knn_classifier_rmse = np.sqrt(knn_classifier_mse)
print('MSE: {}\nRMSE: {}'.format(knn_classifier_mse, knn_classifier_rmse))
We see that KNN did a little better than logistic regression, perhaps because it can capture non-linear relationships that a linear model misses. We shall try other algorithms and see if they perform better.
svc = SVC(kernel='rbf')
svc.fit(x_train, y_train)
svc_train_score = svc.score(x_train, y_train)
svc_test_score = svc.score(x_test, y_test)
print('Train score: {}\nTest score: {}'.format(svc_train_score, svc_test_score))
svc_prediction = svc.predict(x_test)
svc_mse = mean_squared_error(y_test, svc_prediction)
svc_rmse = np.sqrt(svc_mse)
print('MSE: {}\nRMSE: {}'.format(svc_mse, svc_rmse))
The accuracy here is again fairly low. Maybe we need to revisit our feature extraction step.
dtree_classifier = DecisionTreeClassifier(min_impurity_decrease = 0.05)
dtree_classifier.fit(x_train, y_train)
dtree_train_score = dtree_classifier.score(x_train, y_train)
dtree_test_score = dtree_classifier.score(x_test, y_test)
print('Train score: {}\nTest score: {}'.format(dtree_train_score, dtree_test_score))
dtree_prediction = dtree_classifier.predict(x_test)
dtree_mse = mean_squared_error(y_test, dtree_prediction)  # use the decision tree's own predictions, not the SVC's
dtree_rmse = np.sqrt(dtree_mse)
print('MSE: {}\nRMSE: {}'.format(dtree_mse, dtree_rmse))
adaboost_classifier = AdaBoostClassifier(n_estimators=3)
adaboost_classifier.fit(x_train,y_train)
adaboost_train_score = adaboost_classifier.score(x_train,y_train)
adaboost_test_score = adaboost_classifier.score(x_test,y_test)
print('Train score: {}\nTest score: {}'.format(adaboost_train_score, adaboost_test_score))
adaboost_prediction = adaboost_classifier.predict(x_test)
adaboost_mse = mean_squared_error(y_test, adaboost_prediction)
adaboost_rmse = np.sqrt(adaboost_mse)
print('MSE: {}\nRMSE: {}'.format(adaboost_mse, adaboost_rmse))
random_forest_classifier = RandomForestClassifier(n_estimators=20, min_samples_split=15, min_impurity_decrease=0.05)
random_forest_classifier.fit(x_train, y_train)
random_forest_train_score = random_forest_classifier.score(x_train,y_train)
random_forest_test_score = random_forest_classifier.score(x_test,y_test)
print('Train score: {}\nTest score: {}'.format(random_forest_train_score, random_forest_test_score))
random_forest_prediction = random_forest_classifier.predict(x_test)
random_forest_mse = mean_squared_error(y_test, random_forest_prediction)
random_forest_rmse = np.sqrt(random_forest_mse)
print('MSE: {}\nRMSE: {}'.format(random_forest_mse, random_forest_rmse))
As soon as we look at the dataset we realize that this is a US-based survey. Mostly people of White and Black ethnicity took part, but other ethnicities are present as well. The data was less skewed for the Asian-Pac-Islander group, where the ratio of people earning less than 50K to those earning more was lower than for most other ethnicities. The dataset also contains information on more males than females, perhaps because fewer women took the survey. It is also heavily weighted towards people making <=50K USD.
As we went through the analysis, we found many interesting things. Most people go and find work right after high school. However, those who pursue a Bachelor's or higher studies such as a Masters, a Doctorate or a specialization tend to earn more. Some people do not even make it through high school, and they almost always earn less than 50K, which might be because of a lack of skill, education, exposure or more.
We noticed that there is barely any capital gain or capital loss for most people, which leads us to believe there is not a lot of financial growth in this sample. However, the amount people gain is overwhelming compared to the amount they lose.
With our heatmap we saw no strong mathematical correlation, but the other analysis methods gave us some insightful information. From this survey we see that a lot of women earn less than 50K, and it is not just women: racial minorities also seem to earn less.
People tend to work 40-hour weeks, but it is not unusual to see people working a lot more or a lot less, and the working ages range from 17 to over 90. It is interesting that people that old still work. Taken together, these features tell us that people older than 60 usually work less, and that most people who earn more than 50K work either long weeks or short ones.
We see that none of the models perform very well, so we did not check any other metrics such as the classification report or the confusion matrix. This may be because of the features we chose. There is very high variance within the data, which is why we did not use PCA for dimensionality reduction: it would only pick directions with high variance, which does not necessarily mean they have anything to do with our target. We would have to use neural networks or better feature extraction methods to get better results.