UCI Adult Dataset: Predicting Earning Potential


Goal: determine whether a person's earning potential is more than 50K USD a year or not.

Importing libraries

In [1]:
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

#Checking Z value for outlier treatment
from scipy import stats

from sklearn.decomposition import PCA

#splitting the dataset 60:20:20 —
# 60% for training, 20% for validation and 20% for testing.
from sklearn.model_selection import train_test_split

#picking models for prediction.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

#ensemble models for better performance
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier

#error evaluation
from sklearn.metrics import mean_squared_error

#ignore warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
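
The 60:20:20 split promised above needs two chained calls, since train_test_split only splits two ways at a time. A minimal sketch (X and y are placeholders for the feature matrix and target built later in the notebook):

# First carve off 20% of the data as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Then take 25% of the remaining 80% (i.e. 20% overall) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)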
In [2]:
#displays all rows and all columns without cutting anything.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Importing the dataset

In [3]:
path = '../dataset/adult-earning-potential/adult_data.csv'
adult_data = pd.read_csv(path)
adult_data.head().T
Out[3]:
                    0                   1                   2                   3                   4
39                  50                  38                  53                  28                  37
State-gov           Self-emp-not-inc    Private             Private             Private             Private
77516               83311               215646              234721              338409              284582
Bachelors           Bachelors           HS-grad             11th                Bachelors           Masters
13                  13                  9                   7                   13                  14
Never-married       Married-civ-spouse  Divorced            Married-civ-spouse  Married-civ-spouse  Married-civ-spouse
Adm-clerical        Exec-managerial     Handlers-cleaners   Handlers-cleaners   Prof-specialty      Exec-managerial
Not-in-family       Husband             Not-in-family       Husband             Wife                Wife
White               White               White               Black               Black               White
Male                Male                Male                Male                Female              Female
2174                0                   0                   0                   0                   0
0                   0                   0                   0                   0                   0
40                  13                  40                  40                  40                  40
United-States       United-States       United-States       United-States       Cuba                United-States
<=50K               <=50K               <=50K               <=50K               <=50K               <=50K

We see that the file has no header row: pandas has absorbed the first record as the column names (the unlabeled left column above). The headers are stored in a separate CSV file, so we will read that file and attach its names to this data.

In [4]:
adult_header = pd.read_csv('../dataset/adult-earning-potential/adult_names_head.csv')
print(adult_header)
col_names = list(adult_header.columns)
print('col_names: {} \n length: {}'.format(col_names, len(col_names)))
Empty DataFrame
Columns: [age, workclass, fnlwgt, education, education-num, marital-status, occupation, capital-gain, capital-loss, hours-per-week, native-country]
Index: []
col_names: ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'] 
 length: 11
In [5]:
len(adult_data.columns) #checks the number of columns
Out[5]:
15

We see that our header file is missing 4 names, since the header and data files have 11 and 15 columns respectively. We have to recheck the dataset description to see which columns we are missing.


After checking, we see that the final column header is:
['Age','Workclass','fnlwgt','Education','Education-num','Marital_Status','Occupation','Relationship','Race','Sex','Capital-gain','Capital-loss','hrs_per_week','Native-Country','Earning_potential']
We will read our data and add this list as our header.

In [6]:
data_header = ['Age','Workclass','fnlwgt','Education','Education-num','Marital_Status','Occupation','Relationship','Race','Sex','Capital-gain','Capital-loss','hrs_per_week','Native-Country','Earning_potential']
adult_data = pd.read_csv(path, names = data_header)
adult_data.head()
Out[6]:
Age Workclass fnlwgt Education Education-num Marital_Status Occupation Relationship Race Sex Capital-gain Capital-loss hrs_per_week Native-Country Earning_potential
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
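
As a quick sanity check, the missing names can be recovered by comparing the two lists (a sketch using col_names from the header file and data_header above; note the different separators and capitalization):

# Normalize separators/case, then list names in data_header absent from col_names.
norm = lambda s: s.lower().replace('_', '-')
known = {norm(c) for c in col_names}
print([c for c in data_header if norm(c) not in known])
# ['Relationship', 'Race', 'Sex', 'hrs_per_week', 'Earning_potential']
# hrs_per_week is just our rename of hours-per-week, so the genuinely new
# columns are Relationship, Race, Sex and the target Earning_potential.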
In [7]:
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Age                32561 non-null  int64 
 1   Workclass          32561 non-null  object
 2   fnlwgt             32561 non-null  int64 
 3   Education          32561 non-null  object
 4   Education-num      32561 non-null  int64 
 5   Marital_Status     32561 non-null  object
 6   Occupation         32561 non-null  object
 7   Relationship       32561 non-null  object
 8   Race               32561 non-null  object
 9   Sex                32561 non-null  object
 10  Capital-gain       32561 non-null  int64 
 11  Capital-loss       32561 non-null  int64 
 12  hrs_per_week       32561 non-null  int64 
 13  Native-Country     32561 non-null  object
 14  Earning_potential  32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB

We see that there are no null values. But sometimes NaN values are encoded as other special characters, so we need to check for this. We will draw a small random sample to look for special characters in the data.

In [8]:
adult_data.sample(50)
Out[8]:
Age Workclass fnlwgt Education Education-num Marital_Status Occupation Relationship Race Sex Capital-gain Capital-loss hrs_per_week Native-Country Earning_potential
3415 61 Private 231323 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
10240 29 Private 162002 HS-grad 9 Never-married Other-service Own-child White Female 0 0 35 United-States <=50K
15908 22 Local-gov 163205 HS-grad 9 Never-married Other-service Own-child White Female 0 0 53 United-States <=50K
24419 35 Private 225860 Assoc-voc 11 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States >50K
12670 77 Private 253642 7th-8th 4 Married-civ-spouse Machine-op-inspct Husband Other Male 0 0 30 United-States <=50K
29167 20 Private 192711 Some-college 10 Never-married Handlers-cleaners Own-child White Female 0 0 40 United-States <=50K
18487 34 Private 207301 HS-grad 9 Divorced Machine-op-inspct Unmarried White Female 0 0 20 United-States <=50K
5044 55 Private 31905 Masters 14 Married-civ-spouse Exec-managerial Husband White Male 0 1977 40 United-States >50K
23650 55 Private 82098 HS-grad 9 Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander Male 0 0 55 United-States <=50K
28867 20 Private 282604 Some-college 10 Never-married Handlers-cleaners Other-relative White Male 0 0 20 United-States <=50K
27527 34 Private 173495 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 48 United-States >50K
2696 18 Private 57413 Some-college 10 Divorced Other-service Own-child White Male 0 0 15 United-States <=50K
19087 31 Self-emp-not-inc 111423 Bachelors 13 Never-married Craft-repair Not-in-family White Male 0 0 55 United-States <=50K
6260 42 Private 341204 HS-grad 9 Divorced Craft-repair Other-relative White Female 0 0 40 United-States <=50K
32000 25 Private 237065 Some-college 10 Divorced Other-service Own-child Black Male 0 0 38 United-States <=50K
15037 46 Self-emp-not-inc 275625 Bachelors 13 Divorced Other-service Unmarried Asian-Pac-Islander Female 0 0 60 South >50K
903 27 ? 90270 Assoc-acdm 12 Married-civ-spouse ? Own-child Amer-Indian-Eskimo Male 0 0 40 United-States <=50K
11461 53 Private 548580 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 40 Guatemala <=50K
9938 36 ? 36635 Some-college 10 Never-married ? Unmarried White Female 0 0 25 United-States <=50K
15929 32 Private 119124 Bachelors 13 Never-married Prof-specialty Own-child White Male 0 0 40 United-States <=50K
31340 24 Private 211160 Some-college 10 Married-civ-spouse Sales Husband White Male 0 0 40 United-States <=50K
23986 31 Private 352465 Some-college 10 Married-civ-spouse Exec-managerial Husband White Male 15024 0 50 United-States >50K
3497 35 Self-emp-not-inc 168475 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 60 United-States >50K
13847 23 Private 189017 Bachelors 13 Never-married Sales Not-in-family White Male 0 0 55 United-States <=50K
27418 49 Federal-gov 179869 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States >50K
24136 55 Federal-gov 305850 Prof-school 15 Married-civ-spouse Exec-managerial Husband White Male 15024 0 40 United-States >50K
23455 70 Private 304570 Bachelors 13 Widowed Machine-op-inspct Other-relative Asian-Pac-Islander Male 0 0 32 Philippines <=50K
1002 38 State-gov 343642 HS-grad 9 Married-civ-spouse Prof-specialty Wife White Female 0 0 40 United-States >50K
5788 27 ? 308995 Some-college 10 Divorced ? Own-child Black Female 0 0 40 United-States <=50K
13184 31 Private 159737 HS-grad 9 Never-married Sales Unmarried Black Female 0 0 58 United-States <=50K
22115 43 Private 160674 HS-grad 9 Divorced Exec-managerial Unmarried White Female 0 0 40 United-States <=50K
18115 28 State-gov 155397 Bachelors 13 Never-married Exec-managerial Not-in-family White Male 0 0 55 United-States <=50K
30182 74 Self-emp-not-inc 292915 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 1825 12 United-States >50K
30971 41 Local-gov 33658 Some-college 10 Married-civ-spouse Protective-serv Husband White Male 0 0 45 United-States >50K
18366 74 Self-emp-not-inc 192413 Prof-school 15 Divorced Prof-specialty Other-relative White Male 0 0 40 United-States <=50K
23411 31 State-gov 207505 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 1977 70 United-States >50K
14639 25 Private 108317 Bachelors 13 Never-married Exec-managerial Not-in-family White Male 0 0 40 United-States <=50K
29154 34 Private 101510 Bachelors 13 Divorced Sales Not-in-family White Female 0 0 50 United-States >50K
7296 31 State-gov 75755 Doctorate 16 Married-civ-spouse Exec-managerial Husband White Male 7298 0 55 United-States >50K
20303 24 Private 285457 HS-grad 9 Never-married Other-service Not-in-family White Male 0 0 40 United-States <=50K
32082 49 Private 23776 Some-college 10 Married-civ-spouse Adm-clerical Husband White Male 0 0 40 United-States <=50K
24958 24 Local-gov 387108 Some-college 10 Married-civ-spouse Protective-serv Husband Black Male 0 0 40 United-States <=50K
24005 49 Private 64216 HS-grad 9 Divorced Transport-moving Not-in-family White Male 0 0 90 United-States <=50K
18074 34 Federal-gov 436341 Some-college 10 Married-AF-spouse Adm-clerical Wife White Female 0 0 40 United-States >50K
29192 52 Private 165681 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States >50K
4387 52 Private 117700 HS-grad 9 Divorced Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K
811 22 ? 219941 Some-college 10 Never-married ? Own-child Black Male 0 0 40 United-States <=50K
17055 26 Private 182308 Some-college 10 Married-civ-spouse Prof-specialty Husband White Male 0 0 40 United-States <=50K
31762 61 Private 668362 1st-4th 2 Widowed Handlers-cleaners Not-in-family White Female 0 0 40 United-States <=50K
18133 30 Private 207301 HS-grad 9 Never-married Sales Not-in-family White Female 0 1980 40 United-States <=50K

From the random sample we see that there are '?' entries in the dataset, a placeholder for missing values. We will find the columns containing '?' and then deal with them.
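
Before replacing anything it is worth counting the placeholder directly. In this CSV every string value carries a leading space (e.g. ' ?'), so we strip before comparing; a minimal sketch:

#count '?' placeholders per text column (values are padded, hence the strip)
placeholders = adult_data.select_dtypes('object').apply(
    lambda col: col.str.strip().eq('?').sum())
print(placeholders)
#only Workclass, Occupation and Native-Country contain '?', with counts
#matching the value_counts further down (1836, 1843 and 583)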

In [9]:
adult_data = adult_data.replace(to_replace = '%?%', value = np.nan) #attempt to replace '?' placeholders with NaN
adult_data.isna().sum()
Out[9]:
Age                  0
Workclass            0
fnlwgt               0
Education            0
Education-num        0
Marital_Status       0
Occupation           0
Relationship         0
Race                 0
Sex                  0
Capital-gain         0
Capital-loss         0
hrs_per_week         0
Native-Country       0
Earning_potential    0
dtype: int64

Every count is zero: the replacement above did not actually take effect, because replace() matches exact values (not SQL-style '%' wildcards) and the raw strings are padded, e.g. ' ?'. The '?' entries therefore survive as their own category in the plots and tables below.
When we do treat the missing data, the plan is to fill categorical columns with the mode (where the mode clearly dominates) and numeric columns with the median. First we need to identify which columns are categorical and which are numeric.
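
A minimal sketch of that imputation strategy, to be run once the '?' entries have actually been converted to NaN (fillna is a no-op while there are none):

#mode for categorical columns, median for numeric ones
for col in adult_data.columns:
    if adult_data[col].dtype == 'object':
        adult_data[col] = adult_data[col].fillna(adult_data[col].mode()[0])
    else:
        adult_data[col] = adult_data[col].fillna(adult_data[col].median())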

In [10]:
all_columns = list(adult_data.columns)
print('all_columns:\n {}'.format(all_columns))

categorical_columns = list(adult_data.select_dtypes(include=['object']).columns)
print('Categorical columns:\n {}'.format(categorical_columns))

numerical_columns = list(adult_data.select_dtypes(include=['int64', 'float64']).columns)
print('Numerical columns:\n {}'.format(numerical_columns))
all_columns:
 ['Age', 'Workclass', 'fnlwgt', 'Education', 'Education-num', 'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'hrs_per_week', 'Native-Country', 'Earning_potential']
Categorical columns:
 ['Workclass', 'Education', 'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Native-Country', 'Earning_potential']
Numerical columns:
 ['Age', 'fnlwgt', 'Education-num', 'Capital-gain', 'Capital-loss', 'hrs_per_week']

Having identified the numeric and categorical columns, we can proceed with the analysis.

Data Exploration

In [11]:
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Age                32561 non-null  int64 
 1   Workclass          32561 non-null  object
 2   fnlwgt             32561 non-null  int64 
 3   Education          32561 non-null  object
 4   Education-num      32561 non-null  int64 
 5   Marital_Status     32561 non-null  object
 6   Occupation         32561 non-null  object
 7   Relationship       32561 non-null  object
 8   Race               32561 non-null  object
 9   Sex                32561 non-null  object
 10  Capital-gain       32561 non-null  int64 
 11  Capital-loss       32561 non-null  int64 
 12  hrs_per_week       32561 non-null  int64 
 13  Native-Country     32561 non-null  object
 14  Earning_potential  32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB

Since the '?' replacement above never took effect, info() still reports 32561 non-null values in every column. The real missing values are hiding inside the '?' category, and we will deal with them during preprocessing.
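
A replacement that does take effect needs either the exact padded string or a regex. A sketch, applied to a copy so the exploratory cells below still see '?' as a category:

#match '?' optionally surrounded by whitespace, across all columns
adult_clean = adult_data.replace(r'^\s*\?\s*$', np.nan, regex=True)
print(adult_clean.isna().sum().loc[['Workclass', 'Occupation', 'Native-Country']])
#expect 1836, 1843 and 583 respectively, per the value_counts in this notebook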

In [12]:
adult_data.describe().T
Out[12]:
count mean std min 25% 50% 75% max
Age 32561.0 38.581647 13.640433 17.0 28.0 37.0 48.0 90.0
fnlwgt 32561.0 189778.366512 105549.977697 12285.0 117827.0 178356.0 237051.0 1484705.0
Education-num 32561.0 10.080679 2.572720 1.0 9.0 10.0 12.0 16.0
Capital-gain 32561.0 1077.648844 7385.292085 0.0 0.0 0.0 0.0 99999.0
Capital-loss 32561.0 87.303830 402.960219 0.0 0.0 0.0 0.0 4356.0
hrs_per_week 32561.0 40.437456 12.347429 1.0 40.0 40.0 45.0 99.0

From the describe() summary we see that Capital-gain and Capital-loss are vastly spread out, with very high variance. In our case this comes from extreme values: both columns are 0 in the great majority of rows, with a few very large entries (up to 99999 for gains).
The mean of the Age column is 38.58 and the median is 37 (50th percentile), so the sample skews toward mid-career adults, many of whom may have families to support; age may be a useful factor when modeling earning potential.
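
The 75th percentile of both capital columns is already 0, so at least three quarters of the rows are zero there; the exact share is easy to check (a sketch):

#fraction of rows that are exactly zero in each capital column
print((adult_data[['Capital-gain', 'Capital-loss']] == 0).mean())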

fnlwgt is the census "final weight", an estimate of how many people in the population each record represents; it should be interpreted together with demographic columns such as native country rather than as a personal attribute.

Hours of work per week also depends on the country and the field the person works in: self-employed people may work more or fewer hours than employees, and profession also shapes the weekly workload.


We will now explore the categorical columns.

In [13]:
len(categorical_columns)
Out[13]:
9

Categorical Column Analysis


We will be checking for the count of the categorical data and getting inferences from it.

In [14]:
plt.figure(figsize = (15,10))
sns.countplot(x='Workclass', data=adult_data)
plt.show()

We see that most people in the survey work in the private sector. The sample is heavily imbalanced, as we have very little information about the other workclasses.

In [15]:
adult_data['Workclass'].value_counts()
Out[15]:
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: Workclass, dtype: int64
In [16]:
adult_data.groupby(['Workclass', 'Earning_potential']).size()
Out[16]:
Workclass          Earning_potential
 ?                  <=50K                1645
                    >50K                  191
 Federal-gov        <=50K                 589
                    >50K                  371
 Local-gov          <=50K                1476
                    >50K                  617
 Never-worked       <=50K                   7
 Private            <=50K               17733
                    >50K                 4963
 Self-emp-inc       <=50K                 494
                    >50K                  622
 Self-emp-not-inc   <=50K                1817
                    >50K                  724
 State-gov          <=50K                 945
                    >50K                  353
 Without-pay        <=50K                  14
dtype: int64

Everyone in the survey who has worked without pay earns <=50K because, well, they aren't paid. The same goes for people who have never worked.
We also see that the private sector, which employs most of the people in the survey, has over 3 times as many employees earning <=50K as earning >50K (17733 vs 4963).
This roughly mirrors the overall class balance of the dataset, which itself skews heavily toward <=50K.
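
Raw counts are hard to compare across groups of very different sizes. A row-normalized crosstab (a sketch) gives the share of each earning class within every workclass directly:

#each row sums to 1: the within-workclass split between <=50K and >50K
print(pd.crosstab(adult_data['Workclass'], adult_data['Earning_potential'],
                  normalize='index').round(3))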

In [17]:
plt.figure(figsize = (15,10))
sns.countplot(x=categorical_columns[1], data=adult_data)
plt.show()

This is an interesting plot, as most people in the survey are HS grads, hold a Bachelor's degree, or attended some college.

In [18]:
adult_data['Education'].value_counts()
Out[18]:
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: Education, dtype: int64
In [19]:
adult_data.groupby(['Education', 'Earning_potential']).size()
Out[19]:
Education      Earning_potential
 10th           <=50K                871
                >50K                  62
 11th           <=50K               1115
                >50K                  60
 12th           <=50K                400
                >50K                  33
 1st-4th        <=50K                162
                >50K                   6
 5th-6th        <=50K                317
                >50K                  16
 7th-8th        <=50K                606
                >50K                  40
 9th            <=50K                487
                >50K                  27
 Assoc-acdm     <=50K                802
                >50K                 265
 Assoc-voc      <=50K               1021
                >50K                 361
 Bachelors      <=50K               3134
                >50K                2221
 Doctorate      <=50K                107
                >50K                 306
 HS-grad        <=50K               8826
                >50K                1675
 Masters        <=50K                764
                >50K                 959
 Preschool      <=50K                 51
 Prof-school    <=50K                153
                >50K                 423
 Some-college   <=50K               5904
                >50K                1387
dtype: int64

From this we understand that most people with lower levels of education usually earn less than 50K. There are exceptions, however, perhaps owing to experience, self-teaching, or simply being very good at their trade.
Only for Bachelors is the gap between people earning >50K and <=50K fairly small. That may be because a Bachelor's degree covers many trades, where talent and hard work usually pay off well.
Everyone who has not studied beyond preschool earns less than 50K.
On the other hand, people who pursued higher education such as a Masters or Doctorate are more likely to earn >50K. We need to analyze other factors to narrow this down.

In [20]:
adult_data.groupby(['Education', 'Workclass']).size()
Out[20]:
Education      Workclass        
 10th           ?                    100
                Federal-gov            6
                Local-gov             31
                Never-worked           2
                Private              695
                Self-emp-inc          19
                Self-emp-not-inc      67
                State-gov             13
 11th           ?                    118
                Federal-gov            9
                Local-gov             36
                Never-worked           1
                Private              923
                Self-emp-inc          14
                Self-emp-not-inc      60
                State-gov             14
 12th           ?                     40
                Federal-gov            5
                Local-gov             19
                Private              333
                Self-emp-inc           7
                Self-emp-not-inc      19
                State-gov             10
 1st-4th        ?                     12
                Local-gov              4
                Private              136
                Self-emp-inc           2
                Self-emp-not-inc      13
                State-gov              1
 5th-6th        ?                     30
                Federal-gov            1
                Local-gov              9
                Private              266
                Self-emp-inc           4
                Self-emp-not-inc      19
                State-gov              4
 7th-8th        ?                     72
                Federal-gov            2
                Local-gov             28
                Never-worked           1
                Private              424
                Self-emp-inc          14
                Self-emp-not-inc      94
                State-gov             10
                Without-pay            1
 9th            ?                     51
                Federal-gov            3
                Local-gov             23
                Private              387
                Self-emp-inc          10
                Self-emp-not-inc      34
                State-gov              6
 Assoc-acdm     ?                     47
                Federal-gov           55
                Local-gov             88
                Private              729
                Self-emp-inc          35
                Self-emp-not-inc      71
                State-gov             41
                Without-pay            1
 Assoc-voc      ?                     61
                Federal-gov           38
                Local-gov             86
                Private             1005
                Self-emp-inc          38
                Self-emp-not-inc     108
                State-gov             46
 Bachelors      ?                    173
                Federal-gov          212
                Local-gov            477
                Private             3551
                Self-emp-inc         273
                Self-emp-not-inc     399
                State-gov            270
 Doctorate      ?                     15
                Federal-gov           16
                Local-gov             27
                Private              181
                Self-emp-inc          35
                Self-emp-not-inc      50
                State-gov             89
 HS-grad        ?                    532
                Federal-gov          263
                Local-gov            503
                Never-worked           1
                Private             7780
                Self-emp-inc         279
                Self-emp-not-inc     866
                State-gov            268
                Without-pay            9
 Masters        ?                     48
                Federal-gov           67
                Local-gov            342
                Private              894
                Self-emp-inc          79
                Self-emp-not-inc     124
                State-gov            169
 Preschool      ?                      5
                Local-gov              4
                Private               41
                State-gov              1
 Prof-school    ?                     18
                Federal-gov           29
                Local-gov             29
                Private              257
                Self-emp-inc          81
                Self-emp-not-inc     131
                State-gov             31
 Some-college   ?                    514
                Federal-gov          254
                Local-gov            387
                Never-worked           2
                Private             5094
                Self-emp-inc         226
                Self-emp-not-inc     486
                State-gov            325
                Without-pay            3
dtype: int64

This gives us insights into the relationship between workclass and education.
We see that the only people who work without pay are those with a 7th-8th grade education, an Assoc-acdm degree, a HS-grad diploma, or some college.
We also see that no matter the educational level, the largest group in each category works in the private sector.
Private is, in fact, the mode of Workclass within every education level.
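
That per-group mode is easy to verify (a sketch; remember the values are padded with a leading space):

#most common workclass within each education level
print(adult_data.groupby('Education')['Workclass'].agg(lambda s: s.mode()[0]))
#prints ' Private' for every education level, per the table above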

In [21]:
plt.figure(figsize = (15,10))
sns.countplot(x=categorical_columns[2], data=adult_data)
plt.show()
In [22]:
adult_data['Marital_Status'].value_counts()
Out[22]:
 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: Marital_Status, dtype: int64
In [23]:
adult_data.groupby(['Marital_Status', 'Earning_potential']).size()
Out[23]:
Marital_Status          Earning_potential
 Divorced                <=50K                3980
                         >50K                  463
 Married-AF-spouse       <=50K                  13
                         >50K                   10
 Married-civ-spouse      <=50K                8284
                         >50K                 6692
 Married-spouse-absent   <=50K                 384
                         >50K                   34
 Never-married           <=50K               10192
                         >50K                  491
 Separated               <=50K                 959
                         >50K                   66
 Widowed                 <=50K                 908
                         >50K                   85
dtype: int64

We see that married people with a civilian spouse show the smallest difference between the two classes of income potential.

The biggest difference is among people who were never married: almost all of them earn less than 50K. This may be because they are relatively younger and therefore less experienced, or due to a range of other factors (such as education).

In [24]:
adult_data.groupby(['Marital_Status', 'Workclass']).size()
Out[24]:
Marital_Status          Workclass        
 Divorced                ?                    184
                         Federal-gov          168
                         Local-gov            369
                         Never-worked           1
                         Private             3119
                         Self-emp-inc         100
                         Self-emp-not-inc     292
                         State-gov            210
 Married-AF-spouse       ?                      2
                         Federal-gov            3
                         Private               15
                         Self-emp-not-inc       2
                         State-gov              1
 Married-civ-spouse      ?                    636
                         Federal-gov          471
                         Local-gov           1023
                         Never-worked           1
                         Private             9732
                         Self-emp-inc         837
                         Self-emp-not-inc    1680
                         State-gov            588
                         Without-pay            8
 Married-spouse-absent   ?                     29
                         Federal-gov           11
                         Local-gov             22
                         Private              302
                         Self-emp-inc           5
                         Self-emp-not-inc      31
                         State-gov             17
                         Without-pay            1
 Never-married           ?                    766
                         Federal-gov          245
                         Local-gov            530
                         Never-worked           5
                         Private             8186
                         Self-emp-inc         125
                         Self-emp-not-inc     409
                         State-gov            413
                         Without-pay            4
 Separated               ?                     66
                         Federal-gov           26
                         Local-gov             63
                         Private              754
                         Self-emp-inc          20
                         Self-emp-not-inc      53
                         State-gov             43
 Widowed                 ?                    153
                         Federal-gov           36
                         Local-gov             86
                         Private              588
                         Self-emp-inc          29
                         Self-emp-not-inc      74
                         State-gov             26
                         Without-pay            1
dtype: int64

We again see that, for every marital status, the private sector employs the most people, usually by an overwhelming majority.
The closest call is for Married-civ-spouse, where Self-emp-not-inc (1680) and Local-gov (1023) are sizable, though Private still leads with about 9.7K entries.

In [25]:
adult_data.groupby(['Marital_Status', 'Education']).size()
Out[25]:
Marital_Status          Education    
 Divorced                10th             120
                         11th             130
                         12th              39
                         1st-4th           10
                         5th-6th           20
                         7th-8th           73
                         9th               64
                         Assoc-acdm       203
                         Assoc-voc        234
                         Bachelors        546
                         Doctorate         33
                         HS-grad         1613
                         Masters          233
                         Preschool          1
                         Prof-school       55
                         Some-college    1069
 Married-AF-spouse       Assoc-acdm         2
                         Assoc-voc          1
                         Bachelors          4
                         HS-grad           13
                         Some-college       3
 Married-civ-spouse      10th             349
                         11th             354
                         12th             130
                         1st-4th           81
                         5th-6th          172
                         7th-8th          359
                         9th              230
                         Assoc-acdm       460
                         Assoc-voc        689
                         Bachelors       2768
                         Doctorate        286
                         HS-grad         4845
                         Masters         1003
                         Preschool         20
                         Prof-school      412
                         Some-college    2818
 Married-spouse-absent   10th              15
                         11th              19
                         12th               8
                         1st-4th           12
                         5th-6th           20
                         7th-8th           14
                         9th                9
                         Assoc-acdm        12
                         Assoc-voc         13
                         Bachelors         68
                         Doctorate          7
                         HS-grad          121
                         Masters           17
                         Preschool          4
                         Prof-school        3
                         Some-college      76
 Never-married           10th             361
                         11th             586
                         12th             232
                         1st-4th           39
                         5th-6th           89
                         7th-8th          113
                         9th              155
                         Assoc-acdm       337
                         Assoc-voc        362
                         Bachelors       1795
                         Doctorate         73
                         HS-grad         3089
                         Masters          404
                         Preschool         22
                         Prof-school       93
                         Some-college    2933
 Separated               10th              49
                         11th              48
                         12th              14
                         1st-4th            9
                         5th-6th           18
                         7th-8th           23
                         9th               33
                         Assoc-acdm        30
                         Assoc-voc         42
                         Bachelors         92
                         Doctorate          7
                         HS-grad          406
                         Masters           25
                         Preschool          1
                         Prof-school        8
                         Some-college     220
 Widowed                 10th              39
                         11th              38
                         12th              10
                         1st-4th           17
                         5th-6th           14
                         7th-8th           64
                         9th               23
                         Assoc-acdm        23
                         Assoc-voc         41
                         Bachelors         82
                         Doctorate          7
                         HS-grad          414
                         Masters           41
                         Preschool          3
                         Prof-school        5
                         Some-college     172
dtype: int64

Most people who completed a Masters or Bachelors degree, or attended some college, are married to civilian spouses; Bachelors and some-college are also heavily represented among the never-married.
This is partly because Bachelors, HS-grads and some-college are the largest education groups overall. Even so, a notably large share of Masters holders are married to civilian spouses.

In [26]:
plt.figure(figsize=(25,15))
sns.countplot(x='Native-Country', data=adult_data)
plt.show()
In [27]:
adult_data['Native-Country'].value_counts()
Out[27]:
 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 France                           29
 Greece                           29
 Ecuador                          28
 Ireland                          24
 Hong                             20
 Trinadad&Tobago                  19
 Cambodia                         19
 Laos                             18
 Thailand                         18
 Yugoslavia                       16
 Outlying-US(Guam-USVI-etc)       14
 Hungary                          13
 Honduras                         13
 Scotland                         12
 Holand-Netherlands                1
Name: Native-Country, dtype: int64

Native-Country is extremely skewed toward the United States, so we can simply replace its missing values with 'United-States'.
We also notice that the labels are consistent (no mix of US, USA and United States within one column), so we do not need to worry about normalizing them.
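
A sketch of that imputation, again on a copy and stripping the padding first:

country = adult_data['Native-Country'].str.strip()
print(country.replace('?', 'United-States').value_counts().head())
#United-States grows from 29170 to 29753 rows (29170 + 583)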

In [28]:
adult_data['Occupation'].value_counts()
Out[28]:
 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: Occupation, dtype: int64

Occupation is not dominated by a single category, so we can check its connections with other columns to see what we might use to fill its missing values.

In [29]:
adult_data.groupby(['Occupation', 'Education']).size()
Out[29]:
Occupation          Education    
 ?                   10th             102
                     11th             119
                     12th              40
                     1st-4th           12
                     5th-6th           30
                     7th-8th           73
                     9th               51
                     Assoc-acdm        47
                     Assoc-voc         61
                     Bachelors        173
                     Doctorate         15
                     HS-grad          533
                     Masters           48
                     Preschool          5
                     Prof-school       18
                     Some-college     516
 Adm-clerical        10th              38
                     11th              67
                     12th              38
                     5th-6th            6
                     7th-8th           11
                     9th               14
                     Assoc-acdm       193
                     Assoc-voc        167
                     Bachelors        506
                     Doctorate          5
                     HS-grad         1365
                     Masters           68
                     Preschool          2
                     Prof-school        9
                     Some-college    1281
 Armed-Forces        12th               1
                     Bachelors          1
                     HS-grad            4
                     Masters            1
                     Some-college       2
 Craft-repair        10th             170
                     11th             175
                     12th              58
                     1st-4th           23
                     5th-6th           43
                     7th-8th          116
                     9th               96
                     Assoc-acdm       115
                     Assoc-voc        252
                     Bachelors        226
                     Doctorate          2
                     HS-grad         1922
                     Masters           22
                     Preschool          4
                     Prof-school        7
                     Some-college     868
 Exec-managerial     10th              24
                     11th              34
                     12th              13
                     1st-4th            4
                     5th-6th            1
                     7th-8th           19
                     9th               13
                     Assoc-acdm       145
                     Assoc-voc        150
                     Bachelors       1369
                     Doctorate         55
                     HS-grad          807
                     Masters          501
                     Prof-school       52
                     Some-college     879
 Farming-fishing     10th              44
                     11th              37
                     12th              16
                     1st-4th           18
                     5th-6th           36
                     7th-8th           70
                     9th               28
                     Assoc-acdm        14
                     Assoc-voc         52
                     Bachelors         77
                     Doctorate          1
                     HS-grad          404
                     Masters           10
                     Preschool          9
                     Prof-school        4
                     Some-college     174
 Handlers-cleaners   10th              71
                     11th             123
                     12th              38
                     1st-4th           16
                     5th-6th           40
                     7th-8th           46
                     9th               49
                     Assoc-acdm        24
                     Assoc-voc         28
                     Bachelors         50
                     HS-grad          611
                     Masters            5
                     Preschool          2
                     Some-college     267
 Machine-op-inspct   10th             101
                     11th              99
                     12th              35
                     1st-4th           23
                     5th-6th           56
                     7th-8th           93
                     9th               76
                     Assoc-acdm        33
                     Assoc-voc         63
                     Bachelors         69
                     Doctorate          1
                     HS-grad         1023
                     Masters            8
                     Preschool         11
                     Prof-school        1
                     Some-college     310
 Other-service       10th             194
                     11th             238
                     12th              85
                     1st-4th           40
                     5th-6th           64
                     7th-8th           98
                     9th              101
                     Assoc-acdm        78
                     Assoc-voc        115
                     Bachelors        181
                     Doctorate          1
                     HS-grad         1281
                     Masters           19
                     Preschool         15
                     Prof-school        4
                     Some-college     781
 Priv-house-serv     10th               6
                     11th              14
                     12th               4
                     1st-4th           11
                     5th-6th           14
                     7th-8th            8
                     9th               10
                     Assoc-acdm         2
                     Assoc-voc          4
                     Bachelors          7
                     HS-grad           50
                     Masters            1
                     Preschool          2
                     Some-college      16
 Prof-specialty      10th               9
                     11th              20
                     12th              10
                     1st-4th            4
                     5th-6th            1
                     7th-8th            9
                     9th                3
                     Assoc-acdm       138
                     Assoc-voc        170
                     Bachelors       1495
                     Doctorate        321
                     HS-grad          233
                     Masters          844
                     Preschool          1
                     Prof-school      452
                     Some-college     430
 Protective-serv     10th               6
                     11th               7
                     12th               6
                     1st-4th            1
                     5th-6th            1
                     7th-8th            9
                     9th                4
                     Assoc-acdm        34
                     Assoc-voc         48
                     Bachelors        100
                     HS-grad          215
                     Masters           15
                     Prof-school        1
                     Some-college     202
 Sales               10th              81
                     11th             144
                     12th              47
                     1st-4th            8
                     5th-6th           12
                     7th-8th           29
                     9th               32
                     Assoc-acdm       144
                     Assoc-voc        106
                     Bachelors        809
                     Doctorate          8
                     HS-grad         1069
                     Masters          134
                     Prof-school       18
                     Some-college    1009
 Tech-support        10th               3
                     11th               6
                     12th               3
                     5th-6th            1
                     7th-8th            5
                     9th                2
                     Assoc-acdm        73
                     Assoc-voc        126
                     Bachelors        230
                     Doctorate          3
                     HS-grad          159
                     Masters           37
                     Prof-school        7
                     Some-college     273
 Transport-moving    10th              84
                     11th              92
                     12th              39
                     1st-4th            8
                     5th-6th           28
                     7th-8th           60
                     9th               35
                     Assoc-acdm        27
                     Assoc-voc         40
                     Bachelors         62
                     Doctorate          1
                     HS-grad          825
                     Masters           10
                     Prof-school        3
                     Some-college     283
dtype: int64

As expected, Bachelors, HS-grads and some-college dominate here as well, but there are some interesting findings.
In Tech-support, Bachelors and some-college are the most common backgrounds, with HS-grads third, followed by Assoc-voc.
Bachelors, Masters and Doctorate holders favor Prof-specialty work; most Masters and Doctorates work in this field. They are also the ones who usually earn >50K, so this makes a lot of sense.

In [30]:
adult_data.groupby(['Occupation', 'Workclass']).size()
Out[30]:
Occupation          Workclass        
 ?                   ?                   1836
                     Never-worked           7
 Adm-clerical        Federal-gov          317
                     Local-gov            283
                     Private             2833
                     Self-emp-inc          31
                     Self-emp-not-inc      50
                     State-gov            253
                     Without-pay            3
 Armed-Forces        Federal-gov            9
 Craft-repair        Federal-gov           64
                     Local-gov            146
                     Private             3195
                     Self-emp-inc         106
                     Self-emp-not-inc     531
                     State-gov             56
                     Without-pay            1
 Exec-managerial     Federal-gov          180
                     Local-gov            214
                     Private             2691
                     Self-emp-inc         400
                     Self-emp-not-inc     392
                     State-gov            189
 Farming-fishing     Federal-gov            8
                     Local-gov             29
                     Private              455
                     Self-emp-inc          51
                     Self-emp-not-inc     430
                     State-gov             15
                     Without-pay            6
 Handlers-cleaners   Federal-gov           23
                     Local-gov             47
                     Private             1273
                     Self-emp-inc           2
                     Self-emp-not-inc      15
                     State-gov              9
                     Without-pay            1
 Machine-op-inspct   Federal-gov           14
                     Local-gov             12
                     Private             1913
                     Self-emp-inc          13
                     Self-emp-not-inc      36
                     State-gov             13
                     Without-pay            1
 Other-service       Federal-gov           35
                     Local-gov            193
                     Private             2740
                     Self-emp-inc          27
                     Self-emp-not-inc     175
                     State-gov            124
                     Without-pay            1
 Priv-house-serv     Private              149
 Prof-specialty      Federal-gov          175
                     Local-gov            705
                     Private             2313
                     Self-emp-inc         160
                     Self-emp-not-inc     373
                     State-gov            414
 Protective-serv     Federal-gov           28
                     Local-gov            304
                     Private              190
                     Self-emp-inc           5
                     Self-emp-not-inc       6
                     State-gov            116
 Sales               Federal-gov           14
                     Local-gov              7
                     Private             2942
                     Self-emp-inc         291
                     Self-emp-not-inc     385
                     State-gov             11
 Tech-support        Federal-gov           68
                     Local-gov             38
                     Private              736
                     Self-emp-inc           3
                     Self-emp-not-inc      26
                     State-gov             57
 Transport-moving    Federal-gov           25
                     Local-gov            115
                     Private             1266
                     Self-emp-inc          27
                     Self-emp-not-inc     122
                     State-gov             41
                     Without-pay            1
dtype: int64

Here again the private sector dominates, except in Farming-fishing, where Self-emp-not-inc (430) is nearly as common as Private (455). Since the self-employed-not-incorporated usually are not the ones earning >50K, farming is probably not a very profitable trade. The largest share of the people who work without pay are in this trade as well.

In [31]:
plt.figure(figsize=(15,10))
sns.countplot(x='Race', data=adult_data)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x14101430>
In [32]:
adult_data.groupby(['Race', 'Earning_potential']).size()
Out[32]:
Race                 Earning_potential
 Amer-Indian-Eskimo   <=50K                 275
                      >50K                   36
 Asian-Pac-Islander   <=50K                 763
                      >50K                  276
 Black                <=50K                2737
                      >50K                  387
 Other                <=50K                 246
                      >50K                   25
 White                <=50K               20699
                      >50K                 7117
dtype: int64

This is a White-dominated dataset. The ratio of <=50K to >50K earners within each race is roughly: Amer-Indian-Eskimo 7.6, Asian-Pac-Islander 2.8, Black 7.1, Other 9.8, White 2.9.

So White and Asian-Pac-Islander respondents have a much lower <=50K to >50K ratio, while the imbalance is far more pronounced for the remaining races, especially the group tagged 'Other'.
This may be related to education, so let's check that.
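
Those ratios come straight from the table above; computing them directly (a sketch, using the padded labels):

race_counts = adult_data.groupby(['Race', 'Earning_potential']).size().unstack()
print((race_counts[' <=50K'] / race_counts[' >50K']).round(1))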

In [33]:
adult_data.groupby(['Race', 'Education']).size()
Out[33]:
Race                 Education    
 Amer-Indian-Eskimo   10th              16
                      11th              14
                      12th               5
                      1st-4th            4
                      5th-6th            2
                      7th-8th            9
                      9th                5
                      Assoc-acdm         8
                      Assoc-voc         19
                      Bachelors         21
                      Doctorate          3
                      HS-grad          119
                      Masters            5
                      Prof-school        2
                      Some-college      79
 Asian-Pac-Islander   10th              13
                      11th              21
                      12th               9
                      1st-4th            5
                      5th-6th           18
                      7th-8th           11
                      9th                9
                      Assoc-acdm        29
                      Assoc-voc         38
                      Bachelors        289
                      Doctorate         28
                      HS-grad          226
                      Masters           88
                      Preschool          6
                      Prof-school       41
                      Some-college     208
 Black                10th             133
                      11th             153
                      12th              70
                      1st-4th           16
                      5th-6th           21
                      7th-8th           56
                      9th               89
                      Assoc-acdm       107
                      Assoc-voc        112
                      Bachelors        330
                      Doctorate         11
                      HS-grad         1174
                      Masters           86
                      Preschool          5
                      Prof-school       15
                      Some-college     746
 Other                10th               9
                      11th              10
                      12th              14
                      1st-4th            9
                      5th-6th           13
                      7th-8th           17
                      9th                8
                      Assoc-acdm         8
                      Assoc-voc          6
                      Bachelors         33
                      Doctorate          2
                      HS-grad           78
                      Masters            7
                      Preschool          2
                      Prof-school        4
                      Some-college      51
 White                10th             762
                      11th             977
                      12th             335
                      1st-4th          134
                      5th-6th          279
                      7th-8th          553
                      9th              403
                      Assoc-acdm       915
                      Assoc-voc       1207
                      Bachelors       4682
                      Doctorate        369
                      HS-grad         8904
                      Masters         1537
                      Preschool         38
                      Prof-school      514
                      Some-college    6207
dtype: int64

We see that most White respondents are HS-grads, and a lot of them have attended some college; the next largest groups are Bachelors and Masters.
Most people in the Other category do not hold higher degrees, which might explain their low income potential.
A large share of the Black category are HS-grads; since HS-grads skew heavily towards <=50K, this may explain why their ratio is so high.
For Asian-Pac-Islanders, a comparatively large share hold Masters degrees (the count is not low relative to their other degrees), which may have contributed to their higher income potential.

In [34]:
plt.figure(figsize=(15,10))
sns.countplot(adult_data['Sex'])
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x14ae7f58>
In [35]:
adult_data.groupby(['Sex', 'Earning_potential']).size()
Out[35]:
Sex      Earning_potential
 Female   <=50K                9592
          >50K                 1179
 Male     <=50K               15128
          >50K                 6662
dtype: int64

We see that the ratio of males paid <=50K to those paid >50K is around 2.3, but for women the same ratio rises to 8.1. This suggests women are being paid less. We have to check the same ratio per occupation to see whether women tend to work in occupations that usually pay less; this can go both ways.
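A hedged sketch of the per-sex shares using pandas' crosstab (normalize='index' makes each row sum to 1, so the imbalance is visible at a glance):

print(pd.crosstab(adult_data['Sex'], adult_data['Earning_potential'],
                  normalize='index').round(3))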

In [36]:
adult_data.groupby(['Education', 'Sex']).size()
Out[36]:
Education      Sex    
 10th           Female     295
                Male       638
 11th           Female     432
                Male       743
 12th           Female     144
                Male       289
 1st-4th        Female      46
                Male       122
 5th-6th        Female      84
                Male       249
 7th-8th        Female     160
                Male       486
 9th            Female     144
                Male       370
 Assoc-acdm     Female     421
                Male       646
 Assoc-voc      Female     500
                Male       882
 Bachelors      Female    1619
                Male      3736
 Doctorate      Female      86
                Male       327
 HS-grad        Female    3390
                Male      7111
 Masters        Female     536
                Male      1187
 Preschool      Female      16
                Male        35
 Prof-school    Female      92
                Male       484
 Some-college   Female    2806
                Male      4485
dtype: int64

We see that women's share is largest at the lower levels of education and among HS-grads, some-college and Bachelors, and drops sharply for Prof-school and Doctorate. This may suggest that many women stop their education during or just after school.

In [37]:
adult_data.groupby(['Occupation', 'Sex']).size()
Out[37]:
Occupation          Sex    
 ?                   Female     841
                     Male      1002
 Adm-clerical        Female    2537
                     Male      1233
 Armed-Forces        Male         9
 Craft-repair        Female     222
                     Male      3877
 Exec-managerial     Female    1159
                     Male      2907
 Farming-fishing     Female      65
                     Male       929
 Handlers-cleaners   Female     164
                     Male      1206
 Machine-op-inspct   Female     550
                     Male      1452
 Other-service       Female    1800
                     Male      1495
 Priv-house-serv     Female     141
                     Male         8
 Prof-specialty      Female    1515
                     Male      2625
 Protective-serv     Female      76
                     Male       573
 Sales               Female    1263
                     Male      2387
 Tech-support        Female     348
                     Male       580
 Transport-moving    Female      90
                     Male      1507
dtype: int64

Women dominate Adm-clerical, Other-service and Priv-house-serv; however, in all of these occupations the majority of people are paid <=50K.
We should check the numeric columns, plot a heatmap and look at the correlation of every column with the target to get a better idea of the dataset.

Numerical Column Analysis

In [38]:
numerical_columns
Out[38]:
['Age',
 'fnlwgt',
 'Education-num',
 'Capital-gain',
 'Capital-loss',
 'hrs_per_week']
In [39]:
# for i in range(len(numerical_columns)):
#     plt.figure(figsize=(15,10))
#     sns.distplot(adult_data[numerical_columns[i]])
# plt.show() 

# Too many graphs
In [40]:
# for i in range(len(numerical_columns)):
#     plt.figure(figsize=(15,10))
#     sns.boxplot(adult_data[numerical_columns[i]])
# plt.show() 

# Too many graphs
In [41]:
#variance of the numerical columns only
adult_data.loc[:, numerical_columns].var()
Out[41]:
Age              1.860614e+02
fnlwgt           1.114080e+10
Education-num    6.618890e+00
Capital-gain     5.454254e+07
Capital-loss     1.623769e+05
hrs_per_week     1.524590e+02
dtype: float64

The scientific notation is hard to read, so let's print these as plain decimal values.

In [42]:
var_in_float = adult_data.loc[:, numerical_columns].var()
for i in range(len(numerical_columns)):
    print('{} \t\t {}'.format(numerical_columns[i], round(float(var_in_float[i]), 3)))
Age 		 186.061
fnlwgt 		 11140797791.842
Education-num 		 6.619
Capital-gain 		 54542539.178
Capital-loss 		 162376.938
hrs_per_week 		 152.459

We see that fnlwgt, Capital-gain and Capital-loss have the highest variance. That can happen either because these columns carry a lot of information or because they contain a few very extreme values. Let's check.

In [43]:
plt.figure(figsize=(15,10))
sns.distplot(adult_data['fnlwgt'])
plt.show()

Do not be fooled by the tiny 0.2 steps: the axis carries a 1e6 multiplier (1.0e+06 = 1,000,000), so that 0.2 is actually 200,000. That would explain the high variance.

In [44]:
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['fnlwgt'])
plt.show()

We see a large number of outliers here. The median sits around the 0.2 * 10^6 mark, but many points lie well beyond the upper whisker. We will have to treat this column for outliers.

In [45]:
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Capital-gain'])
plt.show()

This graph is quite interesting. Most of the data sits near zero, some falls in the 5k-20k range, and there is some data around 100,000 as well! Outliers like that throw off the variance by a lot. We need to deal with them eventually, or the models we build later will not predict well.

In [46]:
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Capital-gain'])
plt.show()

That is a lot of outliers: almost all of the data is centered at 0, implying very few people had any capital gain. Without much capital gain it is difficult to break the <=50K barrier, which would help explain why so many people in the survey have an income potential of <=50K.
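As a quick numeric check of this sparsity, a sketch using the scipy stats module imported at the top (the |z| > 3 cut-off is just a common convention, not something this notebook established):

zero_share = (adult_data['Capital-gain'] == 0).mean()
z = np.abs(stats.zscore(adult_data['Capital-gain']))
# share of exact zeros, and count of extreme points by z-score
print('zeros: {:.1%}, |z| > 3: {}'.format(zero_share, int((z > 3).sum())))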

In [47]:
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Capital-loss'])
plt.show()

We again see the data is centered towards 0 with some outliers near 2000. We will have to clean or scale this data.

In [48]:
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Capital-loss'])
plt.show()

Again, a large number of people report no capital loss, just as a large number reported no capital gain. So perhaps people in our sample do not invest, have passive income or take risks. That is a little sad to see, but at least there were no large losses: the highest loss is somewhere in the 5,000 range.

In [49]:
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Education-num'])
plt.show()
In [50]:
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Education-num'])
plt.show()

We see that most people fall within the 9-12 range, the HS-grad part of the scale, with the distribution skewed towards the left. A few people are well below that threshold, at education numbers of 4 down to 1.

In [51]:
plt.figure(figsize=(15,10))
sns.distplot(adult_data['hrs_per_week'])
plt.show()

We see that a lot of people work around 40 hours per week. Some sit towards the 0 end of the graph; these may be people who work very few hours or who work without pay. We will check for those as well.

In [52]:
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['hrs_per_week'])
plt.show()

We see most people work roughly 40-hour weeks, with the bulk falling between 30 and 50. However, plenty of people work far more or far less than that, and some even work 100-hour weeks! They are either very passionate or having a very bad time.

In [53]:
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Age'])
plt.show()

We see that Age is right-skewed, as more of the workers in this survey are young.

In [54]:
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Age'])
plt.show()

We see that most people who work are between 17 and a little under 80. Hats off to those working at 80, and to those working well beyond it into their early 90s. It is fascinating to see people work at that age; they must be very passionate about what they do, or there may be something sadder at play.


Correlation between the numeric columns.

In [55]:
adult_data.corr()
Out[55]:
Age fnlwgt Education-num Capital-gain Capital-loss hrs_per_week
Age 1.000000 -0.076646 0.036527 0.077674 0.057775 0.068756
fnlwgt -0.076646 1.000000 -0.043195 0.000432 -0.010252 -0.018768
Education-num 0.036527 -0.043195 1.000000 0.122630 0.079923 0.148123
Capital-gain 0.077674 0.000432 0.122630 1.000000 -0.031615 0.078409
Capital-loss 0.057775 -0.010252 0.079923 -0.031615 1.000000 0.054256
hrs_per_week 0.068756 -0.018768 0.148123 0.078409 0.054256 1.000000

We see no strong correlation between any of these columns. That does not mean none of the data is correlated; we just haven't found that correlation yet.
Let's encode and scale our data so that our models have an easier time working with it.

Preprocessing

Null Value Treatment


Filling the null columns.

In [56]:
null_columns = adult_data.columns[adult_data.isnull().any()]
adult_data[null_columns].isnull().sum()
Out[56]:
Series([], dtype: float64)

Checking for the mode of the null columns. We can check the mode directly, instead of checking the type of the columns, because we established at the beginning that only 3 categorical columns have missing values. Note, though, that isnull() found nothing above: the missing entries in this file appear as the string '?' (see the Occupation counts earlier), not as NaN, so the fill loop below only has an effect once those are converted.
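A sketch of that conversion (hedged: the exact placeholder may carry a leading space depending on how the CSV was read, hence the regex):

# turn '?' placeholders (with optional surrounding whitespace) into NaN
adult_data = adult_data.replace(r'^\s*\?\s*$', np.nan, regex=True)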

In [57]:
# adult_data.loc[:, null_columns].mode()
In [58]:
#checking for dataset info before replacing columns.
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Age                32561 non-null  int64 
 1   Workclass          32561 non-null  object
 2   fnlwgt             32561 non-null  int64 
 3   Education          32561 non-null  object
 4   Education-num      32561 non-null  int64 
 5   Marital_Status     32561 non-null  object
 6   Occupation         32561 non-null  object
 7   Relationship       32561 non-null  object
 8   Race               32561 non-null  object
 9   Sex                32561 non-null  object
 10  Capital-gain       32561 non-null  int64 
 11  Capital-loss       32561 non-null  int64 
 12  hrs_per_week       32561 non-null  int64 
 13  Native-Country     32561 non-null  object
 14  Earning_potential  32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB
In [59]:
for i in list(null_columns):
    adult_data[i].fillna(adult_data[i].mode().values[0],inplace=True)
In [60]:
print('{null_sum} \n\n {adult_data_info}'.format(null_sum=adult_data.isna().sum(), adult_data_info=adult_data.info()))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Age                32561 non-null  int64 
 1   Workclass          32561 non-null  object
 2   fnlwgt             32561 non-null  int64 
 3   Education          32561 non-null  object
 4   Education-num      32561 non-null  int64 
 5   Marital_Status     32561 non-null  object
 6   Occupation         32561 non-null  object
 7   Relationship       32561 non-null  object
 8   Race               32561 non-null  object
 9   Sex                32561 non-null  object
 10  Capital-gain       32561 non-null  int64 
 11  Capital-loss       32561 non-null  int64 
 12  hrs_per_week       32561 non-null  int64 
 13  Native-Country     32561 non-null  object
 14  Earning_potential  32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB
Age                  0
Workclass            0
fnlwgt               0
Education            0
Education-num        0
Marital_Status       0
Occupation           0
Relationship         0
Race                 0
Sex                  0
Capital-gain         0
Capital-loss         0
hrs_per_week         0
Native-Country       0
Earning_potential    0
dtype: int64 

 None

Now that we have treated our data and cleaned our null values, we can go ahead and encode our data.

Label Encoding

Label encoding our categorical columns. One-hot encoding is an option too, but that is a whole different thing in itself.
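For reference, a one-hot sketch (assuming Earning_potential is the last entry of categorical_columns, as the preview below suggests; the target itself should stay label-encoded):

# one binary column per category value; widens the frame considerably
one_hot_data = pd.get_dummies(adult_data, columns=categorical_columns[:-1])
print(one_hot_data.shape)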

In [61]:
adult_data[categorical_columns].head()
Out[61]:
Workclass Education Marital_Status Occupation Relationship Race Sex Native-Country Earning_potential
0 State-gov Bachelors Never-married Adm-clerical Not-in-family White Male United-States <=50K
1 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White Male United-States <=50K
2 Private HS-grad Divorced Handlers-cleaners Not-in-family White Male United-States <=50K
3 Private 11th Married-civ-spouse Handlers-cleaners Husband Black Male United-States <=50K
4 Private Bachelors Married-civ-spouse Prof-specialty Wife Black Female Cuba <=50K
In [62]:
label_encoder = LabelEncoder()
encoded_adult_data = adult_data.copy()  #copy, otherwise we would also overwrite adult_data in place
for i in categorical_columns:
    encoded_adult_data[i] = label_encoder.fit_transform(adult_data[i])
encoded_adult_data[categorical_columns].head()
Out[62]:
Workclass Education Marital_Status Occupation Relationship Race Sex Native-Country Earning_potential
0 7 9 4 1 1 4 1 39 0
1 6 9 2 4 0 4 1 39 0
2 4 11 0 6 1 4 1 39 0
3 4 1 2 6 0 2 1 39 0
4 4 9 2 10 5 2 0 5 0

Scaling the data

Usually we would try several types of scaling and keep whichever works best on the dataset.

In this case we will use MinMaxScaler.
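If we did want a comparison point, a standard-scaled variant can be built the same way (a sketch; StandardScaler gives zero mean and unit variance instead of a [0, 1] range):

from sklearn.preprocessing import StandardScaler

std_scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded_adult_data[numerical_columns]),
    columns=numerical_columns)
print(std_scaled.describe().loc[['mean', 'std']].round(2))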

In [63]:
min_max_scaler = MinMaxScaler()

scaled_encoded_adult_data = pd.DataFrame()

column_values = encoded_adult_data.columns.values
column_values = column_values[:-1]
print(column_values[-1])

scaled_values = min_max_scaler.fit_transform(encoded_adult_data[column_values])

for i in range(len(column_values)):
    scaled_encoded_adult_data[column_values[i]] = scaled_values[:,i]
    
scaled_encoded_adult_data['Earning_potential'] = encoded_adult_data['Earning_potential']
scaled_encoded_adult_data.sample(10)

# encoded_adult_data.head()
Native-Country
Out[63]:
Age Workclass fnlwgt Education Education-num Marital_Status Occupation Relationship Race Sex Capital-gain Capital-loss hrs_per_week Native-Country Earning_potential
10715 0.726027 0.00 0.058099 0.733333 0.533333 1.000000 0.000000 0.8 1.0 0.0 0.00000 0.0 0.030612 0.95122 0
2302 0.123288 0.50 0.254207 1.000000 0.600000 0.000000 0.928571 0.2 1.0 1.0 0.00000 0.0 0.602041 0.95122 0
9946 0.424658 0.50 0.108340 0.600000 0.800000 0.000000 0.571429 0.8 1.0 0.0 0.00000 0.0 0.561224 0.95122 1
15011 0.041096 0.50 0.138958 1.000000 0.600000 0.666667 0.785714 0.6 1.0 1.0 0.00000 0.0 0.397959 0.95122 0
24169 0.561644 0.75 0.102391 0.600000 0.800000 0.000000 0.714286 0.2 1.0 0.0 0.00000 0.0 0.397959 0.95122 0
15716 0.397260 0.50 0.096885 1.000000 0.600000 0.000000 0.857143 0.8 1.0 0.0 0.00000 0.0 0.397959 0.95122 0
16933 0.328767 0.50 0.072657 0.733333 0.533333 0.333333 0.571429 0.0 1.0 1.0 0.00000 0.0 0.397959 0.95122 0
9138 0.136986 0.00 0.033348 0.600000 0.800000 0.333333 0.000000 0.0 1.0 1.0 0.00000 0.0 0.142857 0.95122 0
29289 0.178082 0.50 0.216741 0.466667 0.733333 0.666667 0.214286 0.2 1.0 1.0 0.04787 0.0 0.500000 0.95122 1
13628 0.178082 0.50 0.096328 0.600000 0.800000 0.666667 0.857143 0.2 1.0 1.0 0.00000 0.0 0.653061 0.95122 0
In [64]:
scaled_encoded_adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                32561 non-null  float64
 1   Workclass          32561 non-null  float64
 2   fnlwgt             32561 non-null  float64
 3   Education          32561 non-null  float64
 4   Education-num      32561 non-null  float64
 5   Marital_Status     32561 non-null  float64
 6   Occupation         32561 non-null  float64
 7   Relationship       32561 non-null  float64
 8   Race               32561 non-null  float64
 9   Sex                32561 non-null  float64
 10  Capital-gain       32561 non-null  float64
 11  Capital-loss       32561 non-null  float64
 12  hrs_per_week       32561 non-null  float64
 13  Native-Country     32561 non-null  float64
 14  Earning_potential  32561 non-null  int32  
dtypes: float64(14), int32(1)
memory usage: 3.6 MB
In [65]:
scaled_encoded_adult_data.describe().T
Out[65]:
count mean std min 25% 50% 75% max
Age 32561.0 0.295639 0.186855 0.0 0.150685 0.273973 0.424658 1.0
Workclass 32561.0 0.483612 0.181995 0.0 0.500000 0.500000 0.500000 1.0
fnlwgt 32561.0 0.120545 0.071685 0.0 0.071679 0.112788 0.152651 1.0
Education 32561.0 0.686547 0.258018 0.0 0.600000 0.733333 0.800000 1.0
Education-num 32561.0 0.605379 0.171515 0.0 0.533333 0.600000 0.733333 1.0
Marital_Status 32561.0 0.435306 0.251037 0.0 0.333333 0.333333 0.666667 1.0
Occupation 32561.0 0.469481 0.302061 0.0 0.214286 0.500000 0.714286 1.0
Relationship 32561.0 0.289272 0.321354 0.0 0.000000 0.200000 0.600000 1.0
Race 32561.0 0.916464 0.212201 0.0 1.000000 1.000000 1.000000 1.0
Sex 32561.0 0.669205 0.470506 0.0 0.000000 1.000000 1.000000 1.0
Capital-gain 32561.0 0.010777 0.073854 0.0 0.000000 0.000000 0.000000 1.0
Capital-loss 32561.0 0.020042 0.092507 0.0 0.000000 0.000000 0.000000 1.0
hrs_per_week 32561.0 0.402423 0.125994 0.0 0.397959 0.397959 0.448980 1.0
Native-Country 32561.0 0.895582 0.190824 0.0 0.951220 0.951220 0.951220 1.0
Earning_potential 32561.0 0.240810 0.427581 0.0 0.000000 0.000000 0.000000 1.0

Outlier detection

In [66]:
for i in range(len(numerical_columns)):
    plt.figure(figsize=(15,10))
    sns.boxplot(scaled_encoded_adult_data[numerical_columns[i]])
plt.show() 

As the graphs above show, scaling changes only the range of the values, not the shape of the distribution, and it does not deal with the outliers either. We have to take care of those separately.

If a column is continuous, we replace its outliers with the median; if it is categorical, we replace them with the mode.
We have already established which columns are numeric and which are categorical, so it will be easy to deal with them now.

Outlier Treatment -> Replace with median

In [67]:
def outlier_detector(datacolumn):
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    return lower_bound, upper_bound
# Takes a column of the dataframe (a Series), computes its 25th and 75th
# percentiles, and returns the 1.5*IQR lower and upper bounds.
In [68]:
lowerbound, upperbound = outlier_detector(scaled_encoded_adult_data['Age'])
lowerbound, upperbound
Out[68]:
(-0.2602739726027397, 0.8356164383561644)
In [69]:
scaled_encoded_adult_data[(scaled_encoded_adult_data.Age < lowerbound) | (scaled_encoded_adult_data.Age > upperbound)]
Out[69]:
Age Workclass fnlwgt Education Education-num Marital_Status Occupation Relationship Race Sex Capital-gain Capital-loss hrs_per_week Native-Country Earning_potential
74 0.849315 0.500 0.076377 1.000000 0.600000 0.333333 0.714286 0.4 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
222 1.000000 0.500 0.026799 0.733333 0.533333 0.666667 0.571429 0.2 0.50 1.0 0.000000 0.506428 0.397959 0.951220 0
430 0.863014 0.000 0.064844 0.733333 0.533333 1.000000 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.234694 0.951220 0
918 0.876712 0.750 0.084064 0.733333 0.533333 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.295918 0.951220 0
1040 1.000000 0.500 0.084713 0.733333 0.533333 0.666667 0.571429 0.2 1.00 0.0 0.000000 0.000000 0.397959 0.951220 0
1168 0.972603 0.750 0.131760 0.933333 0.933333 0.333333 0.714286 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
1935 1.000000 0.500 0.142315 0.600000 0.800000 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.448980 0.951220 0
2303 1.000000 0.500 0.027235 1.000000 0.600000 0.666667 0.571429 0.2 0.25 1.0 0.000000 0.000000 0.346939 0.951220 0
2754 0.863014 0.750 0.116848 0.333333 0.200000 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.295918 0.951220 0
2891 1.000000 0.500 0.108441 1.000000 0.600000 0.833333 0.071429 0.6 1.00 0.0 0.000000 0.000000 0.397959 0.804878 0
2906 0.876712 0.500 0.069535 0.400000 0.266667 1.000000 0.642857 0.2 0.50 0.0 0.020620 0.000000 0.040816 0.951220 0
3211 0.890411 0.000 0.011652 0.333333 0.200000 1.000000 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.040816 0.951220 0
3338 0.849315 0.000 0.089817 0.733333 0.533333 1.000000 0.000000 0.2 0.50 0.0 0.000000 0.000000 0.295918 0.951220 0
3537 0.876712 0.750 0.084713 0.733333 0.533333 1.000000 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.193878 0.951220 0
3777 0.863014 0.500 0.051095 1.000000 0.600000 0.666667 0.714286 0.2 1.00 1.0 0.000000 0.416896 0.602041 0.951220 0
3963 0.904110 0.000 0.162770 0.733333 0.533333 1.000000 0.000000 0.2 1.00 0.0 0.000000 0.000000 0.193878 0.951220 0
4070 1.000000 0.500 0.204901 0.066667 0.400000 0.666667 0.428571 0.6 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
4109 1.000000 0.000 0.165869 0.600000 0.800000 1.000000 0.000000 0.4 1.00 0.0 0.009910 0.000000 0.091837 0.951220 0
4720 0.849315 0.750 0.099180 0.733333 0.533333 1.000000 0.571429 0.2 1.00 0.0 0.000000 0.000000 0.234694 0.951220 0
4834 0.876712 0.500 0.238936 1.000000 0.600000 0.000000 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.193878 0.951220 0
5104 1.000000 0.500 0.027235 1.000000 0.600000 0.666667 0.571429 0.2 0.25 1.0 0.000000 0.000000 0.346939 0.951220 0
5272 1.000000 0.500 0.087932 0.400000 0.266667 0.666667 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.397959 0.951220 0
5291 0.863014 0.500 0.098812 0.800000 0.866667 1.000000 0.714286 0.2 1.00 0.0 0.000000 0.000000 0.091837 0.951220 0
5370 1.000000 0.250 0.146365 0.800000 0.866667 0.333333 0.285714 0.0 1.00 1.0 0.200512 0.000000 0.602041 0.951220 1
5406 1.000000 0.500 0.026799 0.800000 0.866667 0.666667 0.285714 0.2 0.50 1.0 0.000000 0.000000 0.500000 0.951220 1
6000 0.849315 0.250 0.049124 0.066667 0.400000 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.061224 0.951220 0
6173 0.849315 0.500 0.073635 0.666667 1.000000 0.333333 0.714286 0.0 1.00 1.0 0.200512 0.000000 0.346939 0.195122 1
6214 0.917808 0.750 0.096964 0.600000 0.800000 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
6232 1.000000 0.750 0.097592 0.600000 0.800000 0.333333 0.714286 0.0 1.00 1.0 0.105661 0.000000 0.500000 0.951220 0
6439 0.863014 0.500 0.161434 0.333333 0.200000 1.000000 0.571429 0.2 1.00 0.0 0.000000 0.000000 0.234694 0.951220 0
6624 1.000000 0.500 0.204901 0.066667 0.400000 0.333333 0.214286 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
6748 0.876712 0.500 0.074956 1.000000 0.600000 0.333333 0.857143 0.0 1.00 1.0 0.000000 0.000000 0.142857 0.951220 0
7481 0.904110 0.500 0.124436 0.733333 0.533333 1.000000 0.785714 0.2 1.00 1.0 0.000000 0.000000 0.551020 0.951220 0
7720 0.917808 0.500 0.155377 0.800000 0.866667 0.666667 0.714286 0.2 1.00 1.0 0.000000 0.000000 0.663265 0.951220 0
7872 0.876712 0.000 0.102279 0.733333 0.533333 0.000000 0.000000 0.2 1.00 0.0 0.000000 0.000000 0.346939 0.951220 0
8176 0.849315 0.500 0.074050 1.000000 0.600000 0.333333 0.071429 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
8381 0.931507 0.500 0.070007 0.733333 0.533333 1.000000 0.857143 0.8 1.00 1.0 0.000000 0.000000 0.346939 0.951220 0
8431 0.890411 0.500 0.094989 0.333333 0.200000 0.333333 0.428571 0.0 1.00 1.0 0.000000 0.000000 0.010204 0.951220 0
8522 0.849315 0.500 0.217971 0.733333 0.533333 0.500000 0.714286 0.2 1.00 1.0 0.000000 0.000000 0.051020 0.951220 0
8694 0.863014 0.000 0.011366 0.933333 0.933333 0.333333 0.000000 0.0 1.00 1.0 0.106051 0.000000 0.091837 0.951220 1
8806 1.000000 0.500 0.050996 0.933333 0.933333 0.333333 0.714286 0.0 1.00 1.0 0.200512 0.000000 0.724490 0.951220 1
8963 1.000000 0.000 0.043987 0.733333 0.533333 1.000000 0.000000 0.2 1.00 0.0 0.000000 1.000000 0.397959 0.951220 0
8973 1.000000 0.500 0.023431 0.600000 0.800000 0.333333 0.857143 0.0 1.00 1.0 0.093861 0.000000 0.142857 0.951220 1
9471 0.917808 0.250 0.102824 0.733333 0.533333 1.000000 0.285714 0.2 1.00 0.0 0.000000 0.000000 0.326531 0.951220 0
10124 0.863014 0.750 0.014979 0.333333 0.200000 1.000000 0.357143 0.2 1.00 1.0 0.000000 0.000000 0.346939 0.951220 0
10210 1.000000 0.750 0.183243 1.000000 0.600000 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
10545 1.000000 0.500 0.110842 0.733333 0.533333 0.333333 0.214286 0.0 1.00 1.0 0.093861 0.000000 0.500000 0.170732 1
11099 0.849315 0.000 0.103859 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
11238 0.917808 0.625 0.109087 1.000000 0.600000 0.333333 0.857143 0.0 1.00 1.0 0.000000 0.000000 0.346939 0.951220 1
11512 1.000000 0.500 0.050937 0.733333 0.533333 0.666667 0.571429 0.6 1.00 0.0 0.000000 0.000000 0.234694 0.951220 0
11532 0.849315 0.000 0.088348 0.933333 0.933333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.091837 0.951220 0
11731 1.000000 0.000 0.018703 0.733333 0.533333 1.000000 0.000000 0.2 1.00 1.0 0.004010 0.000000 0.030612 0.951220 0
11996 1.000000 0.500 0.019086 0.600000 0.800000 0.666667 0.285714 0.2 1.00 1.0 0.000000 0.000000 0.551020 0.951220 0
12451 1.000000 0.000 0.144509 1.000000 0.600000 0.666667 0.000000 0.6 0.25 1.0 0.000000 0.000000 0.091837 0.853659 0
12492 0.890411 0.000 0.027598 1.000000 0.600000 1.000000 0.000000 0.2 0.00 1.0 0.000000 0.000000 0.020408 0.951220 0
12830 0.876712 0.500 0.128437 0.800000 0.866667 1.000000 0.714286 0.8 1.00 1.0 0.000000 0.000000 0.602041 0.000000 0
12975 1.000000 0.500 0.162010 0.000000 0.333333 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
13025 0.917808 0.000 0.242213 0.266667 0.133333 1.000000 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.142857 0.951220 0
13026 0.849315 0.000 0.060170 0.466667 0.733333 0.333333 0.000000 1.0 1.00 0.0 0.000000 0.000000 0.010204 0.951220 1
13295 0.876712 0.500 0.060030 0.200000 0.066667 0.333333 0.714286 0.0 1.00 1.0 0.000000 0.000000 0.142857 0.756098 0
13696 0.890411 0.750 0.154987 0.733333 0.533333 0.333333 0.214286 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.121951 0
13928 0.876712 0.750 0.075844 0.600000 0.800000 1.000000 0.714286 0.2 1.00 0.0 0.000000 0.382920 0.020408 0.439024 0
14104 0.863014 0.500 0.092595 0.933333 0.933333 0.333333 0.714286 0.0 1.00 1.0 0.000000 0.000000 0.346939 0.951220 1
14159 1.000000 0.250 0.119167 0.466667 0.733333 0.333333 0.071429 0.0 0.25 1.0 0.000000 0.000000 0.193878 0.731707 0
14604 0.863014 0.500 0.109482 1.000000 0.600000 1.000000 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.193878 0.951220 0
14711 0.917808 0.250 0.083912 0.533333 0.666667 1.000000 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.132653 0.951220 0
14756 0.890411 0.500 0.081896 0.733333 0.533333 1.000000 0.285714 0.2 1.00 0.0 0.000000 1.000000 0.173469 0.951220 0
14903 0.849315 0.625 0.201700 0.733333 0.533333 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 1
15356 1.000000 0.500 0.053136 0.733333 0.533333 1.000000 1.000000 0.8 1.00 1.0 0.000000 0.000000 1.000000 0.951220 0
15662 0.917808 0.500 0.081852 0.733333 0.533333 1.000000 0.571429 0.2 1.00 0.0 0.000000 0.000000 0.122449 0.951220 0
15892 1.000000 0.500 0.052095 0.600000 0.800000 0.333333 0.285714 1.0 1.00 0.0 0.000000 0.000000 0.397959 0.219512 1
16302 0.904110 0.750 0.136905 0.733333 0.533333 1.000000 0.285714 0.2 1.00 1.0 0.000000 0.000000 0.071429 0.951220 0
16523 0.849315 0.000 0.102454 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 1
16762 0.876712 0.000 0.052111 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.173469 0.951220 0
16901 0.863014 0.750 0.060775 0.066667 0.400000 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.244898 0.951220 0
17609 0.849315 0.625 0.057590 0.133333 0.466667 1.000000 0.857143 0.2 1.00 1.0 0.184812 0.000000 0.448980 0.951220 1
18037 0.863014 0.750 0.081799 0.733333 0.533333 0.666667 0.285714 0.2 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
18141 0.849315 0.750 0.049370 0.533333 0.666667 0.333333 0.857143 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 1
18272 0.863014 0.500 0.050139 0.666667 1.000000 0.333333 0.714286 0.0 1.00 1.0 0.000000 0.000000 0.295918 0.951220 0
18277 1.000000 0.500 0.202998 0.600000 0.800000 0.333333 0.857143 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.000000 0
18413 1.000000 0.500 0.204740 0.600000 0.800000 0.666667 0.714286 0.6 1.00 0.0 0.000000 0.000000 0.091837 0.951220 0
18560 0.863014 0.000 0.109032 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.071429 0.951220 0
18725 1.000000 0.250 0.095976 0.733333 0.533333 0.333333 0.571429 0.0 1.00 1.0 0.067671 0.000000 0.397959 0.951220 0
18832 1.000000 0.500 0.069967 0.800000 0.866667 0.666667 0.285714 0.6 1.00 0.0 0.000000 0.000000 0.397959 0.951220 0
19045 0.876712 0.875 0.081443 0.200000 0.066667 1.000000 0.571429 0.2 1.00 0.0 0.000000 0.000000 0.193878 0.951220 0
19172 0.904110 0.625 0.176555 0.733333 0.533333 0.000000 0.857143 0.2 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
19180 0.890411 0.000 0.020476 0.000000 0.333333 1.000000 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
19212 1.000000 0.500 0.086507 1.000000 0.600000 0.000000 0.857143 0.8 0.50 0.0 0.000000 0.000000 0.367347 0.951220 0
19489 1.000000 0.500 0.049081 0.733333 0.533333 0.333333 0.500000 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
19495 0.876712 0.625 0.159565 0.000000 0.333333 0.333333 0.285714 1.0 1.00 0.0 0.029360 0.000000 0.275510 0.951220 0
19515 0.863014 0.250 0.005308 0.733333 0.533333 1.000000 0.571429 0.8 0.00 0.0 0.000000 0.000000 0.316327 0.951220 0
19689 0.863014 0.750 0.373569 0.733333 0.533333 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
19747 1.000000 0.500 0.145803 0.333333 0.200000 0.333333 0.500000 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
19828 0.849315 0.500 0.108621 0.333333 0.200000 1.000000 0.642857 0.2 1.00 0.0 0.029640 0.000000 0.295918 0.951220 0
20249 0.863014 0.500 0.163120 0.533333 0.666667 0.333333 0.214286 0.0 1.00 1.0 0.000000 0.000000 0.234694 0.951220 0
20421 0.890411 0.500 0.091987 0.266667 0.133333 1.000000 0.571429 0.8 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
20463 0.931507 0.750 0.104415 0.733333 0.533333 1.000000 0.857143 0.2 1.00 0.0 0.000000 0.000000 0.500000 0.951220 0
20482 0.863014 0.500 0.129174 0.733333 0.533333 0.500000 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.153061 0.951220 0
20483 0.849315 0.250 0.090979 0.666667 1.000000 1.000000 0.714286 0.8 1.00 0.0 0.000000 0.000000 0.397959 0.951220 0
20610 1.000000 0.500 0.132015 0.800000 0.866667 0.333333 0.714286 1.0 1.00 0.0 0.000000 0.000000 0.397959 0.951220 1
20826 0.876712 0.000 0.091558 0.600000 0.800000 1.000000 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.040816 0.951220 0
20880 0.849315 0.000 0.088213 0.333333 0.200000 0.333333 0.000000 0.0 1.00 1.0 0.014090 0.000000 0.346939 0.951220 0
20953 0.863014 0.000 0.110505 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.071429 0.048780 0
21343 0.849315 0.500 0.172392 0.600000 0.800000 0.333333 0.857143 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
21501 0.876712 0.500 0.112144 0.733333 0.533333 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.545684 0.255102 0.951220 1
21812 0.890411 0.000 0.123813 0.533333 0.666667 1.000000 0.000000 0.2 1.00 0.0 0.000000 0.000000 0.071429 0.951220 0
21835 0.972603 0.750 0.118724 0.933333 0.933333 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
22220 1.000000 0.500 0.027235 0.600000 0.800000 0.666667 0.714286 0.2 0.25 1.0 0.000000 0.000000 0.397959 0.951220 0
22481 0.890411 0.625 0.073432 1.000000 0.600000 1.000000 0.857143 0.2 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
22895 0.972603 0.500 0.038205 1.000000 0.600000 0.000000 0.071429 0.8 1.00 0.0 0.000000 0.000000 0.397959 0.951220 0
22898 0.917808 0.000 0.078034 0.266667 0.133333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
23459 0.904110 0.625 0.154755 0.000000 0.333333 0.333333 0.357143 0.0 1.00 1.0 0.200512 0.000000 0.500000 0.951220 1
23900 0.849315 0.750 0.062074 0.733333 0.533333 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
24027 0.945205 0.500 0.093470 0.800000 0.866667 0.666667 0.071429 0.2 1.00 0.0 0.000000 0.000000 0.397959 0.951220 0
24043 1.000000 0.750 0.047774 0.733333 0.533333 0.666667 0.285714 0.2 1.00 1.0 0.029640 0.000000 0.112245 0.951220 0
24238 1.000000 0.000 0.104629 0.200000 0.066667 1.000000 0.000000 0.2 0.50 0.0 0.000000 0.000000 0.397959 0.951220 0
24280 0.890411 0.625 0.080170 0.333333 0.200000 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.500000 0.951220 0
24395 0.904110 0.625 0.095691 0.600000 0.800000 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.549128 0.551020 0.951220 1
24560 0.876712 0.500 0.058292 0.600000 0.800000 1.000000 0.857143 0.2 1.00 1.0 0.000000 0.000000 0.500000 0.951220 1
25163 0.849315 0.000 0.043708 0.800000 0.866667 0.333333 0.000000 0.0 1.00 1.0 0.200512 0.000000 0.397959 0.756098 1
25303 1.000000 0.000 0.110810 0.333333 0.200000 0.833333 0.000000 0.2 1.00 0.0 0.000000 0.000000 0.142857 0.951220 0
25397 0.863014 0.000 0.054072 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.244898 0.951220 0
26012 0.876712 0.000 0.064166 1.000000 0.600000 1.000000 0.000000 0.8 1.00 0.0 0.000000 0.000000 0.030612 0.951220 0
26242 0.849315 0.625 0.116408 0.600000 0.800000 0.333333 0.857143 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 1
26731 0.917808 0.500 0.119560 0.733333 0.533333 1.000000 0.714286 0.2 1.00 0.0 0.000000 0.000000 0.153061 0.951220 0
27795 0.917808 0.500 0.255429 0.333333 0.200000 0.333333 0.714286 0.0 0.50 1.0 0.000000 0.000000 0.091837 0.951220 0
28176 0.849315 0.125 0.033884 0.666667 1.000000 1.000000 0.285714 0.2 1.00 1.0 0.000000 0.000000 0.051020 0.951220 1
28463 1.000000 0.125 0.124386 0.733333 0.533333 0.333333 0.214286 0.0 1.00 1.0 0.000000 0.000000 0.295918 0.951220 0
28721 0.863014 0.750 0.145072 0.733333 0.533333 0.333333 0.571429 0.0 1.00 1.0 0.014090 0.000000 0.397959 0.951220 0
28948 0.876712 0.500 0.079497 0.333333 0.200000 0.333333 1.000000 0.0 1.00 1.0 0.000000 0.000000 0.091837 0.951220 0
29594 0.876712 0.750 0.122894 0.200000 0.066667 1.000000 0.857143 0.4 1.00 1.0 0.000000 0.000000 0.448980 0.634146 0
29724 0.876712 0.000 0.052367 0.933333 0.933333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.234694 0.951220 1
31030 1.000000 0.500 0.024208 0.733333 0.533333 0.333333 0.500000 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 0
31432 0.958904 0.000 0.053010 0.733333 0.533333 1.000000 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.010204 0.951220 0
31696 1.000000 0.000 0.204901 0.733333 0.533333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.397959 0.951220 1
31814 0.863014 0.750 0.009902 0.333333 0.200000 0.666667 0.357143 0.8 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
31836 0.863014 0.000 0.183020 0.466667 0.733333 0.333333 0.000000 0.0 1.00 1.0 0.000000 0.000000 0.030612 0.951220 0
31855 0.890411 0.750 0.040174 0.733333 0.533333 0.333333 0.357143 0.0 1.00 1.0 0.000000 0.000000 0.193878 0.951220 0
32277 1.000000 0.500 0.204740 0.733333 0.533333 1.000000 0.071429 0.8 1.00 0.0 0.000000 0.000000 0.244898 0.951220 0
32367 1.000000 0.250 0.137399 0.333333 0.200000 0.333333 0.785714 0.0 1.00 1.0 0.026530 0.000000 0.397959 0.951220 0
32459 0.931507 0.500 0.058629 0.600000 0.800000 0.333333 0.285714 0.0 1.00 1.0 0.000000 0.000000 0.020408 0.756098 0
32494 0.890411 0.000 0.265974 0.733333 0.533333 0.666667 0.000000 0.2 1.00 1.0 0.000000 0.000000 0.020408 0.951220 0
32525 0.876712 0.000 0.073480 0.533333 0.666667 0.000000 0.000000 0.8 1.00 0.0 0.000000 0.000000 0.000000 0.000000 0

Looping the outlier_detector through all numerical columns and replacing the outliers with the median.

We should not treat the sparse columns, so we remove them from the outlier-treatment list.
In [70]:
new_columns = numerical_columns.copy()
new_columns.remove('Capital-gain') #Sparse column, must not be treated
new_columns.remove('Capital-loss') #Sparse column, must not be treated
new_columns
Out[70]:
['Age', 'fnlwgt', 'Education-num', 'hrs_per_week']
In [71]:
treated_scaled_encoded_adult_data = scaled_encoded_adult_data.copy()
for i in new_columns:
    lowerbound, upperbound = outlier_detector(treated_scaled_encoded_adult_data[i])
    median = treated_scaled_encoded_adult_data[i].median()
    treated_scaled_encoded_adult_data[i] = treated_scaled_encoded_adult_data[i].replace(
        to_replace = treated_scaled_encoded_adult_data[(treated_scaled_encoded_adult_data[i] < lowerbound) | 
                                                       (treated_scaled_encoded_adult_data[i] > upperbound)][i],
                                      value = median)
    print('{}: number of outliers: {}'.format(i,treated_scaled_encoded_adult_data[
        (treated_scaled_encoded_adult_data[i] < lowerbound) |
        (treated_scaled_encoded_adult_data[i] > upperbound)][i]))
Age: number of outliers: Series([], Name: Age, dtype: float64)
fnlwgt: number of outliers: Series([], Name: fnlwgt, dtype: float64)
Education-num: number of outliers: Series([], Name: Education-num, dtype: float64)
hrs_per_week: number of outliers: Series([], Name: hrs_per_week, dtype: float64)

Now that we have treated our outliers, we can now go ahead and plot a correlation heatmap.

In [72]:
fig,ax=plt.subplots(figsize=(20,15))
ax=sns.heatmap(treated_scaled_encoded_adult_data.corr(),annot=True)

From the heatmap we see that none of the columns are strongly correlated with each other,
i.e. no pair has a correlation value of >0.7 or <-0.7. So we must find another way to pick our features.
Selecting all features and the target column

In [73]:
print(all_columns)

features = all_columns[:-1]
target = treated_scaled_encoded_adult_data['Earning_potential']
print(features)
print(treated_scaled_encoded_adult_data.shape)
['Age', 'Workclass', 'fnlwgt', 'Education', 'Education-num', 'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'hrs_per_week', 'Native-Country', 'Earning_potential']
['Age', 'Workclass', 'fnlwgt', 'Education', 'Education-num', 'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'hrs_per_week', 'Native-Country']
(32561, 15)

We will now make a new dataframe and use it for our train test splitting.

Copy of main dataframe to use for model training

In [74]:
feature_df = treated_scaled_encoded_adult_data[features]
print(target.head())
feature_df.head()
0    0
1    0
2    0
3    0
4    0
Name: Earning_potential, dtype: int32
Out[74]:
Age Workclass fnlwgt Education Education-num Marital_Status Occupation Relationship Race Sex Capital-gain Capital-loss hrs_per_week Native-Country
0 0.301370 0.875 0.044302 0.600000 0.800000 0.666667 0.071429 0.2 1.0 1.0 0.02174 0.0 0.397959 0.951220
1 0.452055 0.750 0.048238 0.600000 0.800000 0.333333 0.285714 0.0 1.0 1.0 0.00000 0.0 0.397959 0.951220
2 0.287671 0.500 0.138113 0.733333 0.533333 0.000000 0.428571 0.2 1.0 1.0 0.00000 0.0 0.397959 0.951220
3 0.493151 0.500 0.151068 0.066667 0.400000 0.333333 0.428571 0.0 0.5 1.0 0.00000 0.0 0.397959 0.951220
4 0.150685 0.500 0.221488 0.600000 0.800000 0.333333 0.714286 1.0 0.5 0.0 0.00000 0.0 0.397959 0.121951

We will not be using PCA for feature extraction because, as we saw before, several columns have very high variance without necessarily contributing much signal. PCA ranks directions by variance, so it could end up picking high-variance data that has nothing to do with our problem.
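To sanity-check that reasoning, one could still inspect how the variance spreads across components (a sketch using the PCA import from the top of the notebook):

pca = PCA().fit(feature_df)
# components are ranked purely by variance explained, which says
# nothing about their relevance to Earning_potential
print(pca.explained_variance_ratio_.round(3))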

Train-validation-test splitting

In [75]:
x_train, x_test, y_train, y_test = train_test_split(feature_df, target, test_size=0.2)
In [76]:
print(x_train.shape,y_train.shape, x_test.shape, y_test.shape)
(26048, 14) (26048,) (6513, 14) (6513,)
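Note that the cell above gives only an 80/20 train-test split; the validation set named in this section's heading would need a second split. A sketch of a 60/20/20 version (the variable names and the pinned random_state are illustrative, not part of the notebook):

x_tr, x_tmp, y_tr, y_tmp = train_test_split(feature_df, target,
                                            test_size=0.4, random_state=42)
x_val, x_te, y_val, y_te = train_test_split(x_tmp, y_tmp,
                                            test_size=0.5, random_state=42)
print(x_tr.shape, x_val.shape, x_te.shape)  # roughly 60/20/20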

Model Building


We shall build models to check how well they perform after our data preprocessing.

Logistic Regression


We will start with logistic regression. Since the target column is binary, LogisticRegression can be used.

In [77]:
logistic_regressor = LogisticRegression()

logistic_regressor.fit(x_train, y_train)
Out[77]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [78]:
logistic_train_score = logistic_regressor.score(x_train, y_train)
logistic_test_score = logistic_regressor.score(x_test, y_test)
logistic_prediction = logistic_regressor.predict(x_test)

print('Train Score: {0}\nTest Score: {1}'.format(logistic_train_score, logistic_test_score))
Train Score: 0.8220976658476659
Test Score: 0.8288039306003377
In [79]:
logistic_mse = mean_squared_error(y_test, logistic_prediction)
logistic_rmse = np.sqrt(logistic_mse)
print(logistic_mse, logistic_rmse)
0.17119606939966223 0.41375846746581785

We see that our logistic regression does not perform especially well: about 0.83 test accuracy against a majority class that already accounts for roughly 0.76 of the rows. A linear model may simply be too restrictive for these features, so we will try other algorithms and test them.
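A quick way to quantify that bar is a majority-class baseline (a sketch; DummyClassifier is not used elsewhere in this notebook):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent').fit(x_train, y_train)
# ~0.76 on this data; a model has to clear this to be useful
print('Baseline test score: {}'.format(baseline.score(x_test, y_test)))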

KNN Classifier


Before we start building our KNN model, we need to check for which value of k the model has the least error. That will help us build a more optimal model.

In [80]:
error_rate = []
# Will take some time
k_values = list(filter(lambda x: x%2==1, range(0,50)))
for i in k_values:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
print(error_rate.index(np.min(error_rate)))
12

Since k_values holds the odd numbers 1, 3, 5, ..., an index of 12 corresponds to k = 2 * 12 + 1 = 25. Thus, the optimum value of k is 25.

This may change if we run this notebook again, because we have not set any random state.
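Rather than doing the index arithmetic by hand, the best k can be read straight from k_values (a sketch):

best_k = k_values[int(np.argmin(error_rate))]
print('best k:', best_k)  # 25 for the run above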
In [81]:
plt.figure(figsize=(10,10))
plt.plot(k_values,error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[81]:
Text(0, 0.5, 'Error Rate')

We see that k = 25 is the value with the least error. Thus, we will take n_neighbors to be 25 for our model.

In [82]:
knn_classifier = KNeighborsClassifier(n_neighbors=25)
knn_classifier.fit(x_train, y_train)
Out[82]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=25, p=2,
                     weights='uniform')
In [83]:
knn_train_score = knn_classifier.score(x_train, y_train)
knn_test_score = knn_classifier.score(x_test, y_test)

print('Train score: {}\nTest score: {}'.format(knn_train_score, knn_test_score))
Train score: 0.8459382678132679
Test score: 0.8315676339628435
In [84]:
knn_prediction = knn_classifier.predict(x_test)

knn_classifier_mse = mean_squared_error(y_test, knn_prediction)
knn_classifier_rmse = np.sqrt(knn_classifier_mse)

print('MSE: {}\nRMSE: {}'.format(knn_classifier_mse, knn_classifier_rmse))
MSE: 0.16843236603715644
RMSE: 0.4104051242822834

We see that KNN did a little better than logistic regression. We shall try other algorithms and see if they perform better still.

Support Vector Classifier

In [85]:
svc = SVC(kernel='rbf')
svc.fit(x_train, y_train)
Out[85]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [86]:
svc_train_score = svc.score(x_train, y_train)
svc_test_score = svc.score(x_test, y_test)

print('Train score: {}\nTest score: {}'.format(svc_train_score, svc_test_score))
Train score: 0.8443642506142506
Test score: 0.8506064793489944
In [87]:
svc_prediction = svc.predict(x_test)

svc_mse = mean_squared_error(y_test, svc_prediction)
svc_rmse = np.sqrt(svc_mse)

print('MSE: {}\nRMSE: {}'.format(svc_mse, svc_rmse))
MSE: 0.14939352065100567
RMSE: 0.3865145801273293

This is the best test score so far, but the accuracy is still modest. Maybe we need to revisit our feature extraction step.

Decision Tree Classifier

In [88]:
dtree_classifier = DecisionTreeClassifier(min_impurity_decrease = 0.05)
dtree_classifier.fit(x_train, y_train)
Out[88]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.05, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
In [89]:
dtree_train_score = dtree_classifier.score(x_train, y_train)
dtree_test_score = dtree_classifier.score(x_test, y_test)

print('Train score: {}\nTest score: {}'.format(dtree_train_score, dtree_test_score))
Train score: 0.7576781326781327
Test score: 0.7652387532627054
In [90]:
dtree_prediction = dtree_classifier.predict(x_test)

dtree_mse = mean_squared_error(y_test, dtree_prediction)
dtree_rmse = np.sqrt(dtree_mse)

print('MSE: {}\nRMSE: {}'.format(dtree_mse, dtree_rmse))
MSE: 0.23476124673729465
RMSE: 0.4845216679750191

Ensembling with Boosting:- AdaBoostClassifier

In [91]:
adaboost_classifier = AdaBoostClassifier(n_estimators=3)
adaboost_classifier.fit(x_train,y_train)
Out[91]:
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=3, random_state=None)
In [92]:
adaboost_train_score = adaboost_classifier.score(x_train,y_train)
adaboost_test_score = adaboost_classifier.score(x_test,y_test)
print('Train score: {}\nTest score: {}'.format(adaboost_train_score, adaboost_test_score))
Train score: 0.8374539312039312
Test score: 0.8426224474128666
In [93]:
adaboost_prediction = adaboost_classifier.predict(x_test)

adaboost_mse = mean_squared_error(y_test, adaboost_prediction)
adaboost_rmse = np.sqrt(adaboost_mse)

print('MSE: {}\nRMSE: {}'.format(adaboost_mse, adaboost_rmse))
MSE: 0.15737755258713343
RMSE: 0.3967083974245232

Ensembling with Bagging:- RandomForest Classifier

In [94]:
random_forest_classifier = RandomForestClassifier(n_estimators=20, min_samples_split=15, min_impurity_decrease=0.05)
random_forest_classifier.fit(x_train, y_train)
Out[94]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.05, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=15,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [95]:
random_forest_train_score = random_forest_classifier.score(x_train,y_train)
random_forest_test_score = random_forest_classifier.score(x_test,y_test)
print('Train score: {}\nTest score: {}'.format(random_forest_train_score, random_forest_test_score))
Train score: 0.7576781326781327
Test score: 0.7652387532627054
In [96]:
random_forest_prediction = random_forest_classifier.predict(x_test)

random_forest_mse = mean_squared_error(y_test, random_forest_prediction)
random_forest_rmse = np.sqrt(random_forest_mse)

print('MSE: {}\nRMSE: {}'.format(random_forest_mse, random_forest_rmse))
MSE: 0.23476124673729465
RMSE: 0.4845216679750191

Conclusion

Analysis


As soon as we look at the dataset we realize that this is a US-based survey. Mostly, people of White and Black ethnicity took part, but other ethnicities were present as well. The data was least skewed for the Asian-Pac-Islander group, where the ratio between people earning up to 50K and those earning more was lower than for other ethnicities. This dataset also contains information about more males than females, which may be because fewer women took the survey. The dataset is also biased towards people making <=50K USD.

As we went through the analysis, we found many interesting things. Most people go and find work right after high school. However, those who pursue a Bachelors or higher studies such as a Masters, a Doctorate or a specialization tend to earn more. Some people do not even make it through high school, and they almost always earn less than 50K, which might be down to a lack of skill, education, exposure or similar.

We noticed that there is barely any capital gain or capital loss for most people, which leads us to believe there is not a lot of economic growth in this sample. However, among those who do report movements, the gains are overwhelming compared to the losses.

With our heatmap we saw no mathematical correlation, but the other analysis methods surfaced some insightful information. From this survey we see that most women earn less than 50K, and it is not just women: racial minorities also seem to earn less.

People tend to work 40-hour weeks, but it is not unusual to see people working a lot more or a lot less, and the ages of working people range from 17 to over 90. It is interesting that people that old still work. Taken together, these two features tell us that people older than 60 usually tend to work less, and that most people who earn more than 50K work either long weeks or short weeks.

Model Evaluation

We see that none of the models perform very well, so we did not check any other metric such as the classification report or the confusion matrix. This may be because of the features we chose. There is very high variance within the data, which meant we should not use PCA for dimensionality reduction, as it would only choose features with high variance, and those do not necessarily have anything to do with our target. We would have to use neural networks or better feature extraction methods to get better results.
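Those metrics are cheap to produce if we ever want them; a sketch for the strongest scorer above (the SVC):

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, svc_prediction))
# with a roughly 76/24 class imbalance, per-class recall shows how much
# of the minority (>50K) class accuracy alone hides
print(classification_report(y_test, svc_prediction))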