import numpy as np
import pandas as pd
#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
#preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
#Checking Z value for outlier treatment
from scipy import stats
from sklearn.decomposition import PCA
#splitting the dataset into training and testing. 60:20:20
# 60% for training, 20% for validation and 20% for testing.
from sklearn.model_selection import train_test_split
#picking models for prediction.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
#ensemble models for better performance
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
#error evaluation
from sklearn.metrics import mean_squared_error
#ignore warning to make notebook prettier
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
#displays all rows and all columns without cutting anything.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
path = '../dataset/adult-earning-potential/adult_data.csv'
adult_data = pd.read_csv(path)
adult_data.head().T
We see that the data has no header row. The column names are in another CSV file, so we will read that file and use its columns as the header for this one.
adult_header = pd.read_csv('../dataset/adult-earning-potential/adult_names_head.csv')
print(adult_header)
col_names = list(adult_header.columns)
print('col_names: {} \n length: {}'.format(col_names, len(col_names)))
len(adult_data.columns) #checks the number of columns
We see that our header file is missing 4 columns: the header file has 11 columns while the data file has 15. We have to recheck the datasets to see which columns are missing.
After checking, we see that the final column header is:
['Age','Workclass','fnlwgt','Education','Education-num','Marital_Status','Occupation','Relationship','Race','Sex','Capital-gain','Capital-loss','hrs_per_week','Native-Country','Earning_potential']
We will read our data and add this list as our header.
data_header = ['Age','Workclass','fnlwgt','Education','Education-num','Marital_Status','Occupation','Relationship','Race','Sex','Capital-gain','Capital-loss','hrs_per_week','Native-Country','Earning_potential']
adult_data = pd.read_csv(path, names = data_header)
adult_data.head()
adult_data.info()
We see that there are no null values. But sometimes missing values are recorded as special characters instead of NaN, so we need to check for this. We will draw a small random sample to see whether any special characters appear in the data.
adult_data.sample(50)
From the random sample, we see that there are '?' entries in the dataset. These represent missing values, so we will search for the columns containing '?' and then deal with them.
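A quick way to do that search (a small sketch; values in the raw file may carry leading spaces, so we strip before comparing):
# Count how many cells in each column are just a '?' and show only the affected columns.
question_marks = adult_data.apply(lambda col: (col.astype(str).str.strip() == '?').sum())
print(question_marks[question_marks > 0])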
adult_data = adult_data.replace(to_replace=r'^\s*\?\s*$', value=np.nan, regex=True) #replaces every cell that is just a '?' (with or without surrounding spaces) with NaN
adult_data.isna().sum()
We now need to treat the missing data. For this, we will check the affected columns and replace the missing values with the mode if the column is categorical (provided the mode clearly dominates the other values) or with the median if the column is numeric.
Before that, we need to check the type of data in each column (categorical or numeric).
all_columns = list(adult_data.columns)
print('all_columns:\n {}'.format(all_columns))
categorical_columns = list(adult_data.select_dtypes(include=['object']).columns)
print('Categorical columns:\n {}'.format(categorical_columns))
numerical_columns = list(adult_data.select_dtypes(include=['int64', 'float64']).columns)
print('Numerical columns:\n {}'.format(numerical_columns))
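As a quick sanity check on the mode-imputation idea, a minimal sketch that looks at how dominant the most frequent value is in each column that still has missing entries:
# Share of the most frequent value in every column that has missing data;
# a clearly dominant mode makes mode-imputation a reasonable choice.
cols_with_nulls = adult_data.columns[adult_data.isnull().any()]
for col in cols_with_nulls:
    top_share = adult_data[col].value_counts(normalize=True).iloc[0]
    print('{}: mode = {}, share = {:.2%}'.format(col, adult_data[col].mode()[0], top_share))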
Now that we have checked the data and found the numeric and categorical columns, we can proceed with the analysis.
adult_data.info()
As we replaced the '?' entries with null values, the non-null counts now differ between columns. We have already counted the null values in those columns and will deal with them as well.
adult_data.describe().T
From the five-number summary given by describe, we see that the data in Capital-gain and Capital-loss is vastly spread out, with really high variance. This either means these columns carry a large portion of the information, or that they contain a lot of extreme values. In our case they contain extreme values, and they are 0 (zero) in a lot of rows.
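To back that up, a quick check (a small sketch) of what fraction of rows is exactly zero in these two columns:
# Fraction of rows where capital gain / capital loss is exactly zero.
for col in ['Capital-gain', 'Capital-loss']:
    zero_share = (adult_data[col] == 0).mean()
    print('{}: {:.1%} of rows are zero'.format(col, zero_share))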
The mean of the Age column is 38.58 and the median is 37 (50th percentile), suggesting that many of these people have families to take care of. This may be a factor to consider when reasoning about an individual's earning potential.
fnlwgt is the final sampling weight assigned to each person, which depends on the demographic group they belong to, so we should consider it together with the country they come from.
Hours of work per week also depend on the country and the field the person works in. Self-employed people may work more or fewer hours than employees, and the profession also influences the hours worked per week.
We will now explore the categorical columns.
len(categorical_columns)
We will be checking for the count of the categorical data and getting inferences from it.
plt.figure(figsize = (15,10))
sns.countplot(adult_data['Workclass'])
plt.show()
We realize that most of the people in the survey belong to the private sector. The data is skewed, as we have barely any information on the other kinds of workclasses.
adult_data['Workclass'].value_counts()
adult_data.groupby(['Workclass', 'Earning_potential']).size()
Everyone in the survey who has worked without pay earns <=50K because... well... they aren't paid xD
The same goes for people who have never worked. Maybe their families are filthy rich.
We do see that the private sector, which employs most of the people in the survey, also has over 3 times as many employees earning <=50K as earning >50K.
This may be because our data is skewed.
plt.figure(figsize = (15,10))
sns.countplot(adult_data[categorical_columns[1]])
plt.show()
This is an interesting graph, as most of the people in the survey are HS-grads, hold a Bachelor's degree or have attended some college.
adult_data['Education'].value_counts()
adult_data.groupby(['Education', 'Earning_potential']).size()
From this we understand that most people with a lower level of education usually earn less than 50K. There are exceptions, however, which may come down to experience, learning things on their own, or just being really good at their trade.
Only for Bachelors is the gap between people earning >50K and <=50K fairly low. That may be because a Bachelor's degree covers many trades, and talent and hard work usually pay off really well.
Everyone who has not studied beyond preschool earns less than 50K.
On the other hand, people who have pursued higher education such as a Masters or Doctorate are more likely to earn >50K. We need to analyze other factors to narrow this down.
adult_data.groupby(['Education', 'Workclass']).size()
This gives us insights into the relationship between workclass and education.
We see that the only people who work without pay are those with an Assoc-acdm, HS-grads, or those who attended some college.
We also see that, no matter the educational level, the largest number of people in each category work in the private sector.
The mode of each of those columns is the private sector as well.
plt.figure(figsize = (15,10))
sns.countplot(adult_data[categorical_columns[2]])
plt.show()
adult_data['Marital_Status'].value_counts()
adult_data.groupby(['Marital_Status', 'Earning_potential']).size()
We see that when a person is married to a civilian spouse, the difference between the two income-potential classes tends to be lower.
The biggest difference is among people who were never married: almost all of them earn less than 50K. This may be because they are relatively younger and so have less experience, or there can be a range of other factors (like education).
adult_data.groupby(['Marital_Status', 'Workclass']).size()
We again see that, across all marital statuses, the private sector is where most people work, usually by an overwhelming majority.
The exception is people married to a civilian spouse, where Local-gov and Self-emp-not-inc have 1600+ and 1000+ entries while the private sector sits at around 9.7k.
adult_data.groupby(['Marital_Status', 'Education']).size()
Most people who did a Masters, a Bachelor's or attended some college tend to have civilian spouses. Bachelors and some-college attendees are also well represented among those who were never married.
This may simply be because Bachelors, HS-grads and some-college attendees outnumber most of the other groups.
Still, a large share of the people who did a Masters are married to civilian spouses.
plt.figure(figsize=(25,15))
sns.countplot(adult_data['Native-Country'])
plt.show()
adult_data['Native-Country'].value_counts()
This column is heavily skewed towards the United States, so we can simply replace its null values with United-States.
We also notice that there is no inconsistency in the values (such as US, USA and United States appearing in the same dataset), so we do not need to worry about that.
adult_data['Occupation'].value_counts()
Occupation is not dominated by a single value, so we can check its connections with other columns to see whether we can replace its null values with anything sensible.
adult_data.groupby(['Occupation', 'Education']).size()
As expected, Bachelors, HS-grads and some-college dominate here as well, but there are some interesting findings.
In Tech-support, Bachelors and some-college are the most common, with HS-grads coming in third, followed by Assoc-voc.
Bachelors, Masters and Doctorates prefer the Prof-specialty field of work. Most Masters and Doctorates work in this field, and they are also the ones who usually get paid >50K, so this makes a lot of sense.
adult_data.groupby(['Occupation', 'Workclass']).size()
Here again, the private sector is the most common, except for Farming-fishing, where Self-emp-not-inc is present in abundance. Since those are usually not the people who get paid >50K, farming is probably not very profitable. Most of the people who work without pay also stay in this trade.
plt.figure(figsize=(15,10))
sns.countplot(adult_data['Race'])
adult_data.groupby(['Race', 'Earning_potential']).size()
This is a white-dominated dataset. The ratio of people earning <=50K to those earning >50K within each race is roughly: Amer-Indian-Eskimo 7.6, Asian-Pac-Islander 2.7, Black 7.0, Other 9.8, White 2.9.
So Whites and Asian-Pac-Islanders have a lower ratio of <=50K to >50K earners, while the gap is more pronounced for the other races, especially for people tagged 'Other'.
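Ratios like these can be reproduced directly with a crosstab (a small sketch; the first crosstab column is the '<=50K' class since it sorts before '>50K'):
# Count earners per race, then divide the <=50K column by the >50K column.
race_counts = pd.crosstab(adult_data['Race'], adult_data['Earning_potential'])
print(race_counts.iloc[:, 0] / race_counts.iloc[:, 1])  # <=50K earners per one >50K earner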
This may be because of education. So let's check that.
adult_data.groupby(['Race', 'Education']).size()
We see that most White people are HS-grads and a lot of them attended some college. The next largest groups are Bachelors and Masters.
We see that most people in the Other category do not hold high degrees, which might explain the lower income potential.
A lot of people in the Black category are HS-grads. Since HS-grads include a high number of people with income potential <=50K, this may explain why the ratio is so high.
For Asian-Pac-Islanders, a comparatively large share have done a Masters, which may have contributed to their higher income potential.
plt.figure(figsize=(15,10))
sns.countplot(adult_data['Sex'])
adult_data.groupby(['Sex', 'Earning_potential']).size()
We see that the ratio of males earning <=50K to those earning >50K is around 2.3, but for women the same ratio climbs to around 8.1. This suggests women are being paid less. We have to check the same for every occupation to see whether women tend to work in occupations that usually pay less. This can go both ways.
adult_data.groupby(['Education', 'Sex']).size()
We see that women are much more prominent at the lower levels of education, and then among Bachelors, HS-grads and some-college. This suggests a lot of women stopped their education during or right after school.
adult_data.groupby(['Occupation', 'Sex']).size()
Women dominate Adm-clerical, Other-service and Priv-house-serv.
However, in all of these services, the majority of people get paid <=50K.
We may need to check the numeric columns and then plot a heatmap of the correlations of all the columns with the target in order to get a better idea about the dataset.
numerical_columns
# for i in range(len(numerical_columns)):
# plt.figure(figsize=(15,10))
# sns.distplot(adult_data[numerical_columns[i]])
# plt.show()
# Too many graphs
# for i in range(len(numerical_columns)):
# plt.figure(figsize=(15,10))
# sns.boxplot(adult_data[numerical_columns[i]])
# plt.show()
# Too many graphs
adult_data.var(axis=0)
adult_data.loc[:, numerical_columns].var()
Let's make our lives easier and print the variances as rounded float values.
var_in_float = adult_data.loc[:, numerical_columns].var()
for i in range(len(numerical_columns)):
    print('{} \t\t {}'.format(numerical_columns[i], round(float(var_in_float.iloc[i]), 3)))
We see that fnlwgt, Capital-gain and Capital-loss have the highest variance. This can occur either because these columns carry a lot of information or... because they have a few very extreme values. Let's check those out.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['fnlwgt'])
plt.show()
Do not be fooled by the tiny 0.2 steps: at the end of the axis there is a 1e6 multiplier, i.e. 1.0e+06 = 1,000,000 = 1 * 10^6, so that 0.2 is actually 200,000. That would explain the high variance.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['fnlwgt'])
plt.show()
We see that there are a large number of outliers here. The median lies around 0.2 * 10^6, but a lot of points lie well beyond the 75th percentile. We will have to treat this column for outliers.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Capital-gain'])
plt.show()
This graph is quite interesting. Most of the data sits near zero, some of it is in the 5k-20k range, and there is some data around 100,000 as well! An outlier like that throws off our variance by a lot. We need to deal with those outliers eventually, or the models we build later will not make good predictions.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Capital-gain'])
plt.show()
That looks like a lot of outliers, as almost all of the data is centered at 0, implying very few people had any capital gain. Without much capital gain it is difficult to break the <=50K barrier, which helps explain why so many people in the survey had an income potential of <=50K.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Capital-loss'])
plt.show()
We again see the data is centered towards 0 with some outliers near 2000. We will have to clean or scale this data.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Capital-loss'])
plt.show()
Again, a large number of people have no capital loss, and we saw earlier that a large number have no capital gain either. So maybe people in our sample do not invest, have passive income or take risks. This is kind of sad to see, but at least there were no large losses: the highest loss we see is somewhere in the range of 5,000.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Education-num'])
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Education-num'])
plt.show()
We see that most people fall within the 9-12 range, with the data skewed to the left; that range corresponds to the HS-grad part of the graph. A few people are well below that, with education numbers of 4 down to 1.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['hrs_per_week'])
plt.show()
We see that a lot of people work around 40 hours per week. There are also some people towards the 0 side of the graph; they may be the ones who work without pay or do not get paid at all. We will check for those as well.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['hrs_per_week'])
plt.show()
We see that a lot of people work 40-hour weeks, with most falling in the 30-50 range. However, plenty of people work a lot more or a lot less than that; some even work 100-hour weeks! They are either very passionate, or are having a very bad time.
plt.figure(figsize=(15,10))
sns.distplot(adult_data['Age'])
plt.show()
We see that Age is right-skewed, as more of the people working in this survey are on the younger side.
plt.figure(figsize=(15,10))
sns.boxplot(adult_data['Age'])
plt.show()
We see that most working people are between 17 and a little under 80 years old. 80 is... well, hats off to them, and to those who work well beyond that into their early 90s. It is fascinating to see people work to that age. They must be very passionate about what they do. Or there may be something sadder at play.
Correlation between the numeric columns.
adult_data.corr()
We see no strong correlation between any of the numeric columns. This does not mean that none of the data is correlated; we just have not found that correlation yet.
Let's encode and scale our data. That way, our models will have an easier time working with it.
Filling the null columns.
null_columns = adult_data.columns[adult_data.isnull().any()]
adult_data[null_columns].isnull().sum()
Checking the mode of the null columns. We can check the mode directly, instead of checking the type of each column, because we established at the beginning that only 3 categorical columns have null values.
# adult_data.loc[:, null_columns].mode()
#checking for dataset info before replacing columns.
adult_data.info()
for i in list(null_columns):
    adult_data[i].fillna(adult_data[i].mode().values[0], inplace=True)
print(adult_data.isna().sum())
adult_data.info()
Now that we have treated our data and cleaned our null values, we can go ahead and encode our data.
Label encoding our categorical columns. We could one-hot encode them too, but that is a whole different thing in itself.
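For reference, the one-hot route would look roughly like this with pandas (a sketch of the alternative; not used in the rest of this notebook):
# One-hot encode every categorical column except the target; drop_first removes the redundant dummy.
one_hot_adult_data = pd.get_dummies(
    adult_data,
    columns=[c for c in categorical_columns if c != 'Earning_potential'],
    drop_first=True)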
adult_data[categorical_columns].head()
label_encoder = LabelEncoder()
encoded_adult_data = adult_data.copy()  # work on a copy so the original dataframe stays untouched
for i in categorical_columns:
    encoded_adult_data[i] = label_encoder.fit_transform(adult_data[i])
encoded_adult_data[categorical_columns].head()
min_max_scaler = MinMaxScaler()
scaled_encoded_adult_data = pd.DataFrame()
column_values = encoded_adult_data.columns.values
column_values = column_values[:-1]  # drop the last column, the Earning_potential target
print(column_values[-1])  # sanity check: the last feature column should now be Native-Country
scaled_values = min_max_scaler.fit_transform(encoded_adult_data[column_values])
for i in range(len(column_values)):
    scaled_encoded_adult_data[column_values[i]] = scaled_values[:, i]
scaled_encoded_adult_data['Earning_potential'] = encoded_adult_data['Earning_potential']
scaled_encoded_adult_data.sample(10)
# encoded_adult_data.head()
scaled_encoded_adult_data.info()
scaled_encoded_adult_data.describe().T
for i in range(len(numerical_columns)):
    plt.figure(figsize=(15, 10))
    sns.boxplot(scaled_encoded_adult_data[numerical_columns[i]])
    plt.show()
As we can see in the graphs above, Scaling does nothing to the distribution and does not deal with the outliers either. We have to take care of the outliers.
We have already established the numeric and categorical columns. So, it will be easier for us to deal with them now.
def outlier_detector(datacolumn):
    # IQR-based bounds; np.percentile does not need the data to be pre-sorted
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    return lower_bound, upper_bound
# This takes a column of the dataframe (a Series), computes its 25th and 75th
# percentiles, and returns the IQR-based lower and upper bounds for outliers.
lowerbound, upperbound = outlier_detector(scaled_encoded_adult_data['Age'])
lowerbound, upperbound
scaled_encoded_adult_data[(scaled_encoded_adult_data.Age < lowerbound) | (scaled_encoded_adult_data.Age > upperbound)]
new_columns = numerical_columns.copy()
new_columns.remove('Capital-gain') #Sparse column, must not be treated
new_columns.remove('Capital-loss') #Sparse column, must not be treated
new_columns
treated_scaled_encoded_adult_data = scaled_encoded_adult_data.copy()
for i in new_columns:
    lowerbound, upperbound = outlier_detector(treated_scaled_encoded_adult_data[i])
    median = treated_scaled_encoded_adult_data[i].median()
    outlier_mask = ((treated_scaled_encoded_adult_data[i] < lowerbound) |
                    (treated_scaled_encoded_adult_data[i] > upperbound))
    print('{}: number of outliers replaced: {}'.format(i, outlier_mask.sum()))
    # replace the outlying values with the column median
    treated_scaled_encoded_adult_data.loc[outlier_mask, i] = median
Now that we have treated our outliers, we can now go ahead and plot a correlation heatmap.
fig,ax=plt.subplots(figsize=(20,15))
ax=sns.heatmap(treated_scaled_encoded_adult_data.corr(),annot=True)
From the heatmap we see that none of the columns are strongly correlated with each other, i.e. none of them have a correlation value of >0.7 or <-0.7. So we must find another way to choose our features.
Selecting all features and the target column
print(all_columns)
features = all_columns[:-1]
target = treated_scaled_encoded_adult_data['Earning_potential']
print(features)
print(treated_scaled_encoded_adult_data.shape)
We will now make a new dataframe and use it for our train test splitting.
feature_df = treated_scaled_encoded_adult_data[features]
print(target.head())
feature_df.head()
We will not be using PCA for feature extraction because, as we have seen, several columns have very high variance but do not necessarily contribute much information. Using PCA here would therefore be a bad idea: we might end up keeping high-variance directions that have nothing to do with our problem.
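Just to illustrate the point (PCA was imported above but is not used for modelling), a quick look at how the variance is spread across components; this is a sketch only, and a large explained-variance ratio says nothing about relevance to Earning_potential:
# Fit PCA purely to inspect the explained-variance ratios of the components.
pca = PCA()
pca.fit(feature_df)
print(np.round(pca.explained_variance_ratio_, 3))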
x_train, x_test, y_train, y_test = train_test_split(feature_df, target, test_size=0.2)
print(x_train.shape,y_train.shape, x_test.shape, y_test.shape)
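The import comments mention a 60:20:20 split, but only an 80:20 train/test split is used here. If a separate validation set were wanted, it could be carved out with a second call (a sketch; the *_v names are placeholders and are not used later):
# 20% test first, then 25% of the remaining 80% for validation -> 60:20:20 overall.
x_rest, x_test_v, y_rest, y_test_v = train_test_split(feature_df, target, test_size=0.2)
x_train_v, x_val, y_train_v, y_val = train_test_split(x_rest, y_rest, test_size=0.25)
print(x_train_v.shape, x_val.shape, x_test_v.shape)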
We shall build models to check how they perform after our data preprocessing.
We will start with logistic regression. Since the target column is binary, LogisticRegression can be used.
logistic_regressor = LogisticRegression()
logistic_regressor.fit(x_train, y_train)
logistic_train_score = logistic_regressor.score(x_train, y_train)
logistic_test_score = logistic_regressor.score(x_test, y_test)
logistic_prediction = logistic_regressor.predict(x_test)
print('Train Score: {0}\nTest Score: {1}'.format(logistic_train_score, logistic_test_score))
logistic_mse = mean_squared_error(y_test, logistic_prediction)
logistic_rmse = np.sqrt(logistic_mse)
print(logistic_mse, logistic_rmse)
We see that our logistic regression does not perform particularly well. Note that for a 0/1 target, the MSE is simply the fraction of misclassified samples (1 - accuracy) and the RMSE is its square root. Part of the problem may be that the categorical features are label encoded, which imposes an arbitrary ordering that a linear model struggles with. That is why we will try other algorithms and compare them.
Before we build our KNN model, we need to check for what value of K the model has the least error. That will help us build a more optimal model.
error_rate = []
# This may take a while to run
k_values = list(filter(lambda x: x % 2 == 1, range(0, 50)))  # odd values of K: 1, 3, ..., 49
for i in k_values:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
best_index = error_rate.index(np.min(error_rate))
print(best_index, k_values[best_index])
Since we only tried odd values of K, an index of 12 corresponds to K = k_values[12] = 2 * 12 + 1 = 25.
Thus, the optimal value of K is 25.
plt.figure(figsize=(10,10))
plt.plot(k_values,error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
From the plot, the error rate is lowest around K = 25, so we will take n_neighbors to be 25 for our model.
knn_classifier = KNeighborsClassifier(n_neighbors=25)
knn_classifier.fit(x_train, y_train)
knn_train_score = knn_classifier.score(x_train, y_train)
knn_test_score = knn_classifier.score(x_test, y_test)
print('Train score: {}\nTest score: {}'.format(knn_train_score, knn_test_score))
knn_prediction = knn_classifier.predict(x_test)
knn_classifier_mse = mean_squared_error(y_test, knn_prediction)
knn_classifier_rmse = np.sqrt(knn_classifier_mse)
print('MSE: {}\nRMSE: {}'.format(knn_classifier_mse, knn_classifier_rmse))
We see that KNN did a little better than logistic regression, perhaps because it can capture non-linear relationships that a linear model misses. We shall try other algorithms and see if they perform better.
svc = SVC(kernel='rbf')
svc.fit(x_train, y_train)
svc_train_score = svc.score(x_train, y_train)
svc_test_score = svc.score(x_test, y_test)
print('Train score: {}\nTest score: {}'.format(svc_train_score, svc_test_score))
svc_prediction = svc.predict(x_test)
svc_mse = mean_squared_error(y_test, svc_prediction)
svc_rmse = np.sqrt(svc_mse)
print('MSE: {}\nRMSE: {}'.format(svc_mse, svc_rmse))
The accuracy here is again fairly low. Maybe we need to revisit our feature extraction step.
dtree_classifier = DecisionTreeClassifier(min_impurity_decrease = 0.05)
dtree_classifier.fit(x_train, y_train)
dtree_train_score = dtree_classifier.score(x_train, y_train)
dtree_test_score = dtree_classifier.score(x_test, y_test)
print('Train score: {}\nTest score: {}'.format(dtree_train_score, dtree_test_score))
dtree_prediction = dtree_classifier.predict(x_test)
dtree_mse = mean_squared_error(y_test, dtree_prediction)  # use the decision tree's own predictions, not the SVC's
dtree_rmse = np.sqrt(dtree_mse)
print('MSE: {}\nRMSE: {}'.format(dtree_mse, dtree_rmse))
adaboost_classifier = AdaBoostClassifier(n_estimators=3)
adaboost_classifier.fit(x_train,y_train)
adaboost_train_score = adaboost_classifier.score(x_train,y_train)
adaboost_test_score = adaboost_classifier.score(x_test,y_test)
print('Train score: {}\nTest score: {}'.format(adaboost_train_score, adaboost_test_score))
adaboost_prediction = adaboost_classifier.predict(x_test)
adaboost_mse = mean_squared_error(y_test, adaboost_prediction)
adaboost_rmse = np.sqrt(adaboost_mse)
print('MSE: {}\nRMSE: {}'.format(adaboost_mse, adaboost_rmse))
random_forest_classifier = RandomForestClassifier(n_estimators=20, min_samples_split=15, min_impurity_decrease=0.05)
random_forest_classifier.fit(x_train, y_train)
random_forest_train_score = random_forest_classifier.score(x_train,y_train)
random_forest_test_score = random_forest_classifier.score(x_test,y_test)
print('Train score: {}\nTest score: {}'.format(random_forest_train_score, random_forest_test_score))
random_forest_prediction = random_forest_classifier.predict(x_test)
random_forest_mse = mean_squared_error(y_test, random_forest_prediction)
random_forest_rmse = np.sqrt(random_forest_mse)
print('MSE: {}\nRMSE: {}'.format(random_forest_mse, random_forest_rmse))
As soon as we look at the dataset we realize that this is a US-based survey. Mostly people of White and Black ethnicity took part, but other ethnicities are present as well. The data was less skewed for the Asian-Pac-Islander group, where the ratio of people earning less than 50K to those earning more was lower than for most other ethnicities. The dataset also contains information on more males than females, perhaps because fewer women took the survey. It is also heavily weighted towards people making <=50K USD.
As we went through the analysis, we found many interesting things. Most people go and find work right after high school. However, those who pursue a Bachelor's or higher studies such as a Masters, a Doctorate or a specialization tend to earn more. Some people do not even make it through high school, and they almost always earn less than 50K, which might be because of a lack of skill, education, exposure or more.
We noticed that there is barely any capital gain or capital loss for most people, which leads us to believe there is not a lot of financial growth in this sample. However, the amount people gain is overwhelming compared to the amount they lose.
With our heatmap we saw no strong mathematical correlation, but the other analysis methods gave us some insightful information. From this survey we see that a lot of women earn less than 50K, and it is not just women: racial minorities also seem to earn less.
People tend to work 40-hour weeks, but it is not unusual to see people working a lot more or a lot less, and the working ages range from 17 to over 90. It is interesting that people that old still work. Taken together, these features tell us that people older than 60 usually work less, and that most people who earn more than 50K work either long weeks or short ones.
We see that none of the models perform very well, so we did not check any other metrics such as the classification report or the confusion matrix. This may be because of the features we chose. There is very high variance within the data, which is why we did not use PCA for dimensionality reduction: it would only pick directions with high variance, which does not necessarily mean they have anything to do with our target. We would have to use neural networks or better feature extraction methods to get better results.