Income Machine Learning Classification
Python code consisting of a series of machine learning classification models that predict whether an adult has an income above or below 50K
Overview
I wanted to experiment with a few machine learning classification methods without spending much time wrangling data, so I found some clean income data on Kaggle. This data contains age, gender, country of residence, family status, education, and other descriptive data for each individual, along with whether they make more or less than 50K dollars per year. I wanted to see which machine learning algorithm could predict this most accurately based on the other variables.
Methodology
First, I wanted to explore the data a little, so I created a correlation heatmap to see how strongly each variable correlates with the others, especially income. In order to do that, I had to encode the data, which I did with a simple LabelEncoder().
# Import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import ColumnTransformer
from tensorflow import keras
from tensorflow.keras import layers
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Specify the path to your CSV file
csv_file_path = 'C:/Users/jpapi/OneDrive/Documents/Jasons/Projects/AdultIncome/adult.csv'
# Read the CSV file into a DataFrame
df = pd.read_csv(csv_file_path)
# Define the label encoder
label_encoder = LabelEncoder()
# Apply the encoder to every column, converting string values to integer codes
df_encoded = df.apply(label_encoder.fit_transform)
# Create a correlation matrix
corr = df_encoded.corr()
# Set up the matplotlib plot configuration
f, ax = plt.subplots(figsize=(12, 10))
# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap
sns.heatmap(corr, annot=True, cmap=cmap)
Label Encoded Heatmap
Nominal Data Issue
It was not until later that I realized I can't do that; it is a mistake. A label encoder simply replaces non-numerical values with specific numerical values. This is not good practice when we have categorical, nominal values because it can imply correlation even when there is none.
For example, if the occupational value "Machine-op-inspect" is assigned a 1, "Exec-managerial" a 2, and "Other-service" a 6, both the correlation function and our machine learning algorithms will treat the machinist as more closely related to the manager than to the other service worker, which is not the case at all. We want to treat these values as nominal, not as ordered relative to the other values of the same variable.
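As a quick toy illustration of the difference (this snippet is mine, not part of the project code, and reuses the pandas and LabelEncoder imports from above):
# Toy example (not from the Kaggle data) showing why label encoding misleads
occ = pd.Series(['Machine-op-inspect', 'Exec-managerial', 'Other-service'])
# Label encoding assigns arbitrary integers, implying a fake ordering and distance
print(LabelEncoder().fit_transform(occ))   # [1 0 2] -- purely alphabetical, no real meaning
# One-hot encoding gives each category its own binary column instead
print(pd.get_dummies(occ, prefix='occupation'))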
To fix this, I had to create a Cramér's V statistic matrix and then map that. The Cramér's V heatmap does not include all of the variables; that is because it only handles categorical variables and does not account for the truly numerical ones such as age, capital.gain, hours.per.week, etc. For those, the original heatmap is sufficient.
# Cramér's V uses the chi-squared test of independence from scipy
from scipy.stats import chi2_contingency

# Calculate Cramér's V for each pair of categorical variables
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
# Create a correlation matrix
categorical_columns = df.select_dtypes(include='object').columns
corr_matrix = pd.DataFrame(index=categorical_columns, columns=categorical_columns, dtype=float)
for col1 in categorical_columns:
    for col2 in categorical_columns:
        corr_matrix.loc[col1, col2] = cramers_v(df[col1], df[col2])
# Create a heatmap using seaborn
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Cramér\'s V Correlation Heatmap (15 Column DataFrame)')
plt.show()
Cramér's V Statistic Correlation Heatmap
Analysis
This is very interesting to see. It is not surprising that education, workclass, and age have a strong correlation with income. What surprises me is that marital.status and relationship have such a strong correlation with income; my hypothesis is that the income in the data could actually be household income, so those who are married would have a much higher likelihood of earning over 50K dollars per year because of a likely dual income. That is simply a guess, though. Another useful insight from this heatmap is that marital status and sex are highly correlated, which tells me this is likely not a sample that reflects the entire population. Lastly, a rather sad insight is that race and sex are correlated with income at all.
Next, we prepare the data for calculations...
# Assign 0's and 1's to the instances in the columns we want to predict
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})
# One-hot-encode the columns that have string representations
# I ran the functions with the encoded data above and the results were nearly identical
df = pd.get_dummies(df, columns=['workclass','education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country'])
# Separate the target prediction column from the rest
y = df['income']
x = df.drop('income', axis=1)
# Specify which data is which and how large the training and test data should be
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Define the standard scaler and apply it to our data
# Ensures compatibility and can potentially improve accuracy
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Machine Learning Code and Results
Logistic Regression
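A rough sketch of what the logistic regression step looks like, reusing the scaled training and test sets prepared above (the max_iter value here is my own assumption):
# Sketch of the logistic regression step (reuses X_train_scaled, y_train, etc. from above)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
log_pred = log_reg.predict(X_test_scaled)
print('Logistic Regression accuracy:', accuracy_score(y_test, log_pred))
print(classification_report(y_test, log_pred))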
Decision Tree
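A similar sketch for the decision tree (the random_state is an assumption for reproducibility, not necessarily the value I used):
# Sketch of the decision tree step
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train_scaled, y_train)
tree_pred = tree_clf.predict(X_test_scaled)
print('Decision Tree accuracy:', accuracy_score(y_test, tree_pred))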
Random Forrest
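The random forest step, in sketch form (n_estimators and random_state are illustrative defaults, not the exact configuration):
# Sketch of the random forest step
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X_train_scaled, y_train)
forest_pred = forest_clf.predict(X_test_scaled)
print('Random Forest accuracy:', accuracy_score(y_test, forest_pred))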
K-Nearest Neighbor
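A sketch of the K-nearest neighbor step (k=5 is sklearn's default and an assumption here); the standard scaling above matters most for this model since it is distance-based:
# Sketch of the K-nearest neighbor step
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)
knn_pred = knn_clf.predict(X_test_scaled)
print('K-Nearest Neighbor accuracy:', accuracy_score(y_test, knn_pred))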
Neural Network
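A sketch of the kind of Keras network used (the layer sizes, epochs, and batch size are illustrative assumptions, not the exact configuration I settled on):
# Sketch of the neural network: a small dense binary classifier on the scaled features
model = keras.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_scaled, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=0)
loss, acc = model.evaluate(X_test_scaled, y_test, verbose=0)
print('Neural Network accuracy:', acc)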
Neural Network Results
Analysis
Grading on pure accuracy, the Random Forest and Neural Network are the best performers at 85% each. The other models all performed fairly well too; Logistic Regression and K-Nearest Neighbor, the least accurate, still produced 79% each. I thought I could make the neural network perform slightly better by adjusting parameters such as the activation, epochs, batch size, and optimizer, but I simply could not make it break 85%. In fact, many of the attempts actually made it perform slightly worse. Even though this training data doesn't have any personal meaning, it was a very fun experiment for me.