Income Machine Learning Classification
Python code consisting of a series of machine learning classification models that predict whether an adult has an income above or below 50K
Overview
I wanted to experiment with a few machine learning classification methods without spending much time wrangling data, so I found some clean income data on Kaggle. This data contains age, gender, country of residence, family status, education, and other descriptive data for each individual, along with whether they make more or less than 50K dollars per year. I wanted to see which machine learning algorithm could predict this most accurately based on the other variables.
Methodology
First, I wanted to explore the data a little, so I created a correlation heatmap to see how strongly each variable correlates with the others, especially income. In order to do that, I had to encode the data, which I did with a simple LabelEncoder().
# Import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import ColumnTransformer
from tensorflow import keras
from tensorflow.keras import layers
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Specify the path to your CSV file
csv_file_path = 'C:/Users/jpapi/OneDrive/Documents/Jasons/Projects/AdultIncome/adult.csv'
# Read the CSV file into a DataFrame
df = pd.read_csv(csv_file_path)
# Define the label encoder
label_encoder = LabelEncoder()
# Apply the encoder to every column, converting string values to integer codes
df_encoded = df.apply(label_encoder.fit_transform)
# Create a correlation matrix
corr = df_encoded.corr()
# Set up the matplotlib plot configuration
f, ax = plt.subplots(figsize=(12, 10))
# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap
sns.heatmap(corr, annot=True, cmap=cmap)
Label Encoded Heatmap
Nominal Data Issue
It was not until later that I realized I can't do that; it is a mistake. A label encoder simply replaces non-numerical values with specific numerical values. This is not good practice when we have categorical, nominal values because it can imply correlation even when there is none.
For example, if the occupational value "Machine-op-inspect" is assigned a 1, "Exec-managerial" a 2, and "Other-service" a 6, both the correlation function and our machine learning algorithms will treat the machinist as more closely related to the manager than to the other service worker, which is not the case at all. We want to treat these values as nominal, not as ordered relative to the other values of the same variable.
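As a quick toy illustration of the difference (this snippet is mine, not part of the project code, and reuses the pandas and LabelEncoder imports from above):
# Toy example (not from the Kaggle data) showing why label encoding misleads
occ = pd.Series(['Machine-op-inspect', 'Exec-managerial', 'Other-service'])
# Label encoding assigns arbitrary integers, implying a fake ordering and distance
print(LabelEncoder().fit_transform(occ))   # [1 0 2] -- purely alphabetical, no real meaning
# One-hot encoding gives each category its own binary column instead
print(pd.get_dummies(occ, prefix='occupation'))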
To fix this, I had to create a Cramér's V statistic matrix and then map that. The Cramér's V heatmap does not include all of the variables; that is because it only handles categorical variables and does not account for the truly numerical ones such as age, capital.gain, hours.per.week, etc. For those, the original heatmap is sufficient.
# Cramér's V uses the chi-squared test of independence from scipy
from scipy.stats import chi2_contingency

# Calculate Cramér's V for each pair of categorical variables
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
# Create a correlation matrix
categorical_columns = df.select_dtypes(include='object').columns
corr_matrix = pd.DataFrame(index=categorical_columns, columns=categorical_columns, dtype=float)
for col1 in categorical_columns:
    for col2 in categorical_columns:
        corr_matrix.loc[col1, col2] = cramers_v(df[col1], df[col2])
# Create a heatmap using seaborn
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Cramér\'s V Correlation Heatmap (15 Column DataFrame)')
plt.show()
Cramér's V Statistic Correlation Heatmap
Analysis
This is very interesting to see. It is not surprising that education, workclass, and age have a strong correlation with income. What surprises me is that marital.status and relationship have such a strong correlation with income; my hypothesis is that the income in the data could actually be household income, so those who are married would have a much higher likelihood of earning over 50K dollars per year because of a likely dual income. That is simply a guess, though. Another useful insight from this heatmap is that marital status and sex are highly correlated, which tells me this is likely not a sample that reflects the entire population. Lastly, a rather sad insight is that race and sex are correlated with income at all.
Next, we prepare the data for calculations...
# Assign 0's and 1's to the instances in the columns we want to predict
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})
# One-hot-encode the columns that have string representations
# I ran the functions with the encoded data above and the results were nearly identical
df = pd.get_dummies(df, columns=['workclass','education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country'])
# Separate the target prediction column from the rest
y = df['income']
x = df.drop('income', axis=1)
# Specify which data is which and how large the training and test data should be
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Define the standard scaler and apply it to our data
# Ensures compatibility and can potentially improve accuracy
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Machine Learning Code and Results
Logistic Regression
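A rough sketch of what the logistic regression step looks like, reusing the scaled training and test sets prepared above (the max_iter value here is my own assumption):
# Sketch of the logistic regression step (reuses X_train_scaled, y_train, etc. from above)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
log_pred = log_reg.predict(X_test_scaled)
print('Logistic Regression accuracy:', accuracy_score(y_test, log_pred))
print(classification_report(y_test, log_pred))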
Decision Tree
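A similar sketch for the decision tree (the random_state is an assumption for reproducibility, not necessarily the value I used):
# Sketch of the decision tree step
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train_scaled, y_train)
tree_pred = tree_clf.predict(X_test_scaled)
print('Decision Tree accuracy:', accuracy_score(y_test, tree_pred))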
Random Forrest
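The random forest step, in sketch form (n_estimators and random_state are illustrative defaults, not the exact configuration):
# Sketch of the random forest step
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X_train_scaled, y_train)
forest_pred = forest_clf.predict(X_test_scaled)
print('Random Forest accuracy:', accuracy_score(y_test, forest_pred))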
K-Nearest Neighbor
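A sketch of the K-nearest neighbor step (k=5 is sklearn's default and an assumption here); the standard scaling above matters most for this model since it is distance-based:
# Sketch of the K-nearest neighbor step
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)
knn_pred = knn_clf.predict(X_test_scaled)
print('K-Nearest Neighbor accuracy:', accuracy_score(y_test, knn_pred))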
Neural Network
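A sketch of the kind of Keras network used (the layer sizes, epochs, and batch size are illustrative assumptions, not the exact configuration I settled on):
# Sketch of the neural network: a small dense binary classifier on the scaled features
model = keras.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_scaled, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=0)
loss, acc = model.evaluate(X_test_scaled, y_test, verbose=0)
print('Neural Network accuracy:', acc)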
Neural Network Results
Analysis
Grading on pure accuracy, the Random Forest and Neural Network are the best performers at 85% each. The other models all performed fairly well too; Logistic Regression and K-Nearest Neighbor, the least accurate, still produced 79% each. I thought I could make the neural network perform slightly better by adjusting parameters such as the activation, epochs, batch size, and optimizer, but I simply could not make it break 85%. In fact, many of the attempts actually made it perform slightly worse. Even though this training data doesn't have any personal meaning, it was a very fun experiment for me.