Kaggle Academic Success Classification Competition

A Random Forest classification model attempting to predict academic success in a kaggle.com competition

Overview

I wanted to try my hand at a Kaggle competition. The higher prize money competitions have real professionals teaming up to take them on, and I figured I would be a bit underpowered in nearly every way: experience, knowledge, hardware, time. Anyway, this competition looked interesting and feasible. It's simply an attempt to correctly predict whether a virtual student will graduate, drop out, or remain enrolled, based on a healthy number of demographic and school-history features.

Prelims

To begin, I simply ran the data through the normal gauntlet of classification algorithms: logistic regression, decision trees, random forest, K-nearest neighbors, and a simple neural network. The random forest and the neural network were the most accurate, both landing around 80-82%, so I decided to stick with the random forest.
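
For context, that comparison loop looked roughly like the sketch below. This is a minimal version, assuming the data has already been split into X_train/X_test/y_train/y_test the same way as in the code further down; the neural network was a separate Keras model and isn't shown here.

# Rough sketch: try a handful of classifiers and compare test-set accuracy
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f'{name}: {acc:.3f}')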

# Import packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

from tensorflow import keras
from tensorflow.keras import layers


# Specify the path to your CSV file
csv_file_path = 'C:/Users/jpapi/OneDrive/Documents/Jasons/Projects/train.csv/train.csv'

# Read the CSV file into a DataFrame and drop some of the least important features (found later)
df = pd.read_csv(csv_file_path)
df = df.drop(columns=[
    'id',
    'Marital status',
    'Nacionality',
    'International',
    'Daytime/evening attendance',
    'Educational special needs',
    'Previous qualification',
    'Curricular units 1st sem (credited)',
    'Curricular units 2nd sem (credited)',
    'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (without evaluations)',
])

Out of the box, the model did pretty well with only 50 estimators and very little else specified. Kaggle gave me an accuracy of 0.82170, which made me pretty optimistic considering the current leader was near 0.84.


# Separate the target prediction column from the rest

y = df['Target']

x = df.drop('Target', axis=1)


# Specify which data is which and how large the training and test data should be

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


# Define the random forest classifier (these are the parameters found from hyperparameter tuning below)

forest = RandomForestClassifier(n_estimators=300, random_state=42, max_depth=20, max_features='log2', min_samples_leaf=1, min_samples_split=10)


print(X_train.columns)

# Train the forest

forest.fit(X_train, y_train)


# Make predictions on the test set

y_pred = forest.predict(X_test)


# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')

print(classification_report(y_test, y_pred))
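
For reference, the out-of-the-box baseline mentioned above was essentially just the default classifier with 50 trees, roughly like this (the random_state is simply my usual choice, not something tuned):

# Out-of-the-box baseline: 50 trees, everything else left at defaults
baseline_forest = RandomForestClassifier(n_estimators=50, random_state=42)
baseline_forest.fit(X_train, y_train)
print(f'Baseline accuracy: {accuracy_score(y_test, baseline_forest.predict(X_test)):.2f}')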


Hyperparameter Tuning

The first thing I thought I should try was some hyperparameter tuning, using the grid search in the code below. I didn't have the computing horsepower to run that many fits with that many estimators: a single fit with 300 estimators takes about 13 seconds, and multiplied across the roughly 1,215 fits the full grid requires (243 parameter combinations times 5 CV folds), that adds up quickly. So I ran the search with only 50 estimators to find the other parameters, then fixed those and tested just 50, 200, and 300 estimators (a sketch of that two-stage shortcut follows the grid-search code). This isn't ideal, but I don't think changing the parameters from what I'm currently using would make any significant impact. If I were within .01 of the leader, I would consider optimizing further.

# Define the hyperparameters and their values for grid search

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]  # 'auto' was removed in newer scikit-learn; None uses all features
}


# Set up the grid search

grid_search = GridSearchCV(estimator=forest, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)


# Fit the model

grid_search.fit(X_train, y_train)


# Get the best parameters

best_params = grid_search.best_params_


# Output the best parameters

print("Best parameters found: ", best_params)


# Use the best model

best_rf = grid_search.best_estimator_


# Evaluate the model

accuracy = best_rf.score(X_test, y_test)

print(f'Accuracy: {accuracy:.2f}')
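
In practice I didn't run that full grid in one shot. Here's roughly what the two-stage shortcut looked like; this is a sketch of the idea rather than the exact code I ran. First search the tree-shape parameters with the forest held at 50 estimators, then fix those and search only over the number of estimators.

# Stage 1: search the tree-shape parameters with a small, fast forest (50 trees)
stage1_grid = {
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}
stage1 = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                      stage1_grid, cv=5, n_jobs=-1, verbose=2)
stage1.fit(X_train, y_train)
best_shape = stage1.best_params_

# Stage 2: keep those parameters fixed and search only the number of estimators
stage2_grid = {'n_estimators': [50, 200, 300]}
stage2 = GridSearchCV(RandomForestClassifier(random_state=42, **best_shape),
                      stage2_grid, cv=5, n_jobs=-1, verbose=2)
stage2.fit(X_train, y_train)
print(stage2.best_params_)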


Determining Feature Importance 

Next I wanted to figure out which features are most important so that I could remove any that don't contribute to the model. I determined that 'Nacionality', 'International', 'Educational special needs', 'Daytime/evening attendance', 'Curricular units 2nd sem (credited)', 'Curricular units 1st sem (without evaluations)', 'Curricular units 2nd sem (without evaluations)', 'Curricular units 1st sem (credited)', 'Marital status', 'Previous qualification', and 'id' offer virtually nothing to the model, so I removed them. This is what the remaining features look like, ranked by importance.

feature_importances = forest.feature_importances_


# Create a DataFrame for visualization

feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})


# Sort the DataFrame by importance

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)


# Plot feature importances

plt.figure(figsize=(10, 6))

sns.barplot(x='Importance', y='Feature', data=feature_importance_df)

plt.title('Feature Importance')

plt.show()
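
If I wanted to pick the drop list programmatically rather than eyeballing the plot, a simple threshold on the importance values would do it. This is just an illustration; the 0.01 cutoff is a number I made up for the example, not what I actually used.

# Illustrative: list features whose importance falls below a chosen cutoff
low_importance = feature_importance_df[feature_importance_df['Importance'] < 0.01]
print(low_importance['Feature'].tolist())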



# Predict using the trained model
Testdatapath = 'C:/Users/jpapi/OneDrive/Documents/Jasons/Projects/test.csv/test.csv'
testdf = pd.read_csv(Testdatapath)
iddf = pd.DataFrame(testdf['id'])
print(iddf.columns)

# Drop the same columns that were removed from the training data
testdf = testdf.drop(columns=[
    'id',
    'Marital status',
    'Nacionality',
    'International',
    'Daytime/evening attendance',
    'Educational special needs',
    'Previous qualification',
    'Curricular units 1st sem (credited)',
    'Curricular units 2nd sem (credited)',
    'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (without evaluations)',
])
print(testdf.columns)
print(X_train.columns)


# Predict using RandomForestClassifier

new_predictions = forest.predict(testdf)
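
From here the predictions just need to be paired back up with their ids and written out. A minimal sketch, assuming the usual Kaggle submission format of an 'id' column and a 'Target' column (the output filename is my own choice):

# Attach predictions to their ids and write the submission file
submission = iddf.copy()
submission['Target'] = new_predictions
submission.to_csv('submission.csv', index=False)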


End results

The fortunate news: I reran the model with my tuned hyperparameters and with the unimportant columns removed, and it did improve. The gain looks meager at just under 1%, but considering the winner was only at 0.84 accuracy, I think it's a very good improvement; I nearly cut the difference in half. It looks like most of the winners used a GBM algorithm with hyperparameter tuning, so I suspect that no matter what I tried with my random forest I would never quite reach that low-84% accuracy range. Anyway, the competition was only a month long. Time to work on the next one.
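
If I revisit this, a gradient-boosted model would be the obvious next thing to try. Here's a minimal sketch using scikit-learn's HistGradientBoostingClassifier on the same split; this is my own illustration of the GBM idea, not any winner's actual solution.

# Illustrative gradient boosting baseline on the same train/test split
from sklearn.ensemble import HistGradientBoostingClassifier

gbm = HistGradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)
print(f'GBM accuracy: {gbm.score(X_test, y_test):.4f}')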