Kaggle Academic Success Classification Competition
A Random Forest classification model attempting to predict academic success in a kaggle.com competition
Overview
I wanted to try my hand at a Kaggle competition. The higher prize money competitions have some real professionals teaming up to take them on, and I figured I'd be a bit underpowered in nearly every way--experience, knowledge, hardware, time... Anyway, this competition looked interesting and feasible. It's simply an attempt to correctly predict whether a virtual student will graduate, drop out, or remain enrolled, based on a healthy number of demographic and school-history features.
Prelims
To begin, I simply ran the data through the usual gauntlet of classification algorithms: logistic regression, decision trees, random forest, k-nearest neighbors, and then a simple neural network. The random forest and the neural network were the most accurate, both around 80-82%, so I decided to stick with the random forest.
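For reference, a minimal sketch of that first pass might look like the following, once the training DataFrame df from the blocks below is loaded. The neural network is omitted, and these settings are illustrative defaults, not necessarily what I ran.
# Illustrative baseline comparison (assumes df is the training DataFrame
# loaded below, with 'Target' as the label column); settings are defaults
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop('Target', axis=1), df['Target'], test_size=0.2, random_state=42)
for name, model in [
        ('logistic regression', LogisticRegression(max_iter=1000)),
        ('decision tree', DecisionTreeClassifier(random_state=42)),
        ('random forest', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('k-nearest neighbors', KNeighborsClassifier())]:
    model.fit(X_tr, y_tr)
    print(f'{name}: {accuracy_score(y_te, model.predict(X_te)):.3f}')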
# Import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_selection import RFECV
from tensorflow import keras
from tensorflow.keras import layers
# Specify the path to your CSV file
csv_file_path = 'C:/Users/jpapi/OneDrive/Documents/Jasons/Projects/train.csv/train.csv'
# Read the CSV file into a DataFrame and drop the least important features
# (identified later in the feature importance step below)
df = pd.read_csv(csv_file_path)
df = df.drop(columns=[
    'Nacionality',
    'International',
    'Educational special needs',
    'Daytime/evening attendance',
    'Curricular units 2nd sem (credited)',
    'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (without evaluations)',
    'Curricular units 1st sem (credited)',
    'Marital status',
    'Previous qualification',
    'id',
])
Out of the box, the model did pretty well with only 50 estimators and very little else specified. Kaggle gave me an accuracy of 0.82170, which made me pretty optimistic considering the current leader was near 0.84.
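That first submission used essentially the pipeline below, but with a near-default forest, something like:
# The out-of-the-box first submission used roughly this (50 estimators,
# little else specified); the tuned version appears below
forest = RandomForestClassifier(n_estimators=50, random_state=42)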
# Separate the target prediction column from the rest
y = df['Target']
x = df.drop('Target', axis=1)
# Specify which data is which and how large the training and test data should be
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Define the random forest classifier (these parameters came from the
# hyperparameter tuning below)
forest = RandomForestClassifier(n_estimators=300, random_state=42, max_depth=20,
                                max_features='log2', min_samples_leaf=1,
                                min_samples_split=10)
print(X_train.columns)
# Train the forest
forest.fit(X_train, y_train)
# Make predictions on the test set
y_pred = forest.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred))
Hyperparameter Tuning
The first thing I tried was hyperparameter tuning, using the grid search in the code below. I didn't have the computing horsepower to run that many fits at full size: the grid has 243 parameter combinations, and with 5-fold cross-validation that's 1,215 fits. A single fit takes about 13 seconds with 300 estimators, so the full search would have run for hours. Instead, I ran the grid with only 50 estimators to find the other parameters, then held those fixed and compared 50, 200, and 300 estimators (a sketch of that staged approach follows the grid search code). This isn't ideal, but I don't think changing the parameters from what I'm currently using would make any significant impact. If I were within .01 of the leader, I would consider optimizing further.
# Define the hyperparameters and their values for grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3;
    # drop it from this list on newer versions
    'max_features': ['auto', 'sqrt', 'log2']
}
# Set up the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
# Fit the model
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
# Output the best parameters
print("Best parameters found: ", best_params)
# Use the best model
best_rf = grid_search.best_estimator_
# Evaluate the model
accuracy = best_rf.score(X_test, y_test)
print(f'Accuracy: {accuracy:.2f}')
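The staged search described above looked roughly like this: search everything except n_estimators with a cheap 50-tree forest, then sweep only the estimator count with the winners held fixed. This is a sketch, not the exact code I kept.
# Stage 1: search the non-estimator parameters with a cheap 50-tree forest
small_grid = {k: v for k, v in param_grid.items() if k != 'n_estimators'}
stage1 = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                      small_grid, cv=5, n_jobs=-1)
stage1.fit(X_train, y_train)
# Stage 2: hold those winners fixed and sweep only the estimator count
stage2 = GridSearchCV(RandomForestClassifier(random_state=42, **stage1.best_params_),
                      {'n_estimators': [50, 200, 300]}, cv=5, n_jobs=-1)
stage2.fit(X_train, y_train)
print(stage2.best_params_, stage2.best_score_)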
Determining Feature Importance
Next, I wanted to figure out which features are most important so that I could remove any that don't contribute to the model. I found that 'Nacionality', 'International', 'Educational special needs', 'Daytime/evening attendance', 'Curricular units 2nd sem (credited)', 'Curricular units 1st sem (without evaluations)', 'Curricular units 2nd sem (without evaluations)', 'Curricular units 1st sem (credited)', 'Marital status', 'Previous qualification', and 'id' offer virtually nothing to the model, so I removed them. This is what the remaining features look like in chart form.
feature_importances = forest.feature_importances_
# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})
# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.show()
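Since RFECV is already imported, cross-validated recursive feature elimination is another way to arrive at a keep/drop list instead of eyeballing the importance chart. A sketch, using a small forest and the same training split as above:
# Cross-validated recursive feature elimination; a small forest keeps it cheap
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=42),
                 step=1, cv=5, scoring='accuracy', n_jobs=-1)
selector.fit(X_train, y_train)
print('Optimal number of features:', selector.n_features_)
print('Kept:', list(X_train.columns[selector.support_]))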
# Predict using the trained model
test_csv_path = 'C:/Users/jpapi/OneDrive/Documents/Jasons/Projects/test.csv/test.csv'
testdf = pd.read_csv(test_csv_path)
iddf = pd.DataFrame(testdf['id'])
print(iddf.columns)
# Drop the same columns that were removed from the training data
testdf = testdf.drop(columns=[
    'Nacionality',
    'International',
    'Educational special needs',
    'Daytime/evening attendance',
    'Curricular units 2nd sem (credited)',
    'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (without evaluations)',
    'Curricular units 1st sem (credited)',
    'Marital status',
    'Previous qualification',
    'id',
])
print(testdf.columns)
print(X_train.columns)
# Predict using RandomForestClassifier
new_predictions = forest.predict(testdf)
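The predictions still need to be written out for Kaggle. A minimal sketch, assuming the usual submission format of an 'id' column and a 'Target' column (check the competition's sample_submission.csv to confirm):
# Assemble the submission file; the 'id'/'Target' column names are an
# assumption based on the usual Kaggle format
submission = iddf.copy()
submission['Target'] = new_predictions
submission.to_csv('submission.csv', index=False)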
End Results
The fortunate news: I reran the model with my tuned hyperparameters and with the unimportant columns removed, and it did improve. The improvement looks meager at just under 1%, but considering the winner was only at 0.84 accuracy, I think it's a very good gain: I nearly cut the gap to the leader in half. It looks like most of the winners used a GBM algorithm with hyperparameter tuning, so I suspect that no matter what I try with my random forest, I would never quite reach the low-84% accuracy range. Anyway, the competition was only a month long. Time to work on the next one.
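If I do revisit this one, a gradient-boosted model seems like the place to start, since that's what the winning entries reportedly used. A minimal sketch on the same split, with default settings that would need their own round of tuning:
# A possible next step: scikit-learn's built-in GBM, default settings only
from sklearn.ensemble import HistGradientBoostingClassifier
gbm = HistGradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)
print(f'GBM accuracy: {gbm.score(X_test, y_test):.4f}')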