Hello there! Continuing to share the knowledge I have accumulated through my Kaggle competitions, here are my lessons from the June Kaggle Playground Series — Classification with an Academic Success Dataset.
A few words if you are new here: I am currently a Data Analyst, self-teaching to transition into Data Science. I found the Kaggle Playground Series to be a great resource for hands-on Machine Learning experience, with ‘close-to-real-life’ datasets and an extraordinary Data Scientist community to learn from.
I have just started participating in the Series and have already been exposed to many basic to advanced methods in Machine Learning that I can’t wait to share here!
The competition challenge is to classify whether a student will ‘Graduate’, ‘Enroll’, or ‘Dropout’, given their demographic and academic performance data. Coming into this problem, this was the very FIRST time I built a machine learning model with a multi-class output 🙂 I didn’t even know what to do with the ‘Target’ column. So, calm down and…
The first step is to use the LabelEncoder() method to encode the target classes as numbers.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Convert the target variable 'Target' to numerical data
y_train = le.fit_transform(y_train)
y_valid = le.transform(y_valid)
With the above code, we transform the text labels into numbers. LabelEncoder assigns integers to the classes in alphabetical order, so, for example, ‘Dropout’ becomes 0, ‘Enroll’ becomes 1, and ‘Graduate’ becomes 2. Then, we apply different algorithms to output either 0, 1, or 2.
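If you want to double-check the exact mapping your LabelEncoder picked, you can inspect le.classes_ (the position of each class in that array is its encoded integer):
# Print the class-to-integer mapping learned by the encoder
print(dict(zip(le.classes_, range(len(le.classes_)))))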
Another important thing is that we need to transform the numbers back to their original labels before submitting. I don’t know if this is a rookie mistake, but I kept getting a rock-bottom accuracy score just because I applied inverse_transform to my model output incorrectly.
Here’s my code on how to do it right!
majority_voting_test_predictions = ensemble_test_predictions.mode(axis=1).iloc[:, 0]
y_predictions = list(map(int, majority_voting_test_predictions))

# Load the sample submission and replace the Target column with our predictions
sample_submission_df = pd.read_csv('/kaggle/input/playground-series-s4e6/sample_submission.csv')
sample_submission_df['Target'] = y_predictions

# Convert the encoded predictions back to the original text labels
sample_submission_df['Target'] = le.inverse_transform(sample_submission_df['Target'])
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()
Another feature engineering technique in my pocket!
I started with the feature engineering technique I learned from the previous competition, which is adding statistical features. However, it did not improve my model accuracy score this time.
I then learned from other data fellows on Kaggle that we can separate the features into numerical and categorical groups and engineer them accordingly.
Preprocessing — Separating features into numerical and categorical
This technique was particularly helpful for this dataset because some features appear to be numerical but are actually categorical, since they are number-coded, such as ‘Debtor’, ‘Gender’, etc.
But given 38 features, how could I know which ones are actually categorical disguised as numerical? The answer is EDA, the biggest lesson I learned from my last competition, which I shared in the previous story.
Once you can identify ‘by eye’ which ones are categorical and which aren’t, you can manually categorize them or use these lines of code to save some time.
# Categorical columns: if the number of unique values is 8 or fewer
cat_cols = [col for col in X.columns if X[col].nunique() <= 8]
# Numerical columns: if the number of unique values is 9 or more
num_cols = [col for col in X.columns if X[col].nunique() >= 9]
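If you want a quick sanity check of how the threshold splits the 38 features, a couple of print statements do the trick (assuming X is the training feature DataFrame, as above):
# How many columns fell into each bucket, with a few example names
print(f"{len(cat_cols)} categorical columns, e.g. {cat_cols[:5]}")
print(f"{len(num_cols)} numerical columns, e.g. {num_cols[:5]}")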
Well, my feature engineering game did not stop there! Maybe taking extra steps at this stage helped me land in the top 150 out of roughly 2,700 entries in this month’s playground competition.
After separating them, I performed different engineering techniques on each type of data. (again, this is my first time — kinda nervous)
1. Standard Scaler for Numerical Features
I have heard of scaling but never implemented it before, so I took advantage of this dataset to learn more about it!
For whom it may help…
Scaling is the process of adjusting the range of features so that they have comparable scales. It is essential because many machine learning algorithms perform better when numerical input variables are scaled to a standard range. For instance, if one feature ranges from 1 to 1000 and another from 0 to 1, the model might consider the first feature more important due to its larger range, even if it’s not more informative.
Standard Scaler is a common method used for scaling. It transforms the data such that the mean is 0 and the standard deviation is 1. And this is the method I used for my model.
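Here is a tiny standalone sketch (toy numbers, not the competition data) of what Standard Scaler does to a wide-range feature:
import numpy as np
from sklearn.preprocessing import StandardScaler

# A toy feature ranging from 1 to 1000
toy = np.array([[1.0], [10.0], [100.0], [1000.0]])

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

print(toy_scaled.ravel())                    # values are now centered around 0
print(toy_scaled.mean(), toy_scaled.std())   # approximately 0.0 and 1.0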
2. One-Hot Encoding for Categorical Features
Next, I will also engineer my categorical features using One-Hot Encoding. For whom it may help…
One-Hot Encoding is a technique used to convert categorical data into a format that can be provided to machine learning algorithms to improve their performance. It transforms each categorical value into a new binary column, indicating the presence (1) or absence (0) of that value.
In my opinion, this encoding technique is helpful for the dataset because it
- Prevents Misinterpretation: Avoids the issue of algorithms interpreting numeric values as ordered (which can happen with label encoding).
- Increases Data Dimensionality: Expands the feature space, which can give the model more signal to work with, since the dataset originally had only 38 features.
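As a quick toy illustration of what one-hot encoding produces (the real pipeline I used is shown right below; this is just a made-up single column in the spirit of ‘Debtor’):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy number-coded categorical column (0 = no, 1 = yes)
toy = pd.DataFrame({'Debtor': [0, 1, 0, 1, 1]})

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy).toarray()

print(encoder.get_feature_names_out())   # ['Debtor_0' 'Debtor_1']
print(encoded)                           # one binary column per category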
Here’s my code for the entire engineering process
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define the preprocessing for numerical and categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ])

# Fit and transform the training data
X_train_processed = preprocessor.fit_transform(X_train)
X_valid_processed = preprocessor.transform(X_valid)
df_test_processed = preprocessor.transform(df_test)
# Convert processed arrays back to DataFrames (optional)
X_train_processed = pd.DataFrame(X_train_processed, columns=preprocessor.get_feature_names_out())
X_valid_processed = pd.DataFrame(X_valid_processed, columns=preprocessor.get_feature_names_out())
df_test_processed = pd.DataFrame(df_test_processed, columns=preprocessor.get_feature_names_out())
# Keep dataframe names consistent
X_train = X_train_processed
X_valid = X_valid_processed
df_test = df_test_processed
Remember to scale & encode df_test too!
When building a machine learning model, I never know the just-right values for any of my parameters.
Should max_depth be 7 or 16?
Reviewing other notebooks on Kaggle, I was also stunned to see a genius set reg_lambda to 29.548955808402486 and get a much higher accuracy score than me! I was like:
Where is this reg_lambda = 29.548955808402486 coming from?
Well, the answer is hyperparameter tuning.
For whom it may help…
Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model. Just a warning ahead: because of the extensive search involved, it will take you a ton of time to find the sweet spots! Literally, I ran the code, cooked and finished my dinner, and only then did I have the best parameters.
Example Workflow
- Define Hyperparameters: Determine which hyperparameters to tune and their respective ranges.
- Choose a Tuning Method: Select a method like grid search or Bayesian optimization.
- Implement and Run: Use tools like GridSearchCV or Optuna to run the tuning process. I used Optuna.
- Analyze Results: Evaluate the performance metrics to identify the best hyperparameters.
- Refine and Repeat: If necessary, refine the hyperparameter ranges or try different tuning methods for better results.
Steps in Hyperparameter Tuning
- Choose the hyperparameter space: Define the range of values for each hyperparameter to search.
- Select a search strategy: Decide on grid search, random search, Bayesian optimization, etc.
- Evaluate the model: Train the model with different hyperparameters and evaluate its performance using cross-validation or a validation set.
- Select the best hyperparameters: Choose the set of hyperparameters that results in the best performance according to the chosen evaluation metric.
- Retrain the model: Retrain the final model using the best hyperparameters on the full training dataset.
Given my dedication to this competition, I used Optuna for all of my machine learning algorithms. Here’s the code I used for XGBoost:
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Define objective function for Optuna
def objective(trial):
    # Define hyperparameters to search
    params = {
        'booster': 'gbtree',
        'objective': 'multi:softmax',
        'eval_metric': 'merror',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'min_child_weight': trial.suggest_float('min_child_weight', 1, 10),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 10.0),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'random_state': 0
    }

    # Split the training data into training and validation sets
    X_train_split, X_valid_split, y_train_split, y_valid_split = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

    # Train XGBoost model with current hyperparameters
    clf = XGBClassifier(**params)
    clf.fit(X_train_split, y_train_split)

    # Predict on validation set
    y_pred = clf.predict(X_valid_split)

    # Calculate evaluation metric on validation set
    accuracy = accuracy_score(y_valid_split, y_pred)
    return accuracy
# Optimize hyperparameters using Optuna
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
# Get best hyperparameters
best_params = study.best_params
print("Best Hyperparameters:", best_params)
# Train final model with best hyperparameters
xgb = XGBClassifier(**best_params)
xgb.fit(X_train, y_train)
# Predict on validation data
y_pred_xgb = xgb.predict(X_valid)
# Calculate accuracy on validation data
accuracy = accuracy_score(y_valid, y_pred_xgb)
print("Accuracy on Validation Data:", accuracy)
As I mentioned in my previous blog, I first learned about ensembling models and started by applying Stacking. This episode, I used Voting. Why, you may ask?
Trial and error was how I decided which method to use.
In this challenge, I tried both Voting and Stacking, then evaluated the models on accuracy. Voting gave a higher cross-validated accuracy score in a shorter running time as well. Voting was clearly the winner in this case.
Speaking of Voting methods, there are three main ones.
- Majority (Hard Voting). In majority voting, each individual model in the ensemble makes a prediction, and the final output is the class that receives the majority of the votes. (+) Simple to implement. (-) All models have equal weight, which may not be optimal if some models are more accurate than others.
- Soft Voting. In soft voting, each model outputs a probability for each class, and the final prediction is the class with the highest average probability across all models. (+) Often more accurate than majority voting because it uses more information from the models. (-) Requires models to output probabilities, which may not be possible for all types of models.
- Weighted Average. In weighted voting, each model’s vote is weighted by a pre-defined value that reflects the model’s accuracy or importance. The final prediction is the class that receives the highest weighted sum of votes. (+) More flexible and can improve accuracy by giving more weight to better-performing models. (-) Requires careful tuning of weights, which can be complex. If weights are not chosen properly, it may not lead to significant improvements in performance.
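Here is a minimal sketch of how the three flavors could be computed from individual model outputs. The model names and weights below are just illustrative placeholders (only xgb comes from the code above), and it assumes each fitted model exposes predict and predict_proba:
import numpy as np
import pandas as pd

models = {'xgb': xgb, 'lgbm': lgbm_model, 'cat': cat_model}   # hypothetical set of fitted models
weights = {'xgb': 0.4, 'lgbm': 0.35, 'cat': 0.25}             # illustrative weights, not tuned values

# Hard (majority) voting: the most frequent predicted class across models
hard_preds = pd.DataFrame({name: m.predict(X_valid) for name, m in models.items()})
majority_voting_predictions = hard_preds.mode(axis=1).iloc[:, 0].astype(int)

# Soft voting: average the predicted probabilities, then take the argmax
avg_probas = np.mean([m.predict_proba(X_valid) for m in models.values()], axis=0)
soft_voting_predictions = np.argmax(avg_probas, axis=1)

# Weighted average: same idea, but each model's probabilities get a weight
weighted_probas = sum(weights[name] * m.predict_proba(X_valid) for name, m in models.items())
weighted_avg_predictions = np.argmax(weighted_probas, axis=1)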
I also used all of them and selected the best performer.
ensemble_accuracy = {
    'Majority Voting': accuracy_score(y_valid, majority_voting_test_predictions),
    'Soft Voting': accuracy_score(y_valid, soft_voting_predictions),
    'Weighted Average': accuracy_score(y_valid, weighted_avg_predictions)
}

print("Ensemble Accuracy:")
for method, acc in ensemble_accuracy.items():
    print(f"{method}: {acc:.5f}")
I admit it.
I think that using Neural Networks just for a classification task is overkill!
However, if I don’t implement it, I will never know how.
And it turns out, it was not that complicated to implement with Keras.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score

# Convert to categorical (one-hot encoding)
y_train_categorical = to_categorical(y_train)
y_valid_categorical = to_categorical(y_valid)
# Define the neural network model
model = Sequential()
model.add(Dense(256, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(y_train_categorical.shape[1], activation='softmax'))
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train, y_train_categorical, epochs=50, batch_size=32, validation_data=(X_valid, y_valid_categorical))
# Evaluate the model
loss, accuracy = model.evaluate(X_valid, y_valid_categorical)
print("Validation Accuracy:", accuracy)
# Predict with Neural Network
nn_pred_probs = model.predict(X_valid)
nn_pred = np.argmax(nn_pred_probs, axis=1)
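One note: the EarlyStopping callback is imported above but not actually used in the fit call. If you want training to stop once validation loss stops improving, a minimal sketch looks like this:
# Stop training when validation loss has not improved for 5 epochs,
# and restore the weights from the best epoch seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(
    X_train, y_train_categorical,
    epochs=50, batch_size=32,
    validation_data=(X_valid, y_valid_categorical),
    callbacks=[early_stop]
)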
Here’s my step-by-step explanation:
- Define the Neural Network Model — Sequential(). This is a linear stack of layers, where each layer has exactly one input tensor and one output tensor.
- Add dense (fully connected) hidden layers with 256, 128, and 64 neurons. On the first layer, input_dim=X_train.shape[1] specifies the input shape, which matches the number of features in the training data. The activation function is ReLU (Rectified Linear Unit).
- Add a dropout layer with a dropout rate of 0.2. Dropout is a regularization technique that helps prevent overfitting by randomly setting 20% of the input units to 0 during training.
- Add the output layer with a number of neurons equal to the number of classes in the target variable. The softmax activation function is used for multi-class classification, producing a probability distribution over the classes.
- Compile the model
- Train the model
- Evaluate the model
- Predict the output
If you have never heard of it, the Confusion Matrix (CM) is one of the greatest tools for evaluating a classification machine learning model. I learned about it through this amazing StatQuest.
However, not until this competition had I seen it in action.
Unlike overall accuracy, the confusion matrix provides insights into how well the model performs on each class. It shows the number of true positives, true negatives, false positives, and false negatives for each class. By examining which classes are often confused with each other, you can understand specific weaknesses of your model. For example, if your model frequently misclassifies one class as another, this indicates a need for further investigation.
From the CM, you can also calculate Precision, Recall, and F1-Score. They are especially important in imbalanced datasets where accuracy alone can be misleading.
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- Recall (Sensitivity): The ratio of correctly predicted positive observations to all the actual positives.
- F1-Score: The harmonic mean of precision and recall, providing a single measure of a test’s accuracy.
- Specificity: The ratio of correctly predicted negative observations to all the actual negatives. This is also known as the true negative rate.
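As a minimal code sketch of all this (reusing the validation labels and the XGBoost predictions from earlier; classification_report prints per-class precision, recall, and F1):
from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_valid, y_pred_xgb))

# Per-class precision, recall, and F1-score
print(classification_report(y_valid, y_pred_xgb, target_names=le.classes_))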
For models that output probabilities, the confusion matrix can help in adjusting decision thresholds to balance precision and recall according to the specific needs of the application.
If you are still new to this concept and confused by the blurb above, please check out this StatQuest video! My journey in machine learning turned a new page after watching Josh Starmer explain it.
After a month of improving my classification model, I ended with an accuracy score of 0.83805, placing 145th out of 2,691 participants! I am honestly surprised by the leaderboard results and I’d be happy to celebrate this milestone with you guys!
I know there is way more room to grow, so please, please, please… take a peek at my submission and give me some more advice!
Thank you, Kaggle Team, for hosting the competition, and all of the competitors for teaching me through your public notebooks. I’d be honored to connect and learn with you all!
with love,
sue