Author: Ricardo Daniel Teixeira Gonçalves

Course: Elements of Artificial Intelligence and Data Science (EIACD)
This project, developed for Assignment 2 of the "Elements of Artificial Intelligence and Data Science" (EIACD) course at the University of Porto, focuses on building a complete Machine Learning (ML) pipeline. The primary goal is to predict whether a student will pass or fail their final exam, serving as an early intervention tool to identify at-risk students and enable proactive support.
The system utilizes an adapted real-world dataset from Cortez and Silva (2008), which includes academic, demographic, and social features of Portuguese secondary school students.
The dataset, `student-data.csv`, combines 30 attributes from students in two Portuguese secondary schools ("GP" - Gabriel Pereira and "MS" - Mousinho da Silveira). Key attributes include:
- Demographic: `sex`, `age`, `address` (urban/rural), `famsize`, `Pstatus` (parents' cohabitation status).
- Parental Background: `Medu` (mother's education), `Fedu` (father's education), `Mjob`, `Fjob`.
- School-related: `reason` (for choosing school), `guardian`, `traveltime`, `studytime`, `failures` (past), `schoolsup`, `famsup`, `paid` (extra classes), `activities`, `nursery`, `higher` (wants higher education), `absences`.
- Social/Lifestyle: `internet`, `romantic`, `famrel` (family relations), `freetime`, `goout`, `Dalc` (workday alcohol), `Walc` (weekend alcohol), `health`.
- Target Variable: `passed` (originally the `G3` grade, transformed to binary yes/no).
A detailed data dictionary is available within the notebook (Section 2.1).
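The transformation of `G3` into the binary `passed` target can be sketched as follows. This is a minimal illustration, not the code used to build `student-data.csv`: the sample grades are made up, and the pass threshold of 10 (on the 0-20 grading scale) is an assumption based on the usual Portuguese passing mark.

```python
import pandas as pd

# Hypothetical final grades (G3) on the 0-20 scale; not taken from the dataset.
grades = pd.DataFrame({"G3": [0, 8, 9, 10, 15, 20]})

# Assumption: a final grade of 10 or more counts as a pass.
PASS_THRESHOLD = 10

# Map the boolean comparison onto the yes/no labels used by the target column.
grades["passed"] = (grades["G3"] >= PASS_THRESHOLD).map({True: "yes", False: "no"})

print(grades)
```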
The project follows a standard machine learning pipeline:
- Data Loading & Initial Exploration:
  - Import necessary libraries (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Imblearn).
  - Load the dataset.
- Exploratory Data Analysis (EDA):
  - Data dictionary review and dataset preview (`.head()`, `.info()`, `.describe()`).
  - Analysis of the `school` variable (found imbalanced and subsequently dropped).
  - Feature correlation analysis (heatmap) identifying relationships like `Medu`/`Fedu` and `Dalc`/`Walc`.
  - Detailed analysis of `age`, `absences`, and `studytime` and their relationship with the `passed` outcome.
- Data Cleaning & Preprocessing:
  - Checked for missing values and duplicates (none found).
  - Outlier removal: Students aged 22 were removed based on EDA.
  - Encoding:
    - Binary encoding for 'yes'/'no' features.
    - One-Hot Encoding for other nominal categorical features (`Mjob`, `Fjob`, `reason`, `guardian`, `sex`, `address`, `famsize`, `Pstatus`), using `drop_first=True`.
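The two encoding steps can be sketched on a toy frame as shown here. The rows are invented for illustration and only a subset of the columns is used; the notebook's actual code may differ in detail.

```python
import pandas as pd

# Hypothetical rows using a few of the dataset's column names.
df = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "health", "other"],
    "internet": ["yes", "no", "yes"],
    "higher": ["yes", "yes", "no"],
})

binary_cols = ["internet", "higher"]
nominal_cols = ["sex", "Mjob"]

# Binary encoding: map 'yes'/'no' to 1/0.
for col in binary_cols:
    df[col] = df[col].map({"yes": 1, "no": 0})

# One-hot encoding; drop_first=True drops the first category of each
# feature to avoid perfectly collinear dummy columns.
encoded = pd.get_dummies(df, columns=nominal_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With `drop_first=True`, a feature with k categories yields k-1 dummy columns, so `sex` becomes the single column `sex_M` here.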
- Feature Engineering:
  - `avgEdu`: Created by averaging `Medu` and `Fedu`.
  - `student_support`: Created by summing binary `famsup` and `schoolsup`.
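A minimal sketch of the two engineered features, using made-up values and assuming `famsup` and `schoolsup` have already been encoded as 0/1:

```python
import pandas as pd

# Hypothetical, already-encoded rows (famsup and schoolsup as 0/1).
df = pd.DataFrame({
    "Medu": [4, 1, 3],
    "Fedu": [2, 1, 4],
    "famsup": [1, 0, 1],
    "schoolsup": [1, 0, 0],
})

# Average parental education level.
df["avgEdu"] = (df["Medu"] + df["Fedu"]) / 2

# Number of support sources the student receives (0, 1, or 2).
df["student_support"] = df["famsup"] + df["schoolsup"]

# Drop the source columns once the combined features exist.
df = df.drop(columns=["Medu", "Fedu", "famsup", "schoolsup"])

print(df)
```

Dropping the source columns afterwards avoids feeding the models both a feature and a direct function of it.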
- Feature Reduction:
  - Original features used for feature engineering (`Medu`, `Fedu`, `famsup`, `schoolsup`) were dropped.
  - Features deemed not relevant or used only for EDA (`school`, `abs_cat`, `study_cat`) were dropped.
  - PCA was performed for analysis but not used for dimensionality reduction in the final models.
- Class Balance Assessment:
  - The target variable `passed` was found to be imbalanced (67% Pass, 33% Fail).
  - SMOTE (Synthetic Minority Over-sampling Technique) was chosen to address this during model training.
- Model Training, Tuning & Evaluation:
  - Train-Test Split: Data was split into training (70%) and testing (30%) sets, stratified by the target variable.
  - Pipelines: `ImbPipeline` from `imblearn` was used, incorporating SMOTE followed by a classifier.
  - Classifiers Evaluated:
    - Decision Tree
    - Logistic Regression (with StandardScaler)
    - K-Nearest Neighbors (KNN) (with StandardScaler)
    - Support Vector Classifier (SVC) (with StandardScaler)
    - Random Forest
    - MLPClassifier (Neural Network) (with StandardScaler)
  - Initial Evaluation: Models were first evaluated with default parameters.
  - Hyperparameter Tuning: `GridSearchCV` was used with 10-fold cross-validation, optimizing for F1-score. The procedure was repeated 20 times to obtain more robust evaluation metrics for the tuned models.
  - Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, and AUC (Area Under ROC Curve). Confusion matrices and ROC curves were plotted.
- Feature Importance Analysis:
  - Analyzed for Decision Tree, Random Forest (using `feature_importances_`) and Logistic Regression (using coefficients) based on models trained on the full dataset (before tuning).
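As a rough sketch of how the two kinds of importance can be extracted, the example below fits a Random Forest and a Logistic Regression on synthetic data. The feature names are placeholders, and taking the absolute value of the coefficients is one common convention rather than necessarily what the notebook does.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Tree-based models expose impurity-based importances that sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
rf_importance = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

# For a linear model, coefficient magnitude indicates influence;
# the sign gives the direction of the effect.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
lr_importance = pd.Series(
    logreg.coef_[0], index=feature_names
).abs().sort_values(ascending=False)

print(rf_importance)
print(lr_importance)
```

Coefficient magnitudes are only comparable across features when the features share a scale, which is one reason to standardize inputs before fitting Logistic Regression.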
- Model Demonstration:
  - A random student's data was used to demonstrate predictions across all trained (tuned) models.
- Comparative Analysis with Original Article:
  - The project's methodology (SETUP C: no prior grades used) and results were compared to the original Cortez & Silva (2008) study.
- EDA Insights:
  - Past `failures` and `absences` were identified as potentially strong predictors.
  - `studytime` showed a stronger correlation with success than `absences`.
  - Parental education (`Medu`, `Fedu`) showed a moderate correlation and was combined into `avgEdu`.
- Model Performance (After Tuning & SMOTE):
  - Tuning did not consistently improve performance over default models and, in many cases, led to a slight decrease in F1-score and Recall.
  - The best-performing models (based on F1-score and recall on cross-validation) were:
    - Random Forest: F1-Score ≈ 0.753, Recall ≈ 0.785
    - SVC: F1-Score ≈ 0.756, Recall ≈ 0.830
  - Logistic Regression showed a notable improvement in AUC (+13.17%) after tuning, despite a drop in F1-score.
- Feature Importance:
  - Across models (Decision Tree, Random Forest, Logistic Regression), `failures` was consistently the most important predictor.
  - Other significant features included `absences`, social activity (`goout`), average parental education (`avgEdu`), and whether the student wants to pursue `higher` education.
- Comparison with Original Article (Cortez & Silva, 2008 - SETUP C):
  - The current analysis aligns with the "SETUP C" methodology (no prior grades).
  - Consistent Findings: Both studies highlighted `failures` and `absences` as top predictors.
  - Performance: Model accuracies were slightly lower than in the original paper but followed similar patterns (Random Forest and SVM performing well, around 65-70% accuracy in the current study vs. ~70% PCC in the article).
- Ensure you have Python 3.x installed.
- Install Jupyter Notebook or JupyterLab.
- Install the required libraries:
  `pip install numpy pandas matplotlib seaborn scikit-learn imbalanced-learn`
- Place the `student-data.csv` file in the same directory as the notebook.
- Open and run the `stu_inte_sys.ipynb` notebook.
- Data Manipulation & Analysis: NumPy, Pandas
- Visualization: Matplotlib, Seaborn
- Data Preprocessing & Machine Learning: Scikit-learn (StandardScaler, PCA, train_test_split, GridSearchCV, StratifiedKFold, various classifiers and metrics)
- Resampling: Imbalanced-learn (SMOTE, ImbPipeline)
- Built-in: random, time
- Cortez, P., & Silva, A. M. G. (2008). Using Data Mining to Predict Secondary School Student Performance. University of Minho.
- Russell, S. J., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.).
- Hurbans, R. (2020). Grokking Artificial Intelligence Algorithms.
- Gallatin, K., & Albon, C. (2023). Machine Learning with Python Cookbook (2nd ed.).
- Documentation for Pandas, Scikit-learn, Matplotlib, and Seaborn.
