Loan Approval ML

Classification models using Spark ML.

Problem Statement

Financial institutions face the dual challenge of minimizing the risk of loan defaults while efficiently processing legitimate applications. The goal was to build a predictive model to automate the approval process with high accuracy.

My Role

  • Performed Exploratory Data Analysis (EDA) to understand feature distributions.
  • Engineered features and handled class imbalances.
  • Trained and evaluated multiple classification models using Spark MLlib on Databricks.
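
The write-up does not say which technique was used for the class imbalance. One common option is inverse-frequency class weighting, sketched below in plain Python. The function name `balanced_class_weights` and the sample labels are illustrative assumptions, not taken from the project.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    Uses weight = n_samples / (n_classes * class_count), so the weighted
    sample count still sums to n_samples and every class contributes an
    equal share of it.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 8 repaid loans (0) and 2 defaults (1).
example_labels = [0] * 8 + [1] * 2
example_weights = balanced_class_weights(example_labels)
```

In Spark ML, weights like these could be written to a column and passed to a classifier that accepts a `weightCol` parameter, such as `LogisticRegression`.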

Tools & Tech

Spark, Python, Machine Learning, Databricks

Dataset / Inputs

Historical loan application data including applicant demographics, credit history, income levels, loan amount, and repayment status.

Approach

1. Data Preprocessing

Handled missing values, one-hot encoded categorical variables, scaled numerical features, and combined all features into a single vector with VectorAssembler.

2. Model Selection

Trained Logistic Regression, Random Forest, and Gradient Boosted Tree models to compare performance.

3. Evaluation

Assessed models using AUC-ROC, accuracy, and confusion matrices, comparing the false positive and false negative counts of each model.
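
To make the confusion-matrix terms concrete, here is a small plain-Python sketch that derives accuracy and the error rates from labels and predictions. It assumes the positive class (1) means "default"; the labels and predictions are made-up illustration data, not project results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    tn: int  # predicted negative, actually negative
    fn: int  # predicted negative, actually positive


def confusion_matrix(labels, predictions, positive=1):
    """Count the four outcome types for a binary classifier."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return ConfusionMatrix(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is absent.
    return numerator / denominator if denominator else 0.0


def summarize(cm):
    """Derive accuracy and error rates from a ConfusionMatrix."""
    total = cm.tp + cm.fp + cm.tn + cm.fn
    return {
        "accuracy": _ratio(cm.tp + cm.tn, total),
        "precision": _ratio(cm.tp, cm.tp + cm.fp),
        "recall": _ratio(cm.tp, cm.tp + cm.fn),
        "false_positive_rate": _ratio(cm.fp, cm.fp + cm.tn),
        "false_negative_rate": _ratio(cm.fn, cm.fn + cm.tp),
    }


# Made-up example: 1 = default, 0 = repaid.
example_labels = [1, 1, 0, 0, 1, 0, 0, 0]
example_predictions = [1, 0, 0, 1, 1, 0, 0, 0]
example_cm = confusion_matrix(example_labels, example_predictions)
example_summary = summarize(example_cm)
```

Under this labeling, a false negative is a loan the model would approve that later defaults, and a false positive is a creditworthy applicant the model would reject. The two errors carry different costs, which is why accuracy alone is not enough to choose a model.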

Key Insights

  • Credit History and Debt-to-Income Ratio were the most significant predictors of default.
  • The Random Forest classifier achieved the highest accuracy (approx. 85%) and generalized best of the three models.
  • Feature importance analysis showed that applicant income had less impact than expected relative to credit behavior.

Business Impact

  • Risk Reduction: Potential to reduce default rates by ~15% through better prediction.
  • Efficiency: Automating initial screening reduces manual review time significantly.
  • Consistency: Applies the same criteria to every application, reducing the influence of individual reviewer bias on the initial approval decision.