Predict Loan Default with Decision Trees

Introduction

Predicting loan default is a critical task in finance analytics. Banks and financial institutions need to assess the risk associated with lending money to individuals or businesses. By accurately predicting whether a borrower will default, lenders can minimize losses and make informed decisions.

In this tutorial, we will use a Decision Tree classifier to predict loan default risk. Decision Trees are intuitive, interpretable, and effective for classification tasks, making them ideal for beginners in finance analytics.

We will use the Kaggle Loan Prediction dataset, which contains historical loan data with features like income, credit history, and loan amount. Strictly speaking, its target variable records whether a loan was approved rather than whether a borrower defaulted, so we treat approval status as a proxy for credit risk throughout this tutorial. The dataset is well-suited for our purposes because it includes both numerical and categorical features, allowing us to practice data cleaning, feature encoding, and model evaluation.

By the end of this tutorial, you will learn:

  • How to load and explore a loan dataset.
  • How to preprocess data for machine learning.
  • How to train and evaluate a Decision Tree model.
  • How to interpret the model’s predictions.

Let’s get started!


Step 1: Setting Up the Environment

Before we begin, ensure you have the following Python libraries installed:

  • pandas for data manipulation.
  • numpy for numerical operations.
  • scikit-learn for machine learning.
  • matplotlib and seaborn for data visualization.

You can install these libraries using pip:

pip install pandas numpy scikit-learn matplotlib seaborn

Step 2: Loading and Exploring the Dataset

Loading the Dataset

First, let’s load the dataset. Download the Loan Prediction dataset from Kaggle and save it in your working directory; for this tutorial, we assume it is saved as loan_data.csv.

import pandas as pd

# Load the dataset
data = pd.read_csv('loan_data.csv')
print(data.head())

Exploring the Dataset

Let’s explore the dataset to understand its structure and features.

# Display basic information about the dataset
print(data.info())

# Display summary statistics
print(data.describe())

Understanding the Features

The dataset typically includes the following features:

  • Loan_ID: Unique identifier for each loan.
  • Gender: Male or Female.
  • Married: Whether the applicant is married (Yes/No).
  • Dependents: Number of dependents.
  • Education: Graduate or Not Graduate.
  • Self_Employed: Whether the applicant is self-employed (Yes/No).
  • ApplicantIncome: Income of the applicant.
  • CoapplicantIncome: Income of the co-applicant.
  • LoanAmount: Loan amount requested.
  • Loan_Amount_Term: Term of the loan in months.
  • Credit_History: Credit history (1 for good, 0 for bad).
  • Property_Area: Urban, Semi-Urban, or Rural.
  • Loan_Status: Target variable (Y for approved, N for rejected).
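If you don’t have the CSV handy, the column list above is enough to dry-run the rest of the code. As a sketch, here is a tiny hand-made DataFrame with the same schema (the two rows and their values are made up purely for illustration):

```python
import pandas as pd

# Two made-up rows matching the Loan Prediction schema, for dry runs only
sample = pd.DataFrame({
    'Loan_ID': ['LP001002', 'LP001003'],
    'Gender': ['Male', 'Female'],
    'Married': ['No', 'Yes'],
    'Dependents': ['0', '3+'],
    'Education': ['Graduate', 'Not Graduate'],
    'Self_Employed': ['No', 'Yes'],
    'ApplicantIncome': [5849, 4583],
    'CoapplicantIncome': [0.0, 1508.0],
    'LoanAmount': [130.0, 128.0],
    'Loan_Amount_Term': [360.0, 360.0],
    'Credit_History': [1.0, 0.0],
    'Property_Area': ['Urban', 'Rural'],
    'Loan_Status': ['Y', 'N'],
})
print(sample.shape)  # (2, 13)
```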

Step 3: Data Cleaning and Preprocessing

Handling Missing Values

Missing values can affect the performance of our model. Let’s check for missing values and handle them appropriately.

# Check for missing values
print(data.isnull().sum())

# Fill missing categorical features with the mode
# (use assignment; Series.fillna(..., inplace=True) is deprecated in pandas 2.x)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    data[col] = data[col].fillna(data[col].mode()[0])

# Fill missing numerical features with the median
for col in ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']:
    data[col] = data[col].fillna(data[col].median())

# Confirm no missing values remain
print(data.isnull().sum().sum())

Encoding Categorical Features

Machine learning models require numerical input. Here we use Label Encoding for simplicity; One-Hot Encoding is a common alternative for unordered categories such as Property_Area.

from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode categorical features
for col in ['Gender', 'Married', 'Education', 'Self_Employed',
            'Property_Area', 'Loan_Status']:
    data[col] = label_encoder.fit_transform(data[col])

# Dependents contains the string '3+'; map it to 3 and convert to int,
# otherwise the model cannot fit on this column
data['Dependents'] = data['Dependents'].replace('3+', 3).astype(int)
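For the unordered Property_Area column, One-Hot Encoding avoids implying that the integer labels have an order. A minimal sketch with pd.get_dummies on toy values (on the real data, you would pass data and the same column name):

```python
import pandas as pd

# Toy column standing in for data['Property_Area']
df = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban']})

# One indicator column per category; the original column is dropped
encoded = pd.get_dummies(df, columns=['Property_Area'], prefix='Area')
print(encoded.columns.tolist())
# ['Area_Rural', 'Area_Semiurban', 'Area_Urban']
```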

Feature Scaling

Decision Trees split on feature thresholds, so they are insensitive to monotonic scaling and we can safely skip it here. Scale-sensitive models such as k-NN or regularized logistic regression would require it.
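For reference, this is how you would standardize a numerical column for a scale-sensitive model; the income values below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize to zero mean and unit variance (needed for k-NN, SVMs, etc.)
incomes = np.array([[5849.0], [4583.0], [3000.0], [6000.0]])
scaled = StandardScaler().fit_transform(incomes)
print(scaled.mean(), scaled.std())  # mean ~ 0, std ~ 1
```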


Step 4: Exploratory Data Analysis (EDA)

Visualizing the Target Variable

Let’s visualize the distribution of the target variable (Loan_Status).

import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of Loan_Status
sns.countplot(x='Loan_Status', data=data)
plt.title('Distribution of Loan Status')
plt.show()

Loan Status Distribution

From the plot, we can see the distribution of approved and rejected loans. This helps us understand the class balance in our dataset.
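Rather than eyeballing the plot, we can quantify the balance with value_counts. The sketch below uses a stand-in Series; on the real data you would call data['Loan_Status'].value_counts(normalize=True):

```python
import pandas as pd

# Stand-in for data['Loan_Status']: three approvals, one rejection
loan_status = pd.Series(['Y', 'Y', 'Y', 'N'])
proportions = loan_status.value_counts(normalize=True)
print(proportions)  # Y -> 0.75, N -> 0.25
```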

Visualizing Numerical Features

Let’s visualize the distribution of numerical features like ApplicantIncome and LoanAmount.

# Plot the distribution of ApplicantIncome
sns.histplot(data['ApplicantIncome'], kde=True)
plt.title('Distribution of Applicant Income')
plt.show()

Applicant Income Distribution

Visualizing Categorical Features

Let’s visualize the relationship between categorical features and the target variable.

# Plot the relationship between Education and Loan_Status
sns.countplot(x='Education', hue='Loan_Status', data=data)
plt.title('Loan Status by Education')
plt.show()

Loan Status by Education


Step 5: Training the Decision Tree Model

Splitting the Dataset

We will split the dataset into training and testing sets to evaluate the model’s performance.

from sklearn.model_selection import train_test_split

# Define features and target
X = data.drop(['Loan_ID', 'Loan_Status'], axis=1)
y = data['Loan_Status']

# Split the dataset (stratify on y so both splits keep the same class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Training the Model

Now, let’s train a Decision Tree classifier.

from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the model
dt_classifier.fit(X_train, y_train)

Evaluating the Model

Let’s evaluate the model’s performance using accuracy, precision, recall, and the confusion matrix.

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Make predictions
y_pred = dt_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision and recall; after label encoding, the positive class
# (pos_label=1) corresponds to approved loans ('Y')
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision:.2f}')

recall = recall_score(y_test, y_pred)
print(f'Recall: {recall:.2f}')

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)
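Accuracy alone can be misleading when classes are imbalanced; classification_report summarizes per-class precision, recall, and F1 in one call. A self-contained sketch on toy labels (with the real model you would pass y_test and y_pred):

```python
from sklearn.metrics import classification_report, f1_score

# Toy ground truth and predictions; 1 = approved, 0 = rejected
y_true = [1, 0, 1, 1, 0, 1]
y_hat  = [1, 0, 1, 0, 0, 1]

print(classification_report(y_true, y_hat, target_names=['Rejected', 'Approved']))
print(f'F1: {f1_score(y_true, y_hat):.2f}')  # 0.86 on these toy labels
```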

Visualizing the Decision Tree

Let’s visualize the Decision Tree to understand how it makes decisions.

from sklearn.tree import plot_tree

# Plot the Decision Tree (pass feature names as a plain list for compatibility
# with older scikit-learn versions)
plt.figure(figsize=(20, 10))
plot_tree(dt_classifier, feature_names=list(X.columns), class_names=['Rejected', 'Approved'], filled=True)
plt.show()

Decision Tree Visualization
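Feature importances are another way to interpret the model: on the trained tutorial model, dt_classifier.feature_importances_ paired with X.columns ranks the features by how much they reduce impurity. The self-contained sketch below demonstrates the attribute on a toy fit where, by construction, the label depends only on the first feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label copies feature 0, so all importance lands there
X_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y_toy = np.array([0, 1, 1, 0])

clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
print(clf.feature_importances_)  # [1. 0.]
```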


Step 6: Improving the Model

Hyperparameter Tuning

We can improve the model’s performance by tuning hyperparameters like max_depth, min_samples_split, and min_samples_leaf.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Training the Improved Model

Let’s train the model with the best parameters.

# Initialize the improved Decision Tree classifier with the best parameters
improved_dt_classifier = DecisionTreeClassifier(**best_params, random_state=42)

# Train the improved model
improved_dt_classifier.fit(X_train, y_train)

# Evaluate the improved model
y_pred_improved = improved_dt_classifier.predict(X_test)
accuracy_improved = accuracy_score(y_test, y_pred_improved)
print(f'Improved Accuracy: {accuracy_improved:.2f}')
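One shortcut worth knowing: with the default refit=True, GridSearchCV already retrains a model with the best parameters on the whole training set and exposes it as best_estimator_, so re-initializing by hand is optional. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=200, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      {'max_depth': [3, 5]}, cv=3)
search.fit(X_tr, y_tr)

# Already refit on X_tr with the winning max_depth
best_model = search.best_estimator_
print(best_model.get_params()['max_depth'], best_model.score(X_te, y_te))
```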

Conclusion and Next Steps

In this tutorial, we learned how to predict loan default risk using a Decision Tree classifier. We covered the following steps:

  1. Loading and exploring the dataset.
  2. Cleaning and preprocessing the data.
  3. Performing exploratory data analysis (EDA).
  4. Training and evaluating a Decision Tree model.
  5. Improving the model through hyperparameter tuning.

Key Insights

  • Decision Trees are intuitive and effective for classification tasks.
  • Data cleaning and preprocessing are crucial for model performance.
  • Hyperparameter tuning can significantly improve model accuracy.

Next Steps

  • Experiment with other machine learning models like Random Forest or Logistic Regression.
  • Explore feature engineering techniques to create new features.
  • Deploy the model as a web application using Flask or Django.

This tutorial provides a solid foundation for predicting loan default risk. By following these steps, you can build and evaluate your own models for finance analytics. Happy coding!
