
Build a Loan Approval Predictor Using Python and Machine Learning

Introduction

In the modern financial landscape, banks and lending institutions process thousands of loan applications daily. Making accurate loan approval decisions is crucial for minimizing financial risk while ensuring deserving applicants receive funding. This is where machine learning comes to the rescue!

In this beginner-friendly tutorial, we'll build a loan approval predictor using Python and popular machine learning libraries. You'll learn how to analyze financial data, preprocess it effectively, and create a classification model that can predict whether a loan application should be approved or rejected.

What You'll Learn

  • How to handle real-world finance datasets
  • Essential data preprocessing techniques
  • Building and evaluating classification models
  • Using Python, Pandas, and Scikit-learn for machine learning

The Dataset

We'll be working with the Kaggle Loan Approval dataset, which contains information about loan applicants, including their income, credit history, property area, and loan approval status. It's a good fit for beginners: it's relatively small, yet it presents common real-world challenges such as missing values and a mix of numerical and categorical features.

Prerequisites

Before we dive in, make sure you have:

  • Basic Python knowledge
  • Python installed with the following libraries:
    • pandas
    • numpy
    • scikit-learn
    • matplotlib
    • seaborn

You can install these packages using pip:

pip install pandas numpy scikit-learn matplotlib seaborn

Step 1: Setting Up and Loading the Data

Let's start by importing the necessary libraries and loading our dataset.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

# Load the dataset
# Download from: https://www.kaggle.com/datasets/ninzaami/loan-predication
df = pd.read_csv('loan_approval_dataset.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
# Get basic information about the dataset
print("\nDataset Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())

Step 2: Exploratory Data Analysis (EDA)

Understanding your data is crucial in machine learning. Let's explore the dataset to uncover patterns and insights.

# Check the distribution of loan approval status
plt.figure(figsize=(8, 6))
df['Loan_Status'].value_counts().plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.title('Distribution of Loan Approval Status')
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

print("Loan Approval Distribution:")
print(df['Loan_Status'].value_counts(normalize=True))

Distribution of loan approvals

# Analyze numerical features
numerical_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Distribution of Numerical Features')

for i, feature in enumerate(numerical_features):
    ax = axes[i//2, i%2]
    df[feature].hist(bins=30, ax=ax, alpha=0.7)
    ax.set_title(f'Distribution of {feature}')
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()
# Analyze categorical features and their relationship with loan approval
categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Loan Approval Rate by Categorical Features')

for i, feature in enumerate(categorical_features):
    ax = axes[i//3, i%3]
    
    # Calculate approval rate for each category
    approval_rate = df.groupby(feature)['Loan_Status'].apply(lambda x: (x=='Y').sum()/len(x))
    approval_rate.plot(kind='bar', ax=ax, color='lightgreen')
    
    ax.set_title(f'Loan Approval Rate by {feature}')
    ax.set_xlabel(feature)
    ax.set_ylabel('Approval Rate')
    ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Categorical features analysis

Step 3: Data Preprocessing

Real-world finance data often contains missing values and inconsistencies. Let's clean and prepare our data for machine learning.

3.1 Handling Missing Values

# Check missing values in detail
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100

missing_info = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percentage
})

print("Missing Data Summary:")
print(missing_info[missing_info['Missing Count'] > 0])
# Handle missing values
# For categorical variables, use the mode (most frequent value)
categorical_columns = ['Gender', 'Married', 'Dependents', 'Self_Employed']
for col in categorical_columns:
    if df[col].isnull().any():
        mode_value = df[col].mode()[0]
        # Assign back instead of inplace=True, which pandas is deprecating
        df[col] = df[col].fillna(mode_value)
        print(f"Filled {col} missing values with: {mode_value}")

# For numerical variables, use the median (robust to outliers)
numerical_columns = ['LoanAmount', 'Loan_Amount_Term']
for col in numerical_columns:
    if df[col].isnull().any():
        median_value = df[col].median()
        df[col] = df[col].fillna(median_value)
        print(f"Filled {col} missing values with median: {median_value}")

# Verify no missing values remain
print("\nMissing values after preprocessing:")
print(df.isnull().sum().sum())

3.2 Feature Engineering

# Create new features that might be useful for loan approval prediction
# Total Income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']

# Loan Amount to Income Ratio
df['Loan_Income_Ratio'] = df['LoanAmount'] / df['Total_Income']

# Log transformation to handle skewness (log1p computes log(1 + x))
df['Log_LoanAmount'] = np.log1p(df['LoanAmount'])
df['Log_Total_Income'] = np.log1p(df['Total_Income'])

print("New features created:")
print("- Total_Income")
print("- Loan_Income_Ratio") 
print("- Log_LoanAmount")
print("- Log_Total_Income")

3.3 Encoding Categorical Variables

# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# Encode categorical variables, keeping one fitted encoder per column
# so the same mappings can be reapplied to new data at prediction time
label_encoders = {}

categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 
                       'Self_Employed', 'Property_Area', 'Loan_Status']

for feature in categorical_features:
    le = LabelEncoder()
    df_processed[feature] = le.fit_transform(df_processed[feature])
    label_encoders[feature] = le
    print(f"Encoded {feature}")

# Display the first few rows of processed data
print("\nProcessed data (first 5 rows):")
print(df_processed.head())

Step 4: Model Building and Training

Now comes the exciting part: building our machine learning model to predict loan approvals!

4.1 Preparing Features and Target

# Define features and target variable
feature_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
                  'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term',
                  'Property_Area', 'Total_Income', 'Loan_Income_Ratio', 
                  'Log_LoanAmount', 'Log_Total_Income']

X = df_processed[feature_columns]
y = df_processed['Loan_Status']

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nTarget distribution:")
print(y.value_counts())

4.2 Splitting the Data

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
print(f"Training set loan approval rate: {y_train.mean():.2f}")
print(f"Testing set loan approval rate: {y_test.mean():.2f}")

4.3 Feature Scaling

# Scale numerical features
scaler = StandardScaler()
numerical_features_idx = [X.columns.get_loc(col) for col in 
                         ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 
                          'Loan_Amount_Term', 'Total_Income', 'Loan_Income_Ratio',
                          'Log_LoanAmount', 'Log_Total_Income']]

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Fit scaler on training data and transform both sets
X_train_scaled.iloc[:, numerical_features_idx] = scaler.fit_transform(
    X_train.iloc[:, numerical_features_idx]
)
X_test_scaled.iloc[:, numerical_features_idx] = scaler.transform(
    X_test.iloc[:, numerical_features_idx]
)

print("Feature scaling completed!")

4.4 Training Multiple Models

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

# Train and evaluate models
model_results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    if name == 'Logistic Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    model_results[name] = {
        'model': model,
        'accuracy': accuracy,
        'predictions': y_pred
    }
    
    print(f"{name} Accuracy: {accuracy:.4f}")

Step 5: Model Evaluation

Let's thoroughly evaluate our models to understand their performance.

# Compare model performances
performance_df = pd.DataFrame({
    'Model': list(model_results.keys()),
    'Accuracy': [results['accuracy'] for results in model_results.values()]
})

print("Model Performance Comparison:")
print(performance_df)

# Visualize model performance
plt.figure(figsize=(10, 6))
bars = plt.bar(performance_df['Model'], performance_df['Accuracy'], 
               color=['lightblue', 'lightgreen'])
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.ylim(0, 1)

# Add accuracy values on top of bars
for bar, accuracy in zip(bars, performance_df['Accuracy']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{accuracy:.3f}', ha='center', va='bottom')

plt.show()

Model performance comparison

# Detailed evaluation for the best performing model
best_model_name = performance_df.loc[performance_df['Accuracy'].idxmax(), 'Model']
best_model_results = model_results[best_model_name]

print(f"\nDetailed Evaluation for {best_model_name}:")
print("\nClassification Report:")
print(classification_report(y_test, best_model_results['predictions']))

# Confusion Matrix
cm = confusion_matrix(y_test, best_model_results['predictions'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Rejected', 'Approved'],
            yticklabels=['Rejected', 'Approved'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

Confusion matrix

Feature Importance Analysis

# Analyze feature importance (for Random Forest)
if 'Random Forest' in model_results:
    rf_model = model_results['Random Forest']['model']
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': rf_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10))
    
    # Plot feature importance
    plt.figure(figsize=(10, 8))
    sns.barplot(data=feature_importance.head(10), y='Feature', x='Importance')
    plt.title('Top 10 Feature Importance (Random Forest)')
    plt.xlabel('Importance Score')
    plt.tight_layout()
    plt.show()

Feature importance

Step 6: Making Predictions on New Data

Let's create a function to make predictions on new loan applications.

def predict_loan_approval(model, scaler, applicant_data, model_type='Random Forest'):
    """
    Predict loan approval for new applicant data
    
    Parameters:
    model: trained model
    scaler: fitted StandardScaler
    applicant_data: dictionary with applicant information
    model_type: type of model being used
    
    Returns:
    prediction and probability
    """
    
    # Create DataFrame from input data
    input_df = pd.DataFrame([applicant_data])
    
    # Apply the same preprocessing steps
    # Calculate derived features
    input_df['Total_Income'] = input_df['ApplicantIncome'] + input_df['CoapplicantIncome']
    input_df['Loan_Income_Ratio'] = input_df['LoanAmount'] / input_df['Total_Income']
    input_df['Log_LoanAmount'] = np.log1p(input_df['LoanAmount'])
    input_df['Log_Total_Income'] = np.log1p(input_df['Total_Income'])
    
    # Encode categorical variables (you would need to save the label encoders)
    # For simplicity, assuming numerical encoding is already done
    
    # Select features
    input_features = input_df[feature_columns]
    
    # Scale features if using Logistic Regression
    if model_type == 'Logistic Regression':
        input_features_scaled = input_features.copy()
        input_features_scaled.iloc[:, numerical_features_idx] = scaler.transform(
            input_features.iloc[:, numerical_features_idx]
        )
        prediction = model.predict(input_features_scaled)[0]
        probability = model.predict_proba(input_features_scaled)[0]
    else:
        prediction = model.predict(input_features)[0]
        probability = model.predict_proba(input_features)[0]
    
    return prediction, probability

# Example usage
sample_applicant = {
    'Gender': 1,  # Male
    'Married': 1,  # Yes
    'Dependents': 0,  # 0
    'Education': 1,  # Graduate
    'Self_Employed': 0,  # No
    'ApplicantIncome': 5000,
    'CoapplicantIncome': 2000,
    'LoanAmount': 150,
    'Loan_Amount_Term': 360,
    'Property_Area': 1  # Urban
}

best_model = model_results[best_model_name]['model']
prediction, probability = predict_loan_approval(best_model, scaler, sample_applicant, best_model_name)

print(f"\nSample Prediction:")
print(f"Loan Status: {'Approved' if prediction == 1 else 'Rejected'}")
print(f"Approval Probability: {probability[1]:.3f}")
print(f"Rejection Probability: {probability[0]:.3f}")
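The comment inside predict_loan_approval notes that the label encoders would need to be saved. A minimal, self-contained sketch of that idea, using illustrative column values rather than the full dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training frame; column names mirror the tutorial's dataset
train = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
})

# Fit one encoder per column and keep it for later reuse
encoders = {}
for col in train.columns:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    encoders[col] = le

# At prediction time, reuse the stored encoder (transform, never refit)
new_row = pd.DataFrame({'Gender': ['Female'], 'Property_Area': ['Urban']})
for col, le in encoders.items():
    new_row[col] = le.transform(new_row[col])

print(new_row)  # categories mapped with the training-time codes
```

In practice you would persist the encoders dictionary alongside the model, so prediction-time inputs are mapped with exactly the same integer codes that the model was trained on.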

Step 7: Model Optimization (Optional)

For those wanting to take their model further, here's how to optimize performance:

from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform grid search
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

print("Performing hyperparameter tuning...")
rf_grid.fit(X_train, y_train)

print(f"Best parameters: {rf_grid.best_params_}")
print(f"Best cross-validation score: {rf_grid.best_score_:.4f}")

# Evaluate optimized model
optimized_predictions = rf_grid.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_predictions)
print(f"Optimized model accuracy: {optimized_accuracy:.4f}")

Conclusion

Congratulations! You've successfully built a loan approval predictor using Python and machine learning. Here's what you accomplished:

Key Achievements:

  • ✅ Loaded and explored a real-world finance dataset
  • ✅ Performed comprehensive data preprocessing and feature engineering
  • ✅ Built and compared multiple classification models
  • ✅ Evaluated model performance using various metrics
  • ✅ Created a function to make predictions on new data

Key Insights:

  1. Data preprocessing is crucial for model performance
  2. Feature engineering can significantly improve predictions
  3. Different algorithms may perform differently on the same dataset
  4. Model evaluation should go beyond just accuracy
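To illustrate insight 4, here is a small self-contained sketch of metrics beyond accuracy, using toy labels and probabilities rather than the tutorial's model output:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative, not model output)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.9, 0.4, 0.8, 0.6, 0.2, 0.7, 0.55, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

# Precision: of the loans we approved, how many were truly good?
# Recall: of the truly good loans, how many did we approve?
# ROC-AUC: ranking quality across all probability thresholds
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_prob))
```

For imbalanced loan data, precision and recall expose failure modes that a single accuracy number hides: a model that approves everyone can still look accurate if most loans in the data were approved.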

Model Performance Summary:

Our models achieved good accuracy in predicting loan approvals, with Random Forest typically performing slightly better due to its ability to handle non-linear relationships and feature interactions.

Next Steps and Further Learning

Immediate Improvements:

  1. Handle class imbalance using techniques like SMOTE
  2. Cross-validation for more robust model evaluation
  3. Feature selection to identify the most predictive variables
  4. Ensemble methods combining multiple models
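Item 2, cross-validation, can be sketched as follows. The synthetic data here is a stand-in, so swap in the tutorial's X_train and y_train when running end to end:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; replace with the tutorial's X_train / y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring='accuracy')

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a far more honest picture than a single train/test split, especially on a dataset this small.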

Advanced Techniques:

# Example: Handling class imbalance with SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print(f"Original training set: {y_train.value_counts()}")
print(f"Balanced training set: {pd.Series(y_train_balanced).value_counts()}")

Learning Resources:

  1. Scikit-learn Documentation: Comprehensive guide to ML algorithms
  2. Kaggle Learn: Free micro-courses on machine learning
  3. Financial ML Books: "Advances in Financial Machine Learning" by Marcos López de Prado
  4. Practice Datasets: Explore more finance datasets on Kaggle

Real-World Applications:

  • Credit scoring systems
  • Insurance claim prediction
  • Fraud detection
  • Investment risk assessment

Deployment Considerations:

  • Model interpretability for regulatory compliance
  • Regular model retraining with new data
  • A/B testing for model updates
  • Monitoring for model drift
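For the retraining and deployment points above, a common pattern is to persist the validated model and scaler together so the serving process uses exactly the artifacts that were evaluated. A minimal sketch using joblib, with toy data standing in for the tutorial's and an illustrative filename:

```python
import numpy as np
import joblib  # installed alongside scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the tutorial's trained scaler and model
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Persist both artifacts in one file so they cannot drift apart
joblib.dump({'model': model, 'scaler': scaler}, 'loan_model.joblib')

# Later, in the serving process: reload and predict
loaded = joblib.load('loan_model.joblib')
pred = loaded['model'].predict(loaded['scaler'].transform([[3.5, 3.5]]))
print("Prediction:", pred[0])
```

Bundling the scaler with the model prevents a classic production bug: scoring raw features with a model that was trained on scaled ones.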

Keep practicing with different datasets and algorithms to strengthen your machine learning skills in finance. The field of financial ML is rapidly evolving, offering exciting opportunities for data scientists and analysts!

Remember: The key to mastering machine learning is consistent practice and staying curious about new techniques and applications. Happy coding! 🚀
