Build a Loan Approval Predictor Using Python and Machine Learning
Introduction
In the modern financial landscape, banks and lending institutions process thousands of loan applications daily. Making accurate loan approval decisions is crucial for minimizing financial risk while ensuring deserving applicants receive funding. This is where machine learning comes to the rescue!
In this beginner-friendly tutorial, we'll build a loan approval predictor using Python and popular machine learning libraries. You'll learn how to analyze financial data, preprocess it effectively, and create a classification model that can predict whether a loan application should be approved or rejected.
What You'll Learn
- How to handle real-world finance datasets
- Essential data preprocessing techniques
- Building and evaluating classification models
- Using Python, Pandas, and Scikit-learn for machine learning
The Dataset
We'll be working with the Kaggle Loan Approval dataset, which contains information about loan applicants including their income, credit history, property area, and loan approval status. This dataset is perfect for beginners as it's relatively small and contains common real-world data challenges.
Prerequisites
Before we dive in, make sure you have:
- Basic Python knowledge
- Python installed with the following libraries:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
You can install these packages using pip:
pip install pandas numpy scikit-learn matplotlib seaborn
Step 1: Setting Up and Loading the Data
Let's start by importing the necessary libraries and loading our dataset.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
# Load the dataset
# Download from: https://www.kaggle.com/datasets/ninzaami/loan-predication
df = pd.read_csv('loan_approval_dataset.csv')
# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
# Get basic information about the dataset
print("\nDataset Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())
Step 2: Exploratory Data Analysis (EDA)
Understanding your data is crucial in machine learning. Let's explore the dataset to uncover patterns and insights.
# Check the distribution of loan approval status
plt.figure(figsize=(8, 6))
df['Loan_Status'].value_counts().plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.title('Distribution of Loan Approval Status')
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
print("Loan Approval Distribution:")
print(df['Loan_Status'].value_counts(normalize=True))

# Analyze numerical features
numerical_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Distribution of Numerical Features')
for i, feature in enumerate(numerical_features):
    ax = axes[i//2, i%2]
    df[feature].hist(bins=30, ax=ax, alpha=0.7)
    ax.set_title(f'Distribution of {feature}')
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# Analyze categorical features and their relationship with loan approval
categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Loan Approval Rate by Categorical Features')
for i, feature in enumerate(categorical_features):
    ax = axes[i//3, i%3]
    # Calculate approval rate for each category
    approval_rate = df.groupby(feature)['Loan_Status'].apply(lambda x: (x=='Y').sum()/len(x))
    approval_rate.plot(kind='bar', ax=ax, color='lightgreen')
    ax.set_title(f'Loan Approval Rate by {feature}')
    ax.set_xlabel(feature)
    ax.set_ylabel('Approval Rate')
    ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
Step 3: Data Preprocessing
Real-world finance data often contains missing values and inconsistencies. Let's clean and prepare our data for machine learning.
3.1 Handling Missing Values
# Check missing values in detail
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100
missing_info = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percentage
})
print("Missing Data Summary:")
print(missing_info[missing_info['Missing Count'] > 0])
# Handle missing values
# For categorical variables, use mode (most frequent value)
categorical_columns = ['Gender', 'Married', 'Dependents', 'Self_Employed']
for col in categorical_columns:
    if df[col].isnull().any():
        mode_value = df[col].mode()[0]
        # Assign back rather than using inplace=True, which is
        # deprecated for column slices in recent pandas versions
        df[col] = df[col].fillna(mode_value)
        print(f"Filled {col} missing values with: {mode_value}")
# For numerical variables, use median
numerical_columns = ['LoanAmount', 'Loan_Amount_Term']
for col in numerical_columns:
    if df[col].isnull().any():
        median_value = df[col].median()
        df[col] = df[col].fillna(median_value)
        print(f"Filled {col} missing values with median: {median_value}")
# Verify no missing values remain
print("\nMissing values after preprocessing:")
print(df.isnull().sum().sum())
3.2 Feature Engineering
# Create new features that might be useful for loan approval prediction
# Total Income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
# Loan Amount to Income Ratio
df['Loan_Income_Ratio'] = df['LoanAmount'] / df['Total_Income']
# Log transformation to handle skewness
df['Log_LoanAmount'] = np.log(df['LoanAmount'] + 1)
df['Log_Total_Income'] = np.log(df['Total_Income'] + 1)
print("New features created:")
print("- Total_Income")
print("- Loan_Income_Ratio")
print("- Log_LoanAmount")
print("- Log_Total_Income")
3.3 Encoding Categorical Variables
# Create a copy of the dataframe for preprocessing
df_processed = df.copy()
# Initialize label encoder
le = LabelEncoder()
# Encode categorical variables
categorical_features = ['Gender', 'Married', 'Dependents', 'Education',
                        'Self_Employed', 'Property_Area', 'Loan_Status']
for feature in categorical_features:
    df_processed[feature] = le.fit_transform(df_processed[feature])
    print(f"Encoded {feature}")
# Display the first few rows of processed data
print("\nProcessed data (first 5 rows):")
print(df_processed.head())
Step 4: Model Building and Training
Now comes the exciting part: building our machine learning model to predict loan approvals!
4.1 Preparing Features and Target
# Define features and target variable
feature_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term',
'Property_Area', 'Total_Income', 'Loan_Income_Ratio',
'Log_LoanAmount', 'Log_Total_Income']
X = df_processed[feature_columns]
y = df_processed['Loan_Status']
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nTarget distribution:")
print(y.value_counts())
4.2 Splitting the Data
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
print(f"Training set loan approval rate: {y_train.mean():.2f}")
print(f"Testing set loan approval rate: {y_test.mean():.2f}")
4.3 Feature Scaling
# Scale numerical features
scaler = StandardScaler()
numerical_features_idx = [X.columns.get_loc(col) for col in
['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Total_Income', 'Loan_Income_Ratio',
'Log_LoanAmount', 'Log_Total_Income']]
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
# Fit scaler on training data and transform both sets
X_train_scaled.iloc[:, numerical_features_idx] = scaler.fit_transform(
X_train.iloc[:, numerical_features_idx]
)
X_test_scaled.iloc[:, numerical_features_idx] = scaler.transform(
X_test.iloc[:, numerical_features_idx]
)
print("Feature scaling completed!")
4.4 Training Multiple Models
# Initialize models
models = {
    # max_iter raised so Logistic Regression converges reliably
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
# Train and evaluate models
model_results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    # Train the model (Logistic Regression uses the scaled features)
    if name == 'Logistic Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    model_results[name] = {
        'model': model,
        'accuracy': accuracy,
        'predictions': y_pred
    }
    print(f"{name} Accuracy: {accuracy:.4f}")
Step 5: Model Evaluation
Let's thoroughly evaluate our models to understand their performance.
# Compare model performances
performance_df = pd.DataFrame({
    'Model': list(model_results.keys()),
    'Accuracy': [results['accuracy'] for results in model_results.values()]
})
print("Model Performance Comparison:")
print(performance_df)
# Visualize model performance
plt.figure(figsize=(10, 6))
bars = plt.bar(performance_df['Model'], performance_df['Accuracy'],
color=['lightblue', 'lightgreen'])
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
# Add accuracy values on top of bars
for bar, accuracy in zip(bars, performance_df['Accuracy']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{accuracy:.3f}', ha='center', va='bottom')
plt.show()

# Detailed evaluation for the best performing model
best_model_name = performance_df.loc[performance_df['Accuracy'].idxmax(), 'Model']
best_model_results = model_results[best_model_name]
print(f"\nDetailed Evaluation for {best_model_name}:")
print("\nClassification Report:")
print(classification_report(y_test, best_model_results['predictions']))
# Confusion Matrix
cm = confusion_matrix(y_test, best_model_results['predictions'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Rejected', 'Approved'],
yticklabels=['Rejected', 'Approved'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

5.1 Feature Importance Analysis
# Analyze feature importance (for Random Forest)
if 'Random Forest' in model_results:
    rf_model = model_results['Random Forest']['model']
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': rf_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10))
    # Plot feature importance
    plt.figure(figsize=(10, 8))
    sns.barplot(data=feature_importance.head(10), y='Feature', x='Importance')
    plt.title('Top 10 Feature Importance (Random Forest)')
    plt.xlabel('Importance Score')
    plt.tight_layout()
    plt.show()

Step 6: Making Predictions on New Data
Let's create a function to make predictions on new loan applications.
def predict_loan_approval(model, scaler, applicant_data, model_type='Random Forest'):
    """
    Predict loan approval for new applicant data.

    Parameters:
        model: trained model
        scaler: fitted StandardScaler
        applicant_data: dictionary with applicant information
        model_type: type of model being used

    Returns:
        prediction and probability
    """
    # Create DataFrame from input data
    input_df = pd.DataFrame([applicant_data])
    # Apply the same preprocessing steps: calculate derived features
    input_df['Total_Income'] = input_df['ApplicantIncome'] + input_df['CoapplicantIncome']
    input_df['Loan_Income_Ratio'] = input_df['LoanAmount'] / input_df['Total_Income']
    input_df['Log_LoanAmount'] = np.log(input_df['LoanAmount'] + 1)
    input_df['Log_Total_Income'] = np.log(input_df['Total_Income'] + 1)
    # Encode categorical variables (you would need to save the label encoders)
    # For simplicity, assuming numerical encoding is already done
    # Select features
    input_features = input_df[feature_columns]
    # Scale features if using Logistic Regression
    if model_type == 'Logistic Regression':
        input_features_scaled = input_features.copy()
        input_features_scaled.iloc[:, numerical_features_idx] = scaler.transform(
            input_features.iloc[:, numerical_features_idx]
        )
        prediction = model.predict(input_features_scaled)[0]
        probability = model.predict_proba(input_features_scaled)[0]
    else:
        prediction = model.predict(input_features)[0]
        probability = model.predict_proba(input_features)[0]
    return prediction, probability
# Example usage
sample_applicant = {
'Gender': 1, # Male
'Married': 1, # Yes
'Dependents': 0, # 0
'Education': 1, # Graduate
'Self_Employed': 0, # No
'ApplicantIncome': 5000,
'CoapplicantIncome': 2000,
'LoanAmount': 150,
'Loan_Amount_Term': 360,
'Property_Area': 1 # Urban
}
best_model = model_results[best_model_name]['model']
prediction, probability = predict_loan_approval(best_model, scaler, sample_applicant, best_model_name)
print(f"\nSample Prediction:")
print(f"Loan Status: {'Approved' if prediction == 1 else 'Rejected'}")
print(f"Approval Probability: {probability[1]:.3f}")
print(f"Rejection Probability: {probability[0]:.3f}")
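The prediction helper above assumes the categorical inputs are already numerically encoded. In practice you would fit one encoder per column during preprocessing, keep them, and reuse them at prediction time. A minimal sketch of that idea (the toy values here are illustrative, not from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training data standing in for the loan dataset's categorical columns
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
})

# Fit one encoder per column and keep them in a dict for later reuse
encoders = {}
for col in ['Gender', 'Property_Area']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

# At prediction time, apply the saved encoders to the new applicant's data
new_row = pd.DataFrame({'Gender': ['Female'], 'Property_Area': ['Urban']})
for col, le in encoders.items():
    new_row[col] = le.transform(new_row[col])

print(new_row)
```

Saving the `encoders` dict alongside the model guarantees that "Urban" maps to the same integer at prediction time as it did during training.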
Step 7: Model Optimization (Optional)
For those wanting to take their model further, here's how to optimize performance:
from sklearn.model_selection import GridSearchCV
# Hyperparameter tuning for Random Forest
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Perform grid search
rf_grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
print("Performing hyperparameter tuning...")
rf_grid.fit(X_train, y_train)
print(f"Best parameters: {rf_grid.best_params_}")
print(f"Best cross-validation score: {rf_grid.best_score_:.4f}")
# Evaluate optimized model
optimized_predictions = rf_grid.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_predictions)
print(f"Optimized model accuracy: {optimized_accuracy:.4f}")
Conclusion
Congratulations! You've successfully built a loan approval predictor using Python and machine learning. Here's what you accomplished:
Key Achievements:
- ✅ Loaded and explored a real-world finance dataset
- ✅ Performed comprehensive data preprocessing and feature engineering
- ✅ Built and compared multiple classification models
- ✅ Evaluated model performance using various metrics
- ✅ Created a function to make predictions on new data
Key Insights:
- Data preprocessing is crucial for model performance
- Feature engineering can significantly improve predictions
- Different algorithms may perform differently on the same dataset
- Model evaluation should go beyond just accuracy
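To illustrate the last point, here is a small self-contained sketch (on synthetic imbalanced data, not the loan dataset) of metrics that complement accuracy. On imbalanced classes, precision, recall, and ROC AUC often tell a very different story than accuracy alone:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic 80/20 imbalanced data: accuracy can look good
# even when the minority class is predicted poorly
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba):.3f}")
```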
Model Performance Summary:
Our models achieved good accuracy in predicting loan approvals, with Random Forest typically performing slightly better due to its ability to handle non-linear relationships and feature interactions.
Next Steps and Further Learning
Immediate Improvements:
- Handle class imbalance using techniques like SMOTE
- Cross-validation for more robust model evaluation
- Feature selection to identify the most predictive variables
- Ensemble methods combining multiple models
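Cross-validation, for instance, takes only a few lines with scikit-learn. A sketch on synthetic stand-in data (in the tutorial you would pass `X` and `y` directly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; substitute the tutorial's X and y here
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# 5-fold cross-validation gives a more robust accuracy estimate
# than a single train/test split
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5, scoring='accuracy'
)
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```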
Advanced Techniques:
# Example: Handling class imbalance with SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
print(f"Original training set: {y_train.value_counts()}")
print(f"Balanced training set: {pd.Series(y_train_balanced).value_counts()}")
Learning Resources:
- Scikit-learn Documentation: Comprehensive guide to ML algorithms
- Kaggle Learn: Free micro-courses on machine learning
- Financial ML Books: "Advances in Financial Machine Learning" by Marcos López de Prado
- Practice Datasets: Explore more finance datasets on Kaggle
Real-World Applications:
- Credit scoring systems
- Insurance claim prediction
- Fraud detection
- Investment risk assessment
Deployment Considerations:
- Model interpretability for regulatory compliance
- Regular model retraining with new data
- A/B testing for model updates
- Monitoring for model drift
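Before any of these practices apply, the trained model (and its scaler and encoders) must be persisted so a serving application can load them. One common approach, sketched here with joblib and a throwaway model rather than the tutorial's objects:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a throwaway model just to demonstrate persistence
X, y = make_classification(n_samples=200, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Save the model; in the tutorial you would dump the scaler
# and label encoders alongside it
joblib.dump(model, 'loan_model.joblib')

# Load it back in the serving application
loaded = joblib.load('loan_model.joblib')
print("Predictions match:", (loaded.predict(X) == model.predict(X)).all())
```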
Keep practicing with different datasets and algorithms to strengthen your machine learning skills in finance. The field of financial ML is rapidly evolving, offering exciting opportunities for data scientists and analysts!
Remember: The key to mastering machine learning is consistent practice and staying curious about new techniques and applications. Happy coding! 🚀