Predict Heart Disease with Logistic Regression


Introduction

Heart disease remains one of the leading causes of death worldwide. Early detection and risk assessment can significantly improve patient outcomes and reduce healthcare costs. In this tutorial, we’ll use machine learning, specifically logistic regression, to classify patients based on their risk of heart disease using a real-world dataset.

Why This Dataset?

We’ll use the Kaggle Heart Disease dataset, which contains medical attributes like age, cholesterol levels, blood pressure, and more. This dataset is ideal for beginners because:

  • It’s well-structured and clean.
  • Features are clinically relevant.
  • The target variable (heart disease presence) is binary, perfect for logistic regression.

What You’ll Learn

By the end of this tutorial, you’ll:

  1. Load and explore a healthcare dataset.
  2. Preprocess data for machine learning.
  3. Train a logistic regression model.
  4. Evaluate model performance using key metrics.
  5. Interpret results to predict heart disease risk.

Step 1: Set Up Your Environment

Before diving into the code, ensure you have the necessary tools installed:

  • Python 3.8+
  • Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn

Install them using pip:

pip install pandas numpy scikit-learn matplotlib seaborn

Step 2: Load and Explore the Dataset

Download the Dataset

Download the dataset from the Kaggle Heart Disease Dataset page. For this tutorial, we'll work with the file heart_disease_data.csv.

Load the Data

import pandas as pd

# Load the dataset
data = pd.read_csv('heart_disease_data.csv')
print(data.head())

Understand the Features

The dataset includes:

  • Age: Patient age in years.
  • Sex: Gender (1 = male, 0 = female).
  • Cholesterol: Serum cholesterol in mg/dl.
  • Blood Pressure (trestbps): Resting blood pressure.
  • Target: 1 = heart disease, 0 = no heart disease.
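Before going further, it's worth verifying that the columns listed above actually match the file you downloaded, since Kaggle mirrors of this dataset vary slightly in naming. A small sketch (using a toy frame in place of the real CSV) shows the checks:

```python
import pandas as pd

# Toy frame standing in for heart_disease_data.csv; column names assume
# the UCI-style naming used in this tutorial
sample = pd.DataFrame({
    'age':      [63, 37, 41],
    'sex':      [1, 1, 0],
    'chol':     [233, 250, 204],
    'trestbps': [145, 130, 130],
    'target':   [1, 1, 0],
})

# Confirm column names, dtypes, and basic ranges before modeling
print(sample.dtypes)
print(sample.describe())
```

With the real data, the same two calls (`data.dtypes` and `data.describe()`) quickly surface misnamed columns or implausible values.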

Exploratory Data Analysis (EDA)

Visualize the data to understand distributions and relationships.

Check for Missing Values

print(data.isnull().sum())

No missing values? Great! Proceed to visualization.
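If your copy of the dataset does contain gaps, a simple option is median imputation with scikit-learn's `SimpleImputer`. This is a sketch under the assumption that median filling is acceptable for these clinical measurements; the toy frame stands in for the real data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing entries, standing in for the real dataset
df = pd.DataFrame({'chol':     [233.0, np.nan, 204.0],
                   'trestbps': [145.0, 130.0, np.nan]})

# Median imputation is less sensitive to skewed clinical values than the mean
imputer = SimpleImputer(strategy='median')
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled.isnull().sum())
```

To avoid leakage, fit the imputer on the training split only and apply it to the test split, just as we do with the scaler later on.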

Plot Target Distribution

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='target', data=data)
plt.title('Distribution of Heart Disease Cases')
plt.show()

Observation: The dataset is balanced, with roughly equal cases of heart disease (1) and no heart disease (0).
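The countplot shows balance visually; `value_counts(normalize=True)` puts a number on it, which is worth recording because class balance affects which evaluation metrics you should trust. A minimal sketch with a toy target column:

```python
import pandas as pd

# Toy target column; with the real data this is
# data['target'].value_counts(normalize=True)
target = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

# Proportion of each class; values near 0.5 indicate a balanced dataset
proportions = target.value_counts(normalize=True)
print(proportions)
```

If the classes were heavily imbalanced, accuracy alone would be misleading and you would lean more on recall and precision.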

Correlation Heatmap

plt.figure(figsize=(10, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

Key Insight: Features like chol (cholesterol) and trestbps (blood pressure) show moderate correlation with the target.
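Reading a full heatmap can be fiddly; pulling out just the target column of the correlation matrix and sorting by magnitude gives the same insight as a ranked list. A sketch on a toy numeric frame (with the real data, this is `data.corr()['target']`):

```python
import pandas as pd

# Toy numeric frame; illustrative values, not real patient data
df = pd.DataFrame({
    'age':    [63, 37, 41, 56, 57],
    'chol':   [233, 250, 204, 236, 354],
    'target': [1, 0, 0, 1, 1],
})

# Correlation of each feature with the target, strongest first by magnitude
corr_with_target = (df.corr()['target']
                      .drop('target')
                      .sort_values(key=abs, ascending=False))
print(corr_with_target)
```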


Step 3: Preprocess the Data

Split Features and Target

X = data.drop('target', axis=1)  # Features
y = data['target']               # Target variable

Train-Test Split

Divide the data into training (80%) and testing (20%) sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # preserve the class balance in both splits
)

Feature Scaling

Logistic regression benefits from scaled features. Use StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 4: Train the Logistic Regression Model

Initialize and Fit the Model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # Increase iterations for convergence
model.fit(X_train_scaled, y_train)

Predict on Test Data

y_pred = model.predict(X_test_scaled)

Step 5: Evaluate Model Performance

Confusion Matrix

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Interpretation:

  • True Positives (TP): Correctly predicted heart disease.
  • False Negatives (FN): Missed heart disease cases (critical in healthcare!).
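Because false negatives are the costly error here, one common lever is the decision threshold: `predict` uses a 0.5 cutoff on the predicted probability, but lowering it trades some false positives for fewer missed cases. A sketch with synthetic labels and probabilities (standing in for `y_test` and `model.predict_proba(X_test_scaled)[:, 1]`):

```python
import numpy as np
from sklearn.metrics import recall_score

# Synthetic ground truth and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.4, 0.45, 0.2, 0.1])

# The default 0.5 cutoff misses the 0.4 case; a lower cutoff catches it
for threshold in (0.5, 0.35):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold, recall_score(y_true, y_pred))
```

The right threshold depends on the relative cost of false alarms versus missed diagnoses, which is a clinical decision as much as a statistical one.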

Classification Report

print(classification_report(y_test, y_pred))

Key Metrics:

  • Precision: % of predicted positives that are correct.
  • Recall: % of actual positives correctly predicted.
  • F1-Score: Balance between precision and recall.
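These three metrics come directly from the confusion matrix counts, and computing them by hand once makes the classification report easier to read. The counts below are illustrative, not results from the real model:

```python
# Hypothetical confusion-matrix counts (illustrative only)
tp, fp, fn, tn = 40, 5, 8, 47

precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
```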

ROC Curve

from sklearn.metrics import roc_curve, auc

y_prob = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC Score: The closer to 1.0, the better the model separates the two classes; a score above roughly 0.85 is generally considered strong for a screening task like this.


Step 6: Interpret the Model

Feature Importance

Logistic regression coefficients indicate feature impact:

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', ascending=False)

sns.barplot(x='Coefficient', y='Feature', data=feature_importance)
plt.title('Feature Importance')
plt.show()

Insight: Features like chol and age have the highest positive coefficients, meaning higher values are associated with increased predicted heart disease risk. (Association, not causation: logistic regression coefficients describe correlations in this dataset.)
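Because we standardized the features, each coefficient describes the effect of a one-standard-deviation increase, and exponentiating it gives an odds ratio, which is often easier to communicate. The coefficient values below are illustrative; with the real model they come from `model.coef_[0]`:

```python
import numpy as np
import pandas as pd

# Illustrative coefficients per standard deviation (not the fitted values)
coefs = pd.Series({'age': 0.6, 'chol': 0.4, 'sex': -0.3})

# exp(coef) = multiplicative change in the odds of heart disease
# per 1-SD increase in the feature
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the feature raises the predicted odds; below 1 means it lowers them.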


Conclusion & Next Steps

What You Achieved

  • Built a logistic regression model to predict heart disease.
  • Learned data preprocessing, model training, and evaluation.
  • Interpreted results using confusion matrices, ROC curves, and feature importance.

Improvements & Next Steps

  1. Try Other Models: Compare with Random Forest or SVM.
  2. Hyperparameter Tuning: Use GridSearchCV to optimize the model.
  3. Feature Engineering: Create new features (e.g., BMI from height/weight).
  4. Deploy the Model: Use Flask/FastAPI to build a web app for predictions.
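As a starting point for step 2, here is a minimal `GridSearchCV` sketch over the regularization strength `C` (a real `LogisticRegression` parameter), using synthetic data in place of the scaled training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train_scaled, y_train
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Search over regularization strength C with 5-fold cross-validation;
# scoring='recall' because missing a disease case is the costly error
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring='recall',
)
grid.fit(X, y)
print(grid.best_params_)
```

With the real data, fit the search on `X_train_scaled, y_train` and evaluate `grid.best_estimator_` on the held-out test set.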

Ready to predict heart disease risk? Share your results or questions in the comments! 🚀
