Predict Heart Disease with Logistic Regression
Introduction
Heart disease remains one of the leading causes of death worldwide. Early detection and risk assessment can significantly improve patient outcomes and reduce healthcare costs. In this tutorial, we’ll use machine learning, specifically logistic regression, to classify patients based on their risk of heart disease using a real-world dataset.
Why This Dataset?
We’ll use the Kaggle Heart Disease dataset, which contains medical attributes like age, cholesterol levels, blood pressure, and more. This dataset is ideal for beginners because:
- It’s well-structured and clean.
- Features are clinically relevant.
- The target variable (heart disease presence) is binary, perfect for logistic regression.
What You’ll Learn
By the end of this tutorial, you’ll:
- Load and explore a healthcare dataset.
- Preprocess data for machine learning.
- Train a logistic regression model.
- Evaluate model performance using key metrics.
- Interpret results to predict heart disease risk.
Step 1: Set Up Your Environment
Before diving into the code, ensure you have the necessary tools installed:
- Python 3.8+
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Install them using pip:
pip install pandas numpy scikit-learn matplotlib seaborn
Step 2: Load and Explore the Dataset
Download the Dataset
Get the dataset from the Kaggle Heart Disease Dataset page. For this tutorial, we’ll use heart_disease_data.csv.
Load the Data
import pandas as pd
# Load the dataset
data = pd.read_csv('heart_disease_data.csv')
print(data.head())
Understand the Features
The dataset includes:
- Age: Patient age in years.
- Sex: Gender (1 = male, 0 = female).
- Cholesterol: Serum cholesterol in mg/dl.
- Blood Pressure (trestbps): Resting blood pressure.
- Target: 1 = heart disease, 0 = no heart disease.
Exploratory Data Analysis (EDA)
Visualize the data to understand distributions and relationships.
Check for Missing Values
print(data.isnull().sum())
No missing values? Great! Proceed to visualization.
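If your copy of the data does contain gaps, a simple median imputation keeps the rows usable. Here's a minimal sketch on a hypothetical two-column frame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing cholesterol reading
df = pd.DataFrame({'age': [52, 61, 45], 'chol': [230.0, None, 198.0]})

# Fill numeric gaps with the column median (robust to outliers)
df['chol'] = df['chol'].fillna(df['chol'].median())
print(df['chol'].isnull().sum())  # 0
```

Median imputation is a reasonable default for skewed clinical measurements like cholesterol, where a few extreme values would pull the mean.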
Plot Target Distribution
import matplotlib.pyplot as plt
import seaborn as sns
sns.countplot(x='target', data=data)
plt.title('Distribution of Heart Disease Cases')
plt.show()

Observation: The dataset is balanced, with roughly equal cases of heart disease (1) and no heart disease (0).
Correlation Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

Key Insight: Features like chol (cholesterol) and trestbps (blood pressure) show moderate correlation with the target.
Step 3: Preprocess the Data
Split Features and Target
X = data.drop('target', axis=1) # Features
y = data['target'] # Target variable
Train-Test Split
Divide the data into training (80%) and testing (20%) sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Feature Scaling
Logistic regression benefits from scaled features. Use StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Step 4: Train the Logistic Regression Model
Initialize and Fit the Model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000) # Increase iterations for convergence
model.fit(X_train_scaled, y_train)
Predict on Test Data
y_pred = model.predict(X_test_scaled)
Step 5: Evaluate Model Performance
Confusion Matrix
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Interpretation:
- True Positives (TP): Correctly predicted heart disease.
- False Negatives (FN): Missed heart disease cases (critical in healthcare!).
Classification Report
print(classification_report(y_test, y_pred))
Key Metrics:
- Precision: % of predicted positives that are correct.
- Recall: % of actual positives correctly predicted.
- F1-Score: Balance between precision and recall.
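To make these definitions concrete, here's the arithmetic on a hypothetical confusion matrix (the counts are invented for illustration, not taken from this model's output):

```python
# Hypothetical counts: TP = 40, FP = 10, FN = 5
tp, fp, fn = 40, 10, 5

precision = tp / (tp + fp)                         # of predicted positives, how many are right
recall = tp / (tp + fn)                            # of actual positives, how many we caught
f1 = 2 * precision * recall / (precision + recall) # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In a screening context, recall usually matters most: a false negative is a missed patient, while a false positive only triggers further testing.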
ROC Curve
from sklearn.metrics import roc_curve, auc
y_prob = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC Score: Closer to 1.0 means better performance. A score of 0.85+ is excellent for this use case.
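A single train/test split can make the AUC look better or worse than it really is. Cross-validation averages the score over several splits for a more stable estimate. Here's a sketch using synthetic data from `make_classification` (a stand-in for the heart disease dataset, so the numbers won't match yours):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for the real data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Pipeline ensures the scaler is re-fit inside each fold (no data leakage)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(scores.mean().round(3))
```

Wrapping the scaler and model in a pipeline is important: fitting the scaler on the full dataset before splitting would leak test information into training.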
Step 6: Interpret the Model
Feature Importance
Logistic regression coefficients indicate feature impact:
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', ascending=False)
sns.barplot(x='Coefficient', y='Feature', data=feature_importance)
plt.title('Feature Importance')
plt.show()

Insight: Features like chol and age have the highest positive coefficients, meaning they increase heart disease risk.
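One caveat: raw logistic regression coefficients are on the log-odds scale. Exponentiating a coefficient gives an odds ratio, which is easier to communicate: the multiplicative change in the odds of heart disease per one-unit increase in that (scaled) feature. A sketch on synthetic data, since the exact values depend on your fitted model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the first feature drives the label, the second is noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# exp(coefficient) = odds ratio per one-unit increase in the feature
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)  # first ratio well above 1, second near 1
```

An odds ratio of 2.0, for example, means a one-unit increase in that feature doubles the odds of the positive class, holding the other features fixed.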
Conclusion & Next Steps
What You Achieved
- Built a logistic regression model to predict heart disease.
- Learned data preprocessing, model training, and evaluation.
- Interpreted results using confusion matrices, ROC curves, and feature importance.
Improvements & Next Steps
- Try Other Models: Compare with Random Forest or SVM.
- Hyperparameter Tuning: Use GridSearchCV to optimize the model.
- Feature Engineering: Create new features (e.g., BMI from height/weight).
- Deploy the Model: Use Flask/FastAPI to build a web app for predictions.
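As a starting point for the tuning step, here's a minimal GridSearchCV sketch that searches over the regularization strength `C`. It runs on synthetic data from `make_classification`; swap in your own `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with the scaled heart disease features
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Search over C: smaller values mean stronger regularization
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring='roc_auc',
)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` then gives you a model refit on all the data with the winning hyperparameters, ready for the evaluation steps above.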
Further Learning
Ready to predict heart disease risk? Share your results or questions in the comments! 🚀