Predict Customer Churn with Random Forest

Introduction

Customer churn—when customers stop using a service—is a critical challenge for telecommunication companies. Predicting churn allows businesses to proactively retain customers, reduce revenue loss, and improve customer satisfaction. In this tutorial, we’ll use Python and Random Forest, a powerful machine learning algorithm, to predict customer churn using the Kaggle Telco Churn dataset.

By the end of this guide, you’ll learn:

How to load and preprocess real-world telecom data.
How to train and evaluate a Random Forest classifier.
How to interpret feature importance to understand what drives churn.

Let’s get started!

Step 1: Load and Explore the Dataset

First, we’ll load the dataset and perform Exploratory Data Analysis (EDA) to understand its structure.

1.1 Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

1.2 Load the Dataset

The Kaggle Telco Churn dataset contains customer information such as tenure, contract type, and monthly charges.

# Load the dataset
data = pd.read_csv('telco_churn.csv')
print(data.head())

1.3 Basic Exploration

# Check dataset shape and missing values
print(f"Dataset shape: {data.shape}")
print("\nMissing values:\n", data.isnull().sum())

# Check data types
print("\nData types:\n", data.dtypes)

Key Observations:

The dataset has 7043 rows and 21 columns.
No missing values (if any, we’ll handle them later).
Mix of numerical (e.g., tenure, MonthlyCharges) and categorical (e.g., Contract, PaymentMethod) features.

Step 2: Data Preprocessing

2.1 Handle Categorical Variables

Convert categorical variables into numerical format using one-hot encoding.

# Drop 'customerID' (not useful for modeling)
data = data.drop('customerID', axis=1)

# Convert 'TotalCharges' to numeric (some values may be empty strings)
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# One-hot encode categorical variables
data = pd.get_dummies(data, drop_first=True)

2.2 Check for Missing Values

print("Missing values after conversion:\n", data.isnull().sum())

# Fill missing 'TotalCharges' with median (if any)
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())

2.3 Split Features and Target

X = data.drop('Churn_Yes', axis=1)  # Features
y = data['Churn_Yes']               # Target (1 = Churned, 0 = Not Churned)

2.4 Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 3: Train the Random Forest Model

3.1 Initialize and Fit the Model

# Initialize Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

3.2 Make Predictions

# Predict on test set
y_pred = rf_model.predict(X_test)

Step 4: Evaluate Model Performance

4.1 Accuracy and Classification Report

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Output Example:

Accuracy: 0.80

Classification Report:
              precision    recall  f1-score   support
           0       0.85      0.92      0.88      1406
           1       0.65      0.48      0.55       587
    accuracy                           0.80      1993
   macro avg       0.75      0.70      0.72      1993
weighted avg       0.79      0.80      0.79      1993

Interpretation:

The model has 80% accuracy.
Precision (65%) for churn prediction is lower than recall (48%), meaning it correctly identifies fewer churners but has fewer false positives.

4.2 Confusion Matrix

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Confusion Matrix

Key Insights:

True Negatives (TN): 1293 customers correctly predicted as non-churners.
False Positives (FP): 113 customers incorrectly predicted as churners.
False Negatives (FN): 306 churners missed by the model.

Step 5: Feature Importance Analysis

5.1 Extract Feature Importance

# Get feature importances
importances = rf_model.feature_importances_
feature_names = X.columns

# Create a DataFrame for visualization
feature_importance = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Plot top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Top 10 Features Driving Churn')
plt.show()

Feature Importance

Key Findings:

tenure (customer loyalty) is the most important predictor.
MonthlyCharges and Contract_Month-to-month also significantly impact churn.

Conclusion & Next Steps

Summary

We successfully built a Random Forest model to predict customer churn with 80% accuracy.
Key drivers of churn include tenure, monthly charges, and contract type.

Improvements

Hyperparameter Tuning: Use GridSearchCV to optimize n_estimators, max_depth, etc.
Feature Engineering: Create new features (e.g., ChurnRiskScore).
Imbalanced Data Handling: Apply SMOTE or class weights if churn is rare.

Further Learning

Explore XGBoost or Logistic Regression for comparison.
Deploy the model using Flask or Streamlit.

Ready to predict churn in your own dataset? Download the Kaggle Telco Churn dataset and follow this guide step-by-step!

Happy modeling! 🚀

Search This Blog

AI Mentor Lab

Predict Loan Default with Decision Trees