Predict Customer Churn with Random Forest
Introduction
Customer churn—when customers stop using a service—is a critical challenge for telecommunication companies. Predicting churn allows businesses to proactively retain customers, reduce revenue loss, and improve customer satisfaction. In this tutorial, we’ll use Python and Random Forest, a powerful machine learning algorithm, to predict customer churn using the Kaggle Telco Churn dataset.
By the end of this guide, you’ll learn:
- How to load and preprocess real-world telecom data.
- How to train and evaluate a Random Forest classifier.
- How to interpret feature importance to understand what drives churn.
Let’s get started!
Step 1: Load and Explore the Dataset
First, we’ll load the dataset and perform Exploratory Data Analysis (EDA) to understand its structure.
1.1 Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
1.2 Load the Dataset
The Kaggle Telco Churn dataset contains customer information such as tenure, contract type, and monthly charges.
# Load the dataset
data = pd.read_csv('telco_churn.csv')
print(data.head())
1.3 Basic Exploration
# Check dataset shape and missing values
print(f"Dataset shape: {data.shape}")
print("\nMissing values:\n", data.isnull().sum())
# Check data types
print("\nData types:\n", data.dtypes)
Key Observations:
- The dataset has 7043 rows and 21 columns.
- No missing values are reported at first glance, although TotalCharges is stored as text and hides blank entries (handled in Step 2).
- Mix of numerical (e.g., tenure, MonthlyCharges) and categorical (e.g., Contract, PaymentMethod) features.
Step 2: Data Preprocessing
2.1 Handle Categorical Variables
Convert categorical variables into numerical format using one-hot encoding.
# Drop 'customerID' (not useful for modeling)
data = data.drop('customerID', axis=1)
# Convert 'TotalCharges' to numeric (some values may be empty strings)
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
# One-hot encode categorical variables
data = pd.get_dummies(data, drop_first=True)
2.2 Check for Missing Values
print("Missing values after conversion:\n", data.isnull().sum())
# Fill missing 'TotalCharges' with median (if any)
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
2.3 Split Features and Target
X = data.drop('Churn_Yes', axis=1) # Features
y = data['Churn_Yes'] # Target (1 = Churned, 0 = Not Churned)
2.4 Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
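Because only about a quarter of Telco customers churn, passing stratify=y keeps the churn rate consistent between the train and test sets. A minimal sketch, using synthetic labels in place of the real churn column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with ~27% positives, standing in for the Telco churn column
rng = np.random.default_rng(42)
y = (rng.random(1000) < 0.27).astype(int)
X = rng.normal(size=(1000, 5))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # preserve the class ratio
)
print(f"Train churn rate: {y_train.mean():.2f}, Test churn rate: {y_test.mean():.2f}")
```

Without stratification, an unlucky split can leave the test set with noticeably fewer churners than the training set, which skews the evaluation metrics.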
Step 3: Train the Random Forest Model
3.1 Initialize and Fit the Model
# Initialize Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf_model.fit(X_train, y_train)
3.2 Make Predictions
# Predict on test set
y_pred = rf_model.predict(X_test)
Step 4: Evaluate Model Performance
4.1 Accuracy and Classification Report
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Output Example:
Accuracy: 0.80
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.88      1406
           1       0.65      0.48      0.55       587

    accuracy                           0.80      1993
   macro avg       0.75      0.70      0.72      1993
weighted avg       0.79      0.80      0.79      1993
Interpretation:
- The model reaches 80% accuracy, though accuracy alone is flattering here because non-churners outnumber churners roughly 3:1.
- For the churn class, precision (0.65) is higher than recall (0.48): when the model flags a churner it is right about two-thirds of the time, but it misses more than half of the actual churners.
4.2 Confusion Matrix
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Key Insights:
- True Negatives (TN): 1293 customers correctly predicted as non-churners.
- False Positives (FP): 113 customers incorrectly predicted as churners.
- False Negatives (FN): 306 churners missed by the model.
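Because the model misses many churners, one common remedy is to lower the decision threshold below the default 0.5 using predict_proba, trading some precision for recall. A sketch of the idea, using a synthetic imbalanced dataset in place of the Telco data (the 0.35 threshold is an illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

# Synthetic imbalanced data (~27% positives) standing in for the Telco features
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Compare the default 0.5 threshold against a lower, recall-friendly one
proba = rf.predict_proba(X_te)[:, 1]
pred_default = (proba >= 0.5).astype(int)
pred_low = (proba >= 0.35).astype(int)

print(f"Recall @0.50: {recall_score(y_te, pred_default):.2f}, "
      f"precision @0.50: {precision_score(y_te, pred_default):.2f}")
print(f"Recall @0.35: {recall_score(y_te, pred_low):.2f}, "
      f"precision @0.35: {precision_score(y_te, pred_low):.2f}")
```

In a retention campaign, catching an extra churner is often worth a few more false alarms, so the "right" threshold depends on the relative cost of each error.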
Step 5: Feature Importance Analysis
5.1 Extract Feature Importance
# Get feature importances
importances = rf_model.feature_importances_
feature_names = X.columns
# Create a DataFrame for visualization
feature_importance = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance = feature_importance.sort_values('Importance', ascending=False)
# Plot top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Top 10 Features Driving Churn')
plt.show()

Key Findings:
- tenure (customer loyalty) is the most important predictor.
- MonthlyCharges and Contract_Month-to-month also significantly impact churn.
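One caveat: the impurity-based importances used above can be biased toward high-cardinality numeric features such as tenure and MonthlyCharges. Permutation importance, computed on held-out data, is a common cross-check. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with a handful of genuinely informative features
X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Permutation importance: drop in test score when each feature is shuffled
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

If the two rankings broadly agree, as they typically do for the Telco data, you can be more confident in the reported drivers of churn.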
Conclusion & Next Steps
Summary
- We successfully built a Random Forest model to predict customer churn with 80% accuracy.
- Key drivers of churn include tenure, monthly charges, and contract type.
Improvements
- Hyperparameter Tuning: Use GridSearchCV to optimize n_estimators, max_depth, etc.
- Feature Engineering: Create new features (e.g., ChurnRiskScore).
- Imbalanced Data Handling: Apply SMOTE or class weights if churn is rare.
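The first and third improvements can be combined: GridSearchCV can search over class_weight alongside the usual tree parameters. A minimal sketch on synthetic data, with a deliberately tiny grid (expand the ranges for a real search):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the preprocessed Telco features
X, y = make_classification(n_samples=500, weights=[0.73, 0.27], random_state=42)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, None],
    'class_weight': [None, 'balanced'],  # 'balanced' upweights the rare churn class
}

# F1 is a better selection metric than accuracy when churners are the minority
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3, scoring='f1')
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV F1: {grid.best_score_:.2f}")
```

Scoring on F1 rather than accuracy keeps the search from favoring models that simply predict "no churn" for everyone.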
Further Learning
- Explore XGBoost or Logistic Regression for comparison.
- Deploy the model using Flask or Streamlit.
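As a starting point for the model comparison, cross-validation puts Random Forest and Logistic Regression on an equal footing. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Telco features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

for name, model in [("RandomForest", RandomForestClassifier(n_estimators=100, random_state=42)),
                    ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

Logistic Regression often comes surprisingly close on this kind of tabular data, and its coefficients are easier to explain to stakeholders.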
Ready to predict churn in your own dataset? Download the Kaggle Telco Churn dataset and follow this guide step-by-step!
Happy modeling! 🚀