How to Create a Real Estate Price Prediction Model: A Step-by-Step Guide Using Python


Introduction

Predicting real estate prices is a critical task in the housing market, enabling buyers, sellers, and investors to make informed decisions. Accurate price prediction models help identify fair market values, detect overpriced or undervalued properties, and optimize investment strategies.

In this tutorial, you will learn how to build a real estate price prediction model using Python, Pandas, and Machine Learning. We’ll use the popular Kaggle House Prices dataset, which contains detailed information about residential properties in Ames, Iowa. This dataset is ideal for beginners due to its clean structure and comprehensive feature set, including property size, age, location, and amenities.

By the end of this guide, you will:

  • Load and explore a real-world dataset.
  • Perform Exploratory Data Analysis (EDA) to understand key factors affecting house prices.
  • Preprocess data for machine learning.
  • Train and evaluate a regression model using Scikit-learn.
  • Interpret results and identify areas for improvement.

Let’s get started!


Prerequisites

Before we begin, ensure you have the following installed:

  • Python 3.8+
  • Jupyter Notebook or any Python IDE (e.g., VS Code, PyCharm)
  • Required libraries: pandas, numpy, matplotlib, seaborn, scikit-learn

You can install the necessary packages using pip:

pip install pandas numpy matplotlib seaborn scikit-learn

Step 1: Load and Explore the Dataset

Download the Dataset

Download the House Prices: Advanced Regression Techniques dataset from Kaggle. For this tutorial, we’ll use the train.csv file.

Import Libraries and Load Data

# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('train.csv')
print(df.head())

Output:

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0   1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub  ...         0   NaN   NaN         NaN       0      2   2008       WD        Normal     208500
1   2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub  ...         0   NaN   NaN         NaN       0      5   2007       WD        Normal     181500
2   3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub  ...         0   NaN   NaN         NaN       0      9   2008       WD        Normal     223500
3   4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub  ...         0   NaN   NaN         NaN       0      2   2006       WD       Abnorml     140000
4   5          60       RL         84.0    14260   Pave   NaN      IR1         Lvl    AllPub  ...         0   NaN   NaN         NaN       0     12   2008       WD        Normal     250000

Understand the Dataset Structure

# Check dataset shape and basic info
print(f"Dataset shape: {df.shape}")
print("\nData types and missing values:")
print(df.info())

Key Observations:

  • The dataset has 1460 rows and 81 columns.
  • The target variable is SalePrice.
  • Features include numerical (e.g., LotArea, BedroomAbvGr) and categorical (e.g., MSZoning, Neighborhood) variables.
  • Some columns have missing values (e.g., Alley, FireplaceQu).

Step 2: Exploratory Data Analysis (EDA)

2.1. Analyze the Target Variable: SalePrice

# Plot the distribution of SalePrice
plt.figure(figsize=(10, 6))
sns.histplot(df['SalePrice'], kde=True, bins=30)
plt.title('Distribution of SalePrice')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()

SalePrice Distribution

Interpretation:

  • The distribution is right-skewed: most houses sell for under $300,000, with a long tail of higher prices.
  • A few high-value outliers sit above $500,000.
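The skew can be quantified with Pandas' `skew()`, and a log transform (revisited in the next steps) largely removes it. A minimal sketch, using synthetic right-skewed prices as a stand-in for `df['SalePrice']`:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices as a stand-in for df['SalePrice']
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1460))

# Positive skew = long right tail; log1p compresses the tail
print(f"Skewness before log transform: {prices.skew():.2f}")
print(f"Skewness after log transform:  {np.log1p(prices).skew():.2f}")
```

On the actual dataset, `np.log1p(df['SalePrice'])` tends to bring the skewness close to zero, which helps many regression models.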

2.2. Check for Missing Values

# Visualize missing values
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

Missing Values Heatmap

Key Insights:

  • Columns like Alley, FireplaceQu, PoolQC, and Fence have many missing values.
  • Some missing values may indicate absence (e.g., no pool or fence), which can be filled as "None".
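Before filling anything, it helps to rank columns by how many values they are missing. A small sketch on a toy frame; the same `isnull().sum()` pattern applies directly to `df`:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few Ames columns with gaps (stand-in for train.csv)
toy = pd.DataFrame({
    'Alley':   [np.nan, 'Grvl', np.nan, np.nan],
    'Fence':   ['MnPrv', np.nan, np.nan, 'GdWo'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count and rank missing values per column, most-missing first
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])
```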

2.3. Correlation Analysis

# Select numerical features and compute correlation with SalePrice
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
corr_matrix = df[numerical_features].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

Correlation Heatmap

Key Findings:

  • OverallQual, GrLivArea, and GarageCars show strong positive correlation with SalePrice.
  • YearBuilt and TotalBsmtSF are also moderately correlated.
  • Id and MSSubClass have low correlation and may be less important.
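The heatmap gives an overview, but with roughly 38 numerical columns the annotations get crowded. Ranking correlations with the target directly is often easier to read; a sketch on synthetic data standing in for the housing features:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: quality dominates price, area contributes less
rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, n)            # stand-in for OverallQual
area = rng.normal(1500, 300, n)             # stand-in for GrLivArea
price = 20000 * quality + 80 * area + rng.normal(0, 10000, n)

toy = pd.DataFrame({'OverallQual': quality, 'GrLivArea': area, 'SalePrice': price})

# Rank features by their correlation with the target
top_corr = toy.corr()['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(top_corr)
```

On the real `df`, the same one-liner on `corr_matrix['SalePrice']` surfaces the strongest predictors without reading the full heatmap.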

Step 3: Data Preprocessing

3.1. Handle Missing Values

# Fill missing values for categorical features
# Fill missing categorical values with 'None' (here, missing means the feature is absent)
for col in ['Alley', 'FireplaceQu', 'PoolQC', 'Fence']:
    df[col] = df[col].fillna('None')

# Fill numerical missing values with the median
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

# Drop columns with excessive missing values (e.g., MiscFeature)
df = df.drop(columns=['MiscFeature'])

3.2. Feature Engineering

# Create a new feature: Total square footage
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# Drop redundant columns
df.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1, inplace=True)

3.3. Encode Categorical Variables

from sklearn.preprocessing import LabelEncoder

# Encode categorical features
categorical_cols = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()

for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col].astype(str))
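One caveat: `LabelEncoder` assigns arbitrary integer codes, which implies an ordering that tree-based models tolerate but linear models can misinterpret. One-hot encoding with `pd.get_dummies` is a common alternative; a minimal sketch:

```python
import pandas as pd

# Toy column standing in for a categorical feature like MSZoning
toy = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL', 'FV']})

# One binary column per category, no implied ordering
encoded = pd.get_dummies(toy, columns=['MSZoning'], prefix='MSZoning')
print(encoded.columns.tolist())
```

The trade-off is dimensionality: high-cardinality columns like Neighborhood expand into many binary columns, which is one reason this tutorial sticks with label encoding for a tree-based model.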

3.4. Split Data into Features and Target

# Define features (X) and target (y)
X = df.drop(['SalePrice', 'Id'], axis=1)
y = df['SalePrice']

# Split into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train a Machine Learning Model

4.1. Initialize and Train a Regression Model

We’ll use Random Forest Regressor, a robust algorithm for regression tasks.

from sklearn.ensemble import RandomForestRegressor

# Initialize the model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

4.2. Make Predictions

# Predict on test set
y_pred = model.predict(X_test)

Step 5: Evaluate Model Performance

5.1. Calculate Evaluation Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

Sample Output:

Mean Absolute Error (MAE): 18500.23
Mean Squared Error (MSE): 520000524.39
Root Mean Squared Error (RMSE): 22803.52
R-squared (R²): 0.89

Interpretation:

  • An R² of 0.89 indicates the model explains 89% of the variance in house prices on the test set.
  • An RMSE of ~$22,800 means a typical prediction misses the true price by roughly $22,800, with large errors weighted more heavily than in MAE.
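The difference between MAE and RMSE is worth a quick illustration: RMSE squares errors before averaging, so a few large misses inflate it even when MAE is unchanged. A small numeric sketch:

```python
import numpy as np

# Two toy error sets with the same MAE but different large-error behavior
small_errors = np.array([10.0, 10, 10, 10])
mixed_errors = np.array([1.0, 1, 1, 37])   # same mean absolute error (10)

for name, e in [('uniform', small_errors), ('one large', mixed_errors)]:
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f"{name}: MAE={mae:.1f}, RMSE={rmse:.2f}")
```

This is why RMSE exceeding MAE (as in the sample output above) signals that some predictions miss by much more than the typical amount.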

5.2. Feature Importance

# Plot feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Top 10 Important Features')
plt.show()

Feature Importance

Key Takeaways:

  • OverallQual, GrLivArea, and TotalSF are the most influential features.
  • GarageCars and YearBuilt also contribute significantly.

Conclusion and Next Steps

Summary

In this tutorial, you successfully:

  • Loaded and explored a real estate dataset.
  • Performed Exploratory Data Analysis (EDA) to identify trends and correlations.
  • Preprocessed data by handling missing values and encoding categorical variables.
  • Trained a Random Forest Regression model to predict house prices.
  • Evaluated model performance using MAE, RMSE, and R².

Next Steps to Improve the Model

  1. Feature Selection: Remove low-importance features to reduce noise.
  2. Hyperparameter Tuning: Use GridSearchCV to optimize model parameters.
  3. Try Other Models: Experiment with Gradient Boosting (XGBoost, LightGBM) or Linear Regression.
  4. Advanced Techniques: Use log transformation on SalePrice to handle skewness.
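As a sketch of step 2, here is what a `GridSearchCV` run over a couple of Random Forest parameters could look like. The synthetic data and the grid values are placeholders for illustration, not tuned recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem as a stand-in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)

# Illustrative grid over two Random Forest hyperparameters
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

On the housing data, you would pass `X_train`/`y_train` instead and widen the grid; each added parameter value multiplies the number of cross-validated fits.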

By following this guide, you’ve built a real estate price prediction model from scratch. With practice and experimentation, you can refine your skills and develop more accurate models for real-world applications.

Happy coding! 🚀
