Data Preprocessing & Feature Engineering for Machine Learning - Predicting House Prices in California

This project comes from "The Complete Pandas Bootcamp 2023 - Data Science with Python", a course offered by Udemy and taught by Alexander Hagmann.

The dataset for this project comes from the instructor.

In this project, we are going to predict house prices for districts in California by training and testing an appropriate machine learning model on a dataset containing information on more than 20,000 districts in California. We are going to use the Random Forest Regressor model.

We plan to visualize California on a map where each point represents a district. The median housing prices in these districts will be illustrated using a color gradient.
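A minimal sketch of such a map, assuming the dataset exposes `longitude`, `latitude`, and `median_house_value` columns (these names are assumptions, and the data below is synthetic rather than the instructor's file). The helper `plot_price_map` is hypothetical and only illustrates the idea of a location scatter plot with a color gradient.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (column names are assumptions).
rng = np.random.default_rng(0)
districts = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, 500),
    "latitude": rng.uniform(32.5, 42.0, 500),
    "median_house_value": rng.uniform(15_000, 500_000, 500),
})


def plot_price_map(df):
    """Scatter districts by location, colored by median house value."""
    fig, ax = plt.subplots(figsize=(8, 7))
    points = ax.scatter(
        df["longitude"],
        df["latitude"],
        c=df["median_house_value"],  # the color gradient encodes the price
        cmap="coolwarm",
        s=10,
        alpha=0.6,
    )
    fig.colorbar(points, ax=ax, label="Median house value")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    return ax


ax = plot_price_map(districts)
```

With the real data, the same call would take the loaded DataFrame instead of `districts`.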

We will try to understand why some districts are more expensive than others. We will try to find the most important features that determine house prices, for instance, location or median income.
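One way to rank features is the `feature_importances_` attribute of a fitted random forest. The sketch below uses synthetic data in which the target is driven by income by construction, so it demonstrates the mechanics only and says nothing about the real dataset. The column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (assumed column names, made-up values).
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "latitude": rng.uniform(32.5, 42.0, n),
    "housing_median_age": rng.uniform(1, 52, n),
})
# The synthetic target depends only on income plus noise.
y = 40_000 * X["median_income"] + rng.normal(0, 10_000, n)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances sum to 1; sort them for a quick ranking.
importances = (
    pd.Series(forest.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)
```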

Finally, we are going to train and test a machine learning model with scikit-learn that allows us to predict prices for districts where we don't have any information on house prices.
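The train/test workflow can be sketched roughly as follows, again on synthetic data with assumed column names. The hyperparameters (`n_estimators=100`, a 20% test split) are illustrative choices, not the settings used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed column names, made-up values).
rng = np.random.default_rng(42)
n = 1_000
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "housing_median_age": rng.integers(1, 53, n),
})
y = 40_000 * X["median_income"] + rng.normal(0, 20_000, n)

# Hold out 20% of the districts so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE penalizes large errors more heavily than MAE does.
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print(f"RMSE: {rmse:,.0f}  MAE: {mae:,.0f}")
```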

The primary emphasis will be on:

- Importing and inspecting the dataset to make sure it is ready for analysis.
- Conducting Exploratory Data Analysis with Pandas and Seaborn to understand the data, select and create appropriate features, and choose an appropriate model.
- Pre-processing the data and engineering features with Pandas to prepare the data for modeling.
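The Pandas side of these steps might look like the sketch below. The tiny DataFrame is made-up toy data, the column names are assumptions, and `rooms_per_household` is a hypothetical engineered feature chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (all values are made up).
raw = pd.DataFrame({
    "total_rooms": [1000.0, 2000.0, 1500.0, 1200.0],
    "total_bedrooms": [200.0, 450.0, np.nan, 300.0],
    "households": [250.0, 400.0, 300.0, 200.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "median_house_value": [450_000.0, 350_000.0, 200_000.0, 150_000.0],
})

# Inspection: count missing values per column.
missing = raw.isna().sum()

# Cleaning: drop rows that contain missing values.
clean = raw.dropna().copy()

# Feature engineering: a ratio is often more informative than raw totals.
clean["rooms_per_household"] = clean["total_rooms"] / clean["households"]

# Encode the categorical column as indicator columns for the model.
features = pd.get_dummies(
    clean.drop(columns="median_house_value"), columns=["ocean_proximity"]
)
target = clean["median_house_value"]
```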

While integrating these steps into a scikit-learn pre-processing pipeline is common for ongoing machine learning operations, for a one-off project like ours, employing straightforward Pandas code for data manipulation and preparation is equally effective.
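For comparison, a scikit-learn pipeline version could look roughly like this. It is a sketch under assumptions: the column names are assumed, the data is synthetic, and missing values are imputed with the median here rather than dropped, because dropping rows does not fit naturally into a transformer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["median_income", "total_bedrooms"]  # assumed names
categorical_cols = ["ocean_proximity"]  # assumed name

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Synthetic data with a few missing values to exercise the imputer.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "total_bedrooms": rng.uniform(100, 1000, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], n),
})
X.loc[X.index % 25 == 0, "total_bedrooms"] = np.nan
y = 40_000 * X["median_income"] + rng.normal(0, 20_000, n)

# Preprocessing and model are fitted together in one call.
pipeline.fit(X, y)
predictions = pipeline.predict(X.head(5))
```

The pipeline re-applies the same fitted preprocessing to any new data passed to `predict`, which matters for repeated use and less so for a one-off analysis.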

View this notebook with interactive links on nbviewer

(1) Project Introduction
(2) Data Import and Inspection
- (2.1) Initial Overview and Some Cleaning Recommendations
(3) Data Cleaning
- (3.1) Dropping Missing Values
- (3.2) Evaluating the Impact of Capped Values
(4) Feature Engineering - Part 1
(5) Exploratory Data Analysis
(6) Preprocessing for Machine Learning - Part 2
(7) Splitting the Data into Train and Test Sets
- (7.1) Importance of Train/Test Split
(8) Training the ML Model (Random Forest Regressor)
- (8.1) Criteria for Using Random Forest Regressor
- (8.2) Training Process
(9) Testing/Evaluating the Model on the Test Set
- (9.1) Calculating RMSE on the Test Set
- (9.2) Mean Absolute Error Analysis
(10) Feature Importance

Pacode74/CA-HousePrice-Predictor
