This project comes from "The Complete Pandas Bootcamp 2023 - Data Science with Python", a course offered by Udemy and taught by Alexander Hagmann.
The dataset for this project comes from the instructor.
In this project, we are going to predict house prices for districts in California by training and testing an appropriate machine learning model on a dataset containing information on more than 20,000 districts in California. We are going to use the Random Forest Regressor model.
We plan to visualize California on a map where each point represents a district. The median housing prices in these districts will be illustrated using a color gradient.
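As a rough sketch of what such a map could look like, the snippet below plots randomly generated stand-in data rather than the real dataset. The column names `longitude` and `latitude` are assumptions made for illustration, and the helper `plot_price_map` is hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real dataset. The column names "longitude"
# and "latitude" are assumptions; the values are random, not real prices.
rng = np.random.default_rng(0)
n = 500
districts = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, n),
    "latitude": rng.uniform(32.5, 42.0, n),
    "median_house_value": rng.uniform(15_000, 500_000, n),
})


def plot_price_map(df):
    """Draw one point per district, colored by median house value."""
    fig, ax = plt.subplots(figsize=(8, 7))
    points = ax.scatter(
        df["longitude"],
        df["latitude"],
        c=df["median_house_value"],  # the color gradient encodes the price
        cmap="viridis",
        s=10,
        alpha=0.6,
    )
    fig.colorbar(points, ax=ax, label="median_house_value")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.set_title("Median house value by district")
    return ax


ax = plot_price_map(districts)
```

With the real data, the coastline and the large metropolitan areas would be expected to stand out, which the random points here cannot show.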
We will try to understand why some districts are more expensive than others. We will try to find the most important features that determine house prices, for instance, location or median income.
Finally, we are going to train and test a machine learning model with scikit-learn that allows us to predict prices for districts where we don't have any information on house prices.
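The train/test workflow and the feature-importance question can be sketched in a few lines. This is a minimal illustration on synthetic data, not the notebook's actual code: the feature names, the artificial target, the 80/20 split, and the hyperparameters are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features; the column names are assumed for illustration only.
rng = np.random.default_rng(42)
n = 1_000
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "latitude": rng.uniform(32.5, 42.0, n),
    "longitude": rng.uniform(-124.3, -114.3, n),
})
# Artificial target driven mainly by income, plus noise.
y = 40_000 * X["median_income"] + rng.normal(0, 20_000, n)

# Hold out 20% of the districts for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the unseen test set with the root mean squared error.
predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

# Rank the features by their impurity-based importance.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(f"Test RMSE: {rmse:,.0f}")
print(importances)
```

Because the artificial target depends only on `median_income`, that feature dominates the ranking here. On the real dataset the ranking has to be read from the fitted model.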
The primary emphasis will be on:
- The importation and examination of the dataset to ensure its readiness for analysis.
- Conducting Exploratory Data Analysis with Pandas and Seaborn to understand the data, so that we can select and create appropriate features and choose an appropriate model.
- Undertaking data pre-processing and feature engineering using Pandas to prepare the data for modeling.
While integrating these steps into a scikit-learn pre-processing pipeline is common for ongoing machine learning operations, for a one-off project like ours, employing straightforward Pandas code for data manipulation and preparation is equally effective.
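To make the plain-Pandas approach concrete, here is a small sketch of typical preparation steps on a made-up four-row frame. The helper `preprocess`, the column names other than `ocean_proximity` and `income_cat`, the income bin edges, and the choice of median imputation are assumptions for illustration, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; values and most column names are illustrative.
raw = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "median_income": [8.3, 8.3, 7.3, 1.9],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})


def preprocess(df):
    """Hypothetical helper: clean, engineer features, and encode categories."""
    out = df.copy()
    # Fill missing values with the column median (assumed strategy).
    out["total_bedrooms"] = out["total_bedrooms"].fillna(
        out["total_bedrooms"].median()
    )
    # Ratio feature: average number of rooms per household.
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    # Bin income into categories; the bin edges are assumed.
    out["income_cat"] = pd.cut(
        out["median_income"],
        bins=[0, 3, 6, np.inf],
        labels=["low", "medium", "high"],
    )
    # One-hot encode the categorical location feature.
    out = pd.get_dummies(out, columns=["ocean_proximity"], dtype=int)
    return out


prepared = preprocess(raw)
print(prepared.head())
```

The same steps could later be moved into a scikit-learn pipeline if the project needed to be rerun regularly on new data.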
View this notebook with interactive links on nbviewer
Data Preprocessing & Feature Engineering for Machine Learning - Predicting House Prices in California
- (1) Project Introduction
- (2) Data Import and Inspection
- (3) Data Cleaning
- (4) Feature Engineering - Part 1
- (5) Exploratory Data Analysis
  - (5.1) Ocean Proximity and House Values
  - (5.2) Correlation Analysis
  - (5.3) Median Income and House Value
  - (5.4) Non-linear Relationships Between Location and Prices
  - (5.5) Explore "ocean_proximity" and "median_house_value" relationship using barplot
  - (5.6) Explore "income_cat" and "ocean_proximity" relationship using heatmap
- (6) Preprocessing for Machine Learning - Part 2
- (7) Splitting the Data into Train and Test Sets
- (8) Training the ML Model (Random Forest Regressor)
- (9) Testing/Evaluating the Model on the Test Set
- (10) Feature Importance