This is an end-to-end data science project that estimates salaries for various data science roles.
- Web scraping was done from Glassdoor using Selenium and Python for various job roles (up to 1000 jobs).
- Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
- Exploratory data analysis was performed using the seaborn, matplotlib, and pandas libraries.
- Linear regression, Lasso regression, and random forest regression models were tried; random forest performed the best (MAE ~ ₹45k).
- Built a Flask API endpoint.
- Created a web app using Streamlit and Python.
- Install the required packages: `pip install -r flaskAPI/requirements.txt`
- Get the virtual environment running
- Run the Flask API from the command line: `python flaskAPI/wsgi.py`
- Run the Streamlit app from the command line: `streamlit run flask/streamlitapp.py`
Using Selenium and Python, the following fields were scraped from Glassdoor for data science roles:
- Job title
- Salary Estimate
- Job Description
- Rating
- Company
- Location
- Company Headquarters
- Company Size
- Company Founded Date
- Type of Ownership
- Industry
- Sector
- Revenue
- Competitors
Data was scraped for the following job roles: 'AI engineer', 'Business analyst', 'Big data engineer', 'Deep learning engineer', 'Machine learning engineer', 'NLP engineer', 'Data analyst', 'Data engineer', 'Data scientist', 'Manager', and 'Software engineer'.
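A minimal sketch of what the scraping loop looks like, assuming Selenium 4 with chromedriver on PATH; the `scrape_jobs` helper and the CSS selectors are illustrative assumptions, not the exact code from the scraper linked in the resources below (Glassdoor's markup changes frequently):

```python
# Sketch of the Selenium scraping loop. Selectors are illustrative
# assumptions; the linked scraper maintains its own, working selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

def scrape_jobs(keyword: str, num_jobs: int) -> pd.DataFrame:
    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    driver.get(f"https://www.glassdoor.co.in/Job/jobs.htm?sc.keyword={keyword}")
    time.sleep(5)  # let the first page of listings render
    jobs = []
    for card in driver.find_elements(By.CSS_SELECTOR, "li[data-test='jobListing']"):
        if len(jobs) >= num_jobs:
            break
        try:
            jobs.append({
                "Job Title": card.find_element(By.CSS_SELECTOR, "[data-test='job-title']").text,
                "Company": card.find_element(By.CSS_SELECTOR, "[data-test='employer-name']").text,
                "Location": card.find_element(By.CSS_SELECTOR, "[data-test='emp-location']").text,
            })
        except Exception:
            continue  # skip listings missing a field
    driver.quit()
    return pd.DataFrame(jobs)
```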
After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables:
- Parsed the salary column to keep the numeric data only.
- Cleaned the company name column: split off the rating appended to the company name and removed duplicates.
- Parsed important details out of the job description column.
- Created separate columns for skills mentioned, such as Python, R, AWS, Spark, database, RStudio, Tableau, TensorFlow, NLTK, Power BI, Excel, Hadoop, Azure, scikit-learn, etc. (see the sketch after this list).
- Added a column for degree requirement.
- Simplified job roles into broader categories:
  - Data scientist
  - Data analyst
  - Machine learning engineer
  - Big data engineer
  - Data engineer
  - NLP engineer
  - AI engineer
  - Manager
  - Director
  - Business analyst
- Added a column for description length.

Final cleaning reduced the CSV to 594 rows, so the dataset the model is trained on is quite small.
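A minimal sketch of the salary parsing and skill-flag steps above, assuming raw column names like `Salary Estimate` and `Job Description` and a salary format such as "₹4L - ₹9L (Glassdoor Est.)"; the exact strings in the scraped CSV may differ:

```python
# Sketch of the salary parsing and skill-flag feature engineering.
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")  # filename is an assumption

# Keep numeric salary data only: strip currency text, take the range midpoint.
salary = df["Salary Estimate"].str.replace(r"[^0-9\-]", "", regex=True)
bounds = salary.str.split("-", expand=True).astype(float)
df["avg_salary"] = (bounds[0] + bounds[1]) / 2

# One binary column per skill mentioned in the job description.
skills = ["python", "excel", "aws", "spark", "tableau", "power bi"]
for skill in skills:
    df[skill.replace(" ", "_")] = (
        df["Job Description"].str.lower()
        .str.contains(skill, regex=False, na=False)
        .astype(int)
    )

df["desc_len"] = df["Job Description"].str.len()  # description length column
```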
I looked at the distributions of the data and the value counts for the various categorical variables.
First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
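A sketch of the encoding and split, assuming the cleaned DataFrame `df` from the previous step with `avg_salary` as the target:

```python
# Dummy-encode the categoricals and hold out 20% of the data for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

df_dum = pd.get_dummies(df)  # one binary column per category level
X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```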
I tried three different models and evaluated them using mean absolute error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren't particularly harmful for this type of model.
- Multiple linear regression – baseline for the project.
- Lasso regression – because of the sparse data from the many categorical variables, I thought a normalized regression like lasso would be effective.
- Random forest – again, given the sparsity of the data, I thought this would be a good fit.
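Continuing from the split above, a sketch of the model comparison; the hyperparameters shown are illustrative defaults, not the tuned values:

```python
# Fit the three models and compare them on mean absolute error.
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),  # alpha would be tuned, e.g. via GridSearchCV
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:,.0f}")
```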
In this step, I built a Flask API endpoint hosted on a local web server, following the TDS tutorial in the resources section below. The endpoint takes in a request with a list of values from a job listing and returns an estimated salary. I also created a web app using Streamlit and Python, likewise hosted locally.
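A minimal sketch of what such an endpoint looks like; the route name, payload shape, and pickled-model filename are assumptions rather than the exact contents of `flaskAPI/`:

```python
# Sketch of the prediction endpoint: load a pickled model, accept a list of
# feature values, and return the estimated salary as JSON.
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model_file.p", "rb") as f:  # pickled model path is an assumption
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects {"input": [feature values in training-column order]}
    features = np.array(request.get_json()["input"]).reshape(1, -1)
    return jsonify({"estimated_salary": float(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(debug=True)
```

The Streamlit app can then call this endpoint with `requests.post("http://127.0.0.1:5000/predict", json={"input": [...]})` and display the returned estimate.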
- Glassdoor web scraper: https://github.com/arapfaik/scraping-glassdoor-selenium
  Changes made to it:
  1. Adapted for the Indian Glassdoor site (https://www.glassdoor.co.in/)
  2. Salary estimate extraction from the salary tab for data scientist roles
- Code productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2
- Project Inspiration: