This is an end-to-end data science project that estimates salaries for various data science roles.
- Web scraping was done from Glassdoor using Selenium and Python for various job roles (up to 1000 jobs).
- Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
- Exploratory data analysis was performed using the seaborn, matplotlib, and pandas libraries.
- Linear regression, Lasso regression, and random forest regression models were tried; random forest performed the best (MAE ~ ₹45k).
- Built a Flask API endpoint.
- Created a web app using Streamlit and Python.
- Install the required packages: `pip install -r flaskAPI/requirements.txt`
- Get the virtual environment running
- Run the Flask API from the command line: `python flaskAPI/wsgi.py`
- Run the Streamlit app from the command line: `streamlit run flask/streamlitapp.py`
Using Selenium and Python, the following fields were scraped from Glassdoor for data science roles:
- Job title
- Salary Estimate
- Job Description
- Rating
- Company
- Location
- Company Headquarters
- Company Size
- Company Founded Date
- Type of Ownership
- Industry
- Sector
- Revenue
- Competitors
Data was scraped for the following job roles: 'AI engineer', 'Business analyst', 'Big data engineer', 'Deep learning engineer', 'Machine learning engineer', 'NLP engineer', 'Data analyst', 'Data engineer', 'Data scientist', 'Manager', and 'Software engineer'.
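A minimal sketch of what the scraping loop looks like, assuming Selenium 4 with chromedriver on PATH; the `scrape_jobs` helper and the CSS selectors are illustrative assumptions, not the exact code from the scraper linked in the resources below (Glassdoor's markup changes frequently):

```python
# Sketch of the Selenium scraping loop. Selectors are illustrative
# assumptions; the linked scraper maintains its own, working selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

def scrape_jobs(keyword: str, num_jobs: int) -> pd.DataFrame:
    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    driver.get(f"https://www.glassdoor.co.in/Job/jobs.htm?sc.keyword={keyword}")
    time.sleep(5)  # let the first page of listings render
    jobs = []
    for card in driver.find_elements(By.CSS_SELECTOR, "li[data-test='jobListing']"):
        if len(jobs) >= num_jobs:
            break
        try:
            jobs.append({
                "Job Title": card.find_element(By.CSS_SELECTOR, "[data-test='job-title']").text,
                "Company": card.find_element(By.CSS_SELECTOR, "[data-test='employer-name']").text,
                "Location": card.find_element(By.CSS_SELECTOR, "[data-test='emp-location']").text,
            })
        except Exception:
            continue  # skip listings missing a field
    driver.quit()
    return pd.DataFrame(jobs)
```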
After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables:
- Parsed the salary column to keep the numeric data only.
- Cleaned the company name column: split off the rating appended to the company name and removed duplicates.
- Parsed important details out of the job description column.
- Created separate columns for skills mentioned, such as Python, R, AWS, Spark, database, RStudio, Tableau, TensorFlow, NLTK, Power BI, Excel, Hadoop, Azure, scikit-learn, etc. (see the sketch after this list).
- Added a column for degree requirement.
- Simplified job roles into broader categories:
  - Data scientist
  - Data analyst
  - Machine learning engineer
  - Big data engineer
  - Data engineer
  - NLP engineer
  - AI engineer
  - Manager
  - Director
  - Business analyst
- Added a column for description length.

Final cleaning reduced the CSV to 594 rows, so the dataset the model is trained on is quite small.
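A minimal sketch of the salary parsing and skill-flag steps above, assuming raw column names like `Salary Estimate` and `Job Description` and a salary format such as "₹4L - ₹9L (Glassdoor Est.)"; the exact strings in the scraped CSV may differ:

```python
# Sketch of the salary parsing and skill-flag feature engineering.
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")  # filename is an assumption

# Keep numeric salary data only: strip currency text, take the range midpoint.
salary = df["Salary Estimate"].str.replace(r"[^0-9\-]", "", regex=True)
bounds = salary.str.split("-", expand=True).astype(float)
df["avg_salary"] = (bounds[0] + bounds[1]) / 2

# One binary column per skill mentioned in the job description.
skills = ["python", "excel", "aws", "spark", "tableau", "power bi"]
for skill in skills:
    df[skill.replace(" ", "_")] = (
        df["Job Description"].str.lower()
        .str.contains(skill, regex=False, na=False)
        .astype(int)
    )

df["desc_len"] = df["Job Description"].str.len()  # description length column
```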
I looked at the distributions of the data and the value counts for the various categorical variables.
First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
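A sketch of the encoding and split, assuming the cleaned DataFrame `df` from the previous step with `avg_salary` as the target:

```python
# Dummy-encode the categoricals and hold out 20% of the data for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

df_dum = pd.get_dummies(df)  # one binary column per category level
X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```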
I tried three different models and evaluated them using mean absolute error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren't particularly harmful for this type of model.
- Multiple linear regression – baseline for the project.
- Lasso regression – because of the sparse data from the many categorical variables, I thought a normalized regression like lasso would be effective.
- Random forest – again, given the sparsity of the data, I thought this would be a good fit.
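Continuing from the split above, a sketch of the model comparison; the hyperparameters shown are illustrative defaults, not the tuned values:

```python
# Fit the three models and compare them on mean absolute error.
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),  # alpha would be tuned, e.g. via GridSearchCV
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:,.0f}")
```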
In this step, I built a Flask API endpoint hosted on a local web server, following the TDS tutorial in the resources section below. The endpoint takes in a request with a list of values from a job listing and returns an estimated salary. I also created a web app using Streamlit and Python, likewise hosted locally.
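A minimal sketch of what such an endpoint looks like; the route name, payload shape, and pickled-model filename are assumptions rather than the exact contents of `flaskAPI/`:

```python
# Sketch of the prediction endpoint: load a pickled model, accept a list of
# feature values, and return the estimated salary as JSON.
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model_file.p", "rb") as f:  # pickled model path is an assumption
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects {"input": [feature values in training-column order]}
    features = np.array(request.get_json()["input"]).reshape(1, -1)
    return jsonify({"estimated_salary": float(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(debug=True)
```

The Streamlit app can then call this endpoint with `requests.post("http://127.0.0.1:5000/predict", json={"input": [...]})` and display the returned estimate.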
- Glassdoor web scraper: https://github.com/arapfaik/scraping-glassdoor-selenium
  Changes made to it:
  1. Adapted for the Indian Glassdoor site (https://www.glassdoor.co.in/)
  2. Salary estimate extraction from the salary tab for data scientist roles
- Code productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2
- Project Inspiration: