- Overview
- Data
- Feature Engineering Analysis
- Model Performance Metrics
- Results
- Installation
- Challenges and Limitations
- Contributing
- License
- Citation
- Technologies Used
- Next Steps
- Support
This project performs a regression analysis of employee wages across the United States. It considers features such as the employee's industry, area, and state to determine how each one affects wages and which factors contribute most to wage variation.
The entire study is carried out in the cloud using Azure ML Studio, and the models are deployed in the same cloud environment.
This project was inspired by recent concerns and changes pertaining to employment in the United States and their impact on business intelligence.
The data comes from the US Bureau of Labor Statistics, sourced from its "State and Metro Area Employment, Hours, and Earnings" data. More information about the data is available at the following URL: https://learn.microsoft.com/en-us/azure/open-datasets/dataset-us-state-employment-earnings?tabs=azure-storage#data-access
The dataset can also be obtained from Azure Open Datasets. It contains more than 64 lakh (6.4 million) records.
We analyze the importance of different features in our models to understand which factors contribute most to wage variation.
- Variable Importances for Random Forest and LightGBM:
# Excerpt from the function that builds the feature pipeline.
# get_mapper_0 ... get_mapper_3 are called here but defined elsewhere in the project.
from sklearn.pipeline import FeatureUnion

column_group_1 = ['state_code', 'data_type_code', 'supersector_code', 'period', 'footnote_codes', 'supersector_name', 'data_type_text', 'state_name']
column_group_2 = ['seasonal']
column_group_0 = ['area_code', 'industry_code', 'industry_name', 'area_name']
column_group_3 = [['year']]

# Each mapper transforms one group of columns; FeatureUnion concatenates their outputs.
feature_union = FeatureUnion([
    ('mapper_0', get_mapper_0(column_group_0)),
    ('mapper_1', get_mapper_1(column_group_1)),
    ('mapper_2', get_mapper_2(column_group_2)),
    ('mapper_3', get_mapper_3(column_group_3)),
])
return feature_union
- Regression Metrics: some of the regression metrics used in the study are:
- Variance
- Mean Absolute Percentage Error
- Mean Absolute Error
- Normalized Mean Absolute Error
- R2_Score
- Root Mean Squared Error
- Normalized Root Mean Squared Error, etc.
- Code Snippet for Metrics Analysis:
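import numpy as np
from azureml.training.tabular.preprocessing._dataset_binning import make_dataset_bins
from azureml.training.tabular.score.scoring import score_regression

# The enclosing function is not shown in the original excerpt; the name and
# signature used here are assumed. get_metrics_names() is defined elsewhere in the project.
def calculate_metrics(model, y, sample_weights, X_test, y_test):
    y_pred = model.predict(X_test)
    # Summary statistics of the target, passed to the scorer
    y_min = np.min(y)
    y_max = np.max(y)
    y_std = np.std(y)
    bin_info = make_dataset_bins(X_test.shape[0], y_test)
    metrics = score_regression(
        y_test, y_pred, get_metrics_names(), y_max, y_min, y_std, sample_weights, bin_info)
    return metrics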
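To make the metrics listed above concrete outside Azure, the following dependency-free sketch computes them by hand on a few made-up numbers. Two points are assumptions rather than facts from this project: "Variance" is read as explained variance, and the normalized metrics divide by the range of the true values (suggested by the y_min and y_max arguments in the snippet above). The function name regression_metrics and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics for two equal-length lists of numbers."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the true value, so it is undefined if any true value is 0.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    ss_residual = sum(e * e for e in errors)
    r2 = 1.0 - ss_residual / ss_total

    # Explained variance compares the variance of the errors with the variance of the target.
    mean_error = sum(errors) / n
    error_variance = sum((e - mean_error) ** 2 for e in errors) / n
    explained_variance = 1.0 - error_variance / (ss_total / n)

    # Assumption: normalized metrics divide by the range of the true values.
    value_range = max(y_true) - min(y_true)

    return {
        "explained_variance": explained_variance,
        "mean_absolute_percentage_error": mape,
        "mean_absolute_error": mae,
        "normalized_mean_absolute_error": mae / value_range,
        "r2_score": r2,
        "root_mean_squared_error": rmse,
        "normalized_root_mean_squared_error": rmse / value_range,
    }


# Made-up hourly wages and predictions, for illustration only.
sample_true = [20.0, 25.0, 30.0, 40.0]
sample_pred = [22.0, 24.0, 33.0, 38.0]
sample_metrics = regression_metrics(sample_true, sample_pred)

for name, value in sample_metrics.items():
    print(f"{name}: {value:.4f}")
```

Normalizing by the range makes errors comparable across targets measured on different scales, which is useful when wages differ widely between industries or areas.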
The results revealed that, apart from the experience of the employees, the industry and especially the area in which they work have a strong effect on how much wages vary.
To set up the project environment:
- Clone the repository:
git clone https://github.com/GaneshKotaSLU/Azure-ML---US-Wage-Regression-Analysis.git
- Navigate to the Project Directory:
cd Azure-ML---US-Wage-Regression-Analysis
- Data quality and completeness varied across different employee segments.
- Since the application is hosted on Azure, it becomes inaccessible once the subscription expires or the resource quota is exhausted.
- An Azure subscription is required if you would like to deploy the model to the server.
- Because of the high volume of available data, data processing and model building take a long time.
Contributions to this project are welcome. Please follow these steps:
- Create a new branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE.md file for details.
If you use this work in your research, please cite:
Kota, G. (2023). Regression analysis of US employee wages using Azure ML. GitHub repository, https://github.com/GaneshKotaSLU/Azure-ML---US-Wage-Regression-Analysis
Some of the technologies used in this project:
- Python 3.8+
- Azure Machine Learning Studio
- LightGBM
- Tree Based Models
- Pandas
- Scikit-learn
- Matplotlib
This project can be further enhanced by incorporating more information, such as the employees' domain and country, and it can be fully hosted on live data as long as the cloud subscription is active.
Support our work by starring our GitHub repository. For any questions or suggestions, please open an issue in the repository.