Course project for Georgia Tech Data Mining & Visual Analytics CSE6242. All data engineering, analytics, modeling, and visualization scripts and sample outputs are included in this repository. The goal of this project is to identify medical providers that are over-prescribing opioids and possibly contributing to the opioid epidemic.
Presentation deliverables:
Analysis unfolds in these steps:
- Access CMS API to pull provider level data and key variables (1.1 million records)
- Preprocess data into a clean training dataset, extract features from provider attributes
- Train models to predict opioid prescription rate and opioid days supply based on provider attributes
- Calculate residuals, feature value contributions for each prediction and append to original dataset
- Segment data set to high residual outliers, cluster and visualize results
All data comes from the Socrata API, which provides provider level data and a host of key variables. Data was stored in a SQLite database, indexed and fragmented to allow for cross joining the original dataset, prediction dataset, and analytics dataset, all to provide a data layer for an application to visualize the results of the analysis.
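A minimal sketch of the pull-and-store pattern described above, under stated assumptions: the table name `providers`, the columns `npi`, `specialty`, and `opioid_prescribing_rate`, and the helper `load_pages` are hypothetical and not taken from this repository. The `fetch_page` callable stands in for a paginated Socrata request (with sodapy this might be `client.get(dataset_id, limit=limit, offset=offset)`); here it is stubbed with in-memory rows so the example runs offline.

```python
import sqlite3

def load_pages(conn, fetch_page, page_size=2):
    """Pull pages until an empty page comes back and insert the rows into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS providers ("
        "npi TEXT PRIMARY KEY, specialty TEXT, opioid_prescribing_rate REAL)"
    )
    offset = 0
    total = 0
    while True:
        rows = fetch_page(limit=page_size, offset=offset)
        if not rows:
            break
        conn.executemany(
            "INSERT OR REPLACE INTO providers "
            "VALUES (:npi, :specialty, :opioid_prescribing_rate)",
            rows,
        )
        offset += page_size
        total += len(rows)
    # Index the column used for filtering and joins.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_providers_specialty ON providers (specialty)"
    )
    conn.commit()
    return total

# Stub standing in for the real API (hypothetical sample rows).
SAMPLE = [
    {"npi": "1", "specialty": "Dermatology", "opioid_prescribing_rate": 2.5},
    {"npi": "2", "specialty": "Neurology", "opioid_prescribing_rate": 4.0},
    {"npi": "3", "specialty": "Dermatology", "opioid_prescribing_rate": 1.0},
]

def fake_fetch(limit, offset):
    return SAMPLE[offset:offset + limit]

conn = sqlite3.connect(":memory:")
n_loaded = load_pages(conn, fake_fetch)
n_derm = conn.execute(
    "SELECT COUNT(*) FROM providers WHERE specialty = ?", ("Dermatology",)
).fetchone()[0]
```

Passing the fetch function in as a parameter keeps the storage logic testable without network access or API credentials.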
All 1.1 million plus provider records were preprocessed and cleaned to allow for predictive modeling and analytics. Feature engineering handled variables with 50+ levels using vtreat style methods. Other preprocessing, such as one-hot encoding, variance filters, and recursive feature engineering, was used to develop the final training and analysis set. These methods reduced the feature space from 500+ possible variables to an information-rich set of around 50 features.
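One way to handle a categorical variable with 50+ levels is to replace each level with a smoothed estimate of the target mean, which is the general idea behind impact coding in vtreat. The sketch below is a plain-Python illustration of that idea, not vtreat's actual implementation; the function name `impact_code`, the smoothing weight, and the sample data are assumptions.

```python
from collections import defaultdict

def impact_code(levels, targets, smoothing=10.0):
    """Map each category level to a smoothed deviation from the global target mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    codes = {}
    for level in counts:
        # Shrink the level mean toward the global mean; rare levels shrink more.
        smoothed = (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        codes[level] = smoothed - global_mean
    return codes, global_mean

# Hypothetical specialty labels and prescription rates.
specialties = ["A", "A", "B", "B", "B", "C"]
rates = [1.0, 3.0, 4.0, 4.0, 4.0, 8.0]
codes, global_mean = impact_code(specialties, rates, smoothing=1.0)

# Unseen levels fall back to 0.0, which corresponds to the global mean.
encoded = [codes.get(s, 0.0) for s in specialties + ["unseen"]]
```

In practice the codes should be estimated on data held out from the rows being encoded (vtreat does this with cross-frames), otherwise the encoded feature leaks the target.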
Data preprocessing scripts are all available in the data directory, including accessing the Socrata API, data cleaning, and preprocessing.
The goal of this project was to build a highly skilled model that predicts provider level Opioid Prescription Rate and Opioid Days Supply using provider meta-data, patient meta-data, and prescription details. An extremely accurate model allows us to confidently examine the residual outliers: providers with prescription rates above their predicted expected amounts. With a highly skilled model, we can be more confident that the residuals are not artifacts of the model but are driven by other factors.
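The residual logic above can be sketched as follows. Residuals are taken as actual minus predicted, and providers whose residual exceeds a cutoff are flagged. The cutoff rule used here (mean plus two standard deviations of the residuals), the function name, and all numbers are illustrative assumptions; the source does not state which threshold was used.

```python
from statistics import mean, pstdev

def flag_high_residuals(actual, predicted, n_std=2.0):
    """Return residuals and the indices whose residual exceeds mean + n_std * std."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    cutoff = mean(residuals) + n_std * pstdev(residuals)
    flagged = [i for i, r in enumerate(residuals) if r > cutoff]
    return residuals, flagged

# Hypothetical actual and predicted prescription rates for eight providers.
actual_rate = [2.0, 3.0, 2.5, 3.5, 2.0, 3.0, 2.5, 20.0]
predicted_rate = [2.0, 3.0, 3.0, 3.0, 2.5, 2.5, 2.5, 4.0]
residuals, flagged = flag_high_residuals(actual_rate, predicted_rate)
```

Only positive residuals are flagged, since the interest is in providers prescribing above their expected amount.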
Final models for both targets performed extremely well:
- Opioid Prescription Rate Model: Mean Absolute Error of 1.5% on average
- Opioid Days Supply Model: Mean Absolute Error of 29% on average
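For reference, mean absolute error averages the absolute gap between actual and predicted values. The sketch below uses made-up numbers, not the project's data, and how the percentages above were normalized is not stated in the source.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prescription rates in percent.
mae = mean_absolute_error([2.0, 5.0, 1.0, 4.0], [3.0, 4.0, 1.0, 6.0])
```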
Techniques applied (see the model directory for details):
- Boosted Tree Methods
- Bayesian Hyperparameter Optimization
- Shapley (SHAP) Model Prediction Inference
Exploratory analysis of the model results provided some interesting insights:
- Registered Nurses have highest prescription rate residuals across all credentials
- Dermatology, Neurology have the highest residual prescription rates across all specialties
Outliers were analyzed using various advanced clustering techniques such as DBSCAN. The model segmented the outliers into 5 interpretable clusters based on key variables such as opioid claim count, beneficiary features, days supply, prescription rate, credentials, specialty, and region. Across the 5 clusters, the credentials, specialty, and region of the provider were the most influential features in separating the clusters.
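To illustrate the clustering step, here is a compact plain-Python sketch of the DBSCAN algorithm on 2-D points. It is for illustration only: in practice a library implementation such as scikit-learn's `DBSCAN` would normally be used, and the `eps`, `min_pts`, and sample points below are assumptions rather than the project's settings.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

# Two tight groups and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Points that fall in no dense region keep the `NOISE` label, which is one reason a density-based method suits data where some outliers belong to no group.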
Shapley (SHAP) analysis of our model's predictions provided observation level inference on our outliers. This analysis drew attention to the count of beneficiaries under age 65. Currently, to qualify for Medicare benefits before age 65, a patient must have a disability.
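To make observation level attribution concrete, the sketch below computes exact Shapley values for a tiny toy model by averaging each feature's marginal contribution over all feature orderings. This is not how the `shap` package computes its values (it provides much faster estimators, such as `TreeExplainer` for tree models), and the toy model, feature names, and baseline are all hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)  # start with every feature at its baseline value
        prev = model(current)
        for name in order:
            current[name] = x[name]  # switch this feature to the observed value
            value = model(current)
            totals[name] += value - prev
            prev = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Hypothetical toy model with an interaction term.
def toy_model(f):
    return 2.0 * f["claims"] + 1.0 * f["under_65"] + f["claims"] * f["under_65"]

x = {"claims": 3.0, "under_65": 2.0}
baseline = {"claims": 0.0, "under_65": 0.0}
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that lets each provider's prediction be decomposed feature by feature.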
All analysis and model results were packaged into a simple user interface to allow end-users to explore the results. This dashboard (hosted in Tableau) allows users to examine outliers across:
- Outlier Location: where are the outliers located geographically?
- Who are they?: provider and patient level meta-data for the selected outlier
- Why are they an outlier?: features driving model predictions for both days supply and prescription rate models
- How do they compare to other outliers?: clustering analysis
The key point of the analysis is to examine outliers whose opioid prescription rate and days supply are over and above the expected values for the provider, based on provider and patient attributes. We believe that these high residual outliers could be an indicator of practices with unhealthy opioid prescription rates.
In analyzing outliers, we found multiple providers that had recently been either convicted of prescription malpractice or placed under investigation. This does not mean that all outliers are in question, but this analysis and application could be a useful tool in curbing the opioid epidemic.
Opioid Malpractice Case Identified by Nopioid Model
All engineering, analytics, and modeling are performed in Python 3. The main database used to store and index data is SQLite. Python package requirements are listed below. Accessing the CMS API requires you to generate your own key and token to stream in the 2017 Provider Summary Data File. The final project dashboard is available through Tableau Server. You can download Tableau locally and visualize some proof of concept templates here.
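pandas
sodapy # access the Socrata API directly
scipy.stats # part of scipy
numpy
matplotlib.pyplot # part of matplotlib
seaborn
shap
sqlite3 # Python standard library
vtreat
scikit-learn
skopt # installed as scikit-optimize
pickle # Python standard library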
All packages can be installed in a traditional Python venv or conda environment.
pip install <required package>
conda install <required package>
Order of scripts to reproduce results from our models:
- .data.socrata_api.py # grab data from the CMS API, store locally to upload to sqlite
- .data.sqlite_script.txt # format data into types and index
- .model.opioid_model.py # train rate model
- .model.opioid_days_model.py # train days model
- .model.opioid_predictions.py # append predictions to original
- .model.opioid_days_predictions.py # append predictions to original
Training and appending predictions to the original dataset can take a significant amount of time. Google Colab notebooks are provided as a basis for appending predictions more quickly: intro to google colab notebooks.
Both will require mounting of the database to your Google Drive: how to mount data to google colab notebooks
- .model.opioid_predictions.ipynb # append predictions to original dataset
- .model.opioid_days_model.ipynb # append predictions to original dataset
To reproduce the clustering results on the outliers:
- .model.clustering_analysis.py # analysis and append to original dataset
The result is the final dataset that is uploaded to Tableau to visualize the output. Final project output lives in a Tableau dashboard hosted on Tableau Server.
Official insights dashboard: Nopioid (may be deprecated, please see presentations for details)
The insights dashboard contains the following analytics:
- Geographical location and search of outliers (including provider meta-data)
- Model residual summary and outlier search (including provider meta-data)
- SHAP value summaries of Days Supply Model by individual provider
- SHAP value summaries of Prescription Rate Model by individual provider
- Global clustering analysis of all outlier providers