
Using machine learning and applied analytics to identify high-residual opioid prescribers


olivierzach/nopioid


Nopioid: Targeting High Residual Opioid Prescription Medical Providers

Course project for Georgia Tech Data Mining & Visual Analytics CSE6242. All data engineering, analytics, modeling, and visualization scripts and sample outputs are included in this repository. The goal of this project is to identify medical providers that are over-prescribing opioids and possibly contributing to the opioid epidemic.

Presentation deliverables:

Project Poster

Final Report

Progress Report

Project Proposal

Project Slides

Analysis unfolds in these steps:

    - Access CMS API to pull provider level data and key variables (1.1 million records)
    - Preprocess data into a clean training dataset, extract features from provider attributes
    - Train models to predict opioid prescription rate, opioid day's supply based on provider attributes
    - Calculate residuals, feature value contributions for each prediction and append to original dataset
    - Segment data set to high residual outliers, cluster and visualize results

Data

All data comes from the Socrata API, which provides provider-level data and a host of key variables. Data was stored in a SQLite database, indexed and fragmented to allow joining across the original dataset, prediction dataset, and analytics dataset. Together these form a data layer for an application that visualizes the results of the analysis.
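As a rough illustration of the storage layer described above, the sketch below builds an in-memory SQLite database, indexes a provider key, and joins original records to predictions. The table names, column names, and the `npi` key are hypothetical and chosen only for this example; the repository's actual schema lives in the data directory.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables keyed by npi."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE providers (npi TEXT PRIMARY KEY, specialty TEXT, opioid_rate REAL);
        CREATE TABLE predictions (npi TEXT, predicted_rate REAL);
        CREATE INDEX idx_predictions_npi ON predictions (npi);
        """
    )
    # Made-up rows purely for illustration, not real CMS data.
    conn.executemany(
        "INSERT INTO providers VALUES (?, ?, ?)",
        [("1001", "Family Practice", 0.04), ("1002", "Dermatology", 0.12)],
    )
    conn.executemany(
        "INSERT INTO predictions VALUES (?, ?)",
        [("1001", 0.05), ("1002", 0.03)],
    )
    conn.commit()
    return conn


def residual_rows(conn):
    """Join observed and predicted rates; residual = observed - predicted."""
    query = """
        SELECT p.npi, p.specialty, p.opioid_rate - q.predicted_rate AS residual
        FROM providers AS p
        JOIN predictions AS q ON q.npi = p.npi
        ORDER BY residual DESC
    """
    return conn.execute(query).fetchall()


conn = build_demo_db()
rows = residual_rows(conn)  # highest-residual provider comes first
```

Indexing the join key keeps lookups fast once the tables hold on the order of a million rows, which is presumably why the data was indexed before being served to the dashboard.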

All 1.1 million plus provider records were preprocessed and cleaned to allow for predictive modeling and analytics. Feature engineering handled variables with 50+ levels using vtreat-style methods. Other preprocessing, such as one-hot encoding, variance filters and recursive feature engineering, was used to develop the final training and analysis dataset. These methods reduced the feature space from 500+ possible variables to an information-rich set of around 50 features.
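To make the high-cardinality handling more concrete, here is a minimal standard-library sketch of impact coding (the general idea behind vtreat-style treatment of categorical variables) together with a simple variance filter. This is not vtreat's implementation or the repository's code: the function names, the `smoothing` parameter, and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def impact_code(levels, target, smoothing=10.0):
    """Map each categorical level to a smoothed deviation of its target mean
    from the grand mean, so a 50+ level column becomes one numeric column."""
    grand_mean = sum(target) / len(target)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, target):
        sums[level] += y
        counts[level] += 1
    # n * (level_mean - grand_mean) / (n + smoothing): rare levels shrink toward 0.
    return {
        level: (sums[level] - counts[level] * grand_mean) / (counts[level] + smoothing)
        for level in counts
    }


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def drop_low_variance(columns, threshold=1e-8):
    """Keep only columns (name -> list of floats) whose variance exceeds threshold."""
    return {name: vals for name, vals in columns.items() if variance(vals) > threshold}


# Hypothetical specialty levels and prescription rates.
specialty = ["A", "A", "B", "C"]
rate = [0.10, 0.20, 0.02, 0.04]
codes = impact_code(specialty, rate, smoothing=0.0)
kept = drop_low_variance({"constant": [1.0, 1.0, 1.0], "varying": [0.0, 1.0, 2.0]})
```

In practice the codes should be fit on data the model is not trained on (vtreat does this with cross-frames), otherwise the encoded column leaks the target.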

Data preprocessing scripts are all available in the data directory including accessing the Socrata API, data cleaning and preprocessing.

Model Experiments and Analytics

The goal of this project was to build a highly skilled model that predicts provider-level Opioid Prescription Rate and Opioid Days Supply using provider meta-data, patient meta-data, and prescription details. An extremely accurate model allows us to confidently examine the residual outliers: providers with prescription rates above their predicted amounts. With a highly skilled model, we can be more confident that large residuals are not artifacts of model error but are influenced by other factors.

Final models for both targets performed extremely well:

    - Opioid Prescription Rate Model: Mean Absolute Error: 1.5% on average
    - Opioid Days Supply Model: Mean Absolute Error: 29% on average
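For readers unfamiliar with the metric, the sketch below shows how a mean absolute error and a set of high-residual outliers could be computed from observed and predicted values. The cutoff rule (residual above mean plus `k` standard deviations) and the sample numbers are assumptions made for this example; the README does not state the exact outlier threshold the project used.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def high_residual_indices(actual, predicted, k=2.0):
    """Indices whose residual (actual - predicted) exceeds mean + k * std.

    The mean + k * std rule is an illustrative assumption, not the
    project's documented outlier definition.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean = sum(residuals) / len(residuals)
    std = (sum((r - mean) ** 2 for r in residuals) / len(residuals)) ** 0.5
    cutoff = mean + k * std
    return [i for i, r in enumerate(residuals) if r > cutoff]


# Made-up prescription rates for five providers.
actual = [0.05, 0.04, 0.06, 0.05, 0.30]
predicted = [0.05, 0.05, 0.05, 0.05, 0.05]
mae = mean_absolute_error(actual, predicted)
outliers = high_residual_indices(actual, predicted, k=1.5)
```

Only positive residuals are flagged, since the analysis targets providers prescribing above their expected rate.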

Techniques applied, see model directory for details:

    - Boosted Tree Methods
    - Bayesian Hyperparameter Optimization
    - Shapley Model Prediction Inference

Results from the model exploratory data analysis provided some interesting insights:

    - Registered Nurses have highest prescription rate residuals across all credentials
    - Dermatology, Neurology have the highest residual prescription rates across all specialties

Outliers were analyzed using various clustering techniques such as DBSCAN. The clustering segmented the outliers into 5 interpretable clusters based on key variables such as opioid claim count, beneficiary features, days supply, prescription rate, credentials, specialty and region. Across the 5 clusters, the provider's credentials, specialty and region were the most influential features in separating the clusters.
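The sketch below is a small, standard-library implementation of the DBSCAN idea mentioned above, included only to show how density-based clustering separates dense groups from noise. It is not the repository's clustering script (the project lists scikit-learn, whose DBSCAN class would be the usual choice), and the points, `eps`, and `min_samples` values are made up.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Two dense groups and one isolated point (hypothetical 2-D features).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

Unlike k-means, the number of clusters is not fixed in advance; it falls out of the density parameters, and isolated providers are left unassigned as noise.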

Shapley analysis of our model's predictions provided observation-level inference on our outliers. According to this analysis, the count of beneficiaries younger than 65 stood out as an influential feature. Currently, to qualify for Medicare benefits before age 65, a patient must have a disability.
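To show what observation-level Shapley attribution means, the following sketch computes exact Shapley values for a single prediction by enumerating every feature coalition. It is a brute-force illustration under stated assumptions: the toy model, its coefficients, and the two feature meanings are hypothetical, and the `shap` package used by the project relies on much faster tree-specific algorithms rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their observed value, others the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_rate_model(features):
    """Hypothetical model: features[0] = under-65 beneficiary count,
    features[1] = a 0/1 specialty flag. Coefficients are invented."""
    under_65, flag = features
    return 0.02 + 0.001 * under_65 + 0.01 * flag + 0.0005 * under_65 * flag


x = [20, 1]
baseline = [0, 0]
phi = shapley_values(toy_rate_model, x, baseline)
```

The attributions sum to the gap between the provider's prediction and the baseline prediction, which is what lets a dashboard explain why a particular provider's expected rate is high or low.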

Analytics Application

All analysis and model results were packaged into a simple user interface to allow end-users to explore the results. This dashboard (hosted in Tableau) allows users to examine outliers across:

    - Outlier Location: where are the outliers located geographically?
    - Who are they?: provider and patient level meta-data for selected outlier
    - Why are they an outlier?: features driving model predictions for both days supply and prescription rate models
    - How do they compare to other outliers?: clustering analysis

Results

The key point of the analysis is to examine the outliers whose opioid prescription rate and days supply are over and above the expected values given the provider and patient attributes. We believe that these high residual outliers could be an indicator of practices with unhealthy opioid prescription rates.

In analyzing outliers, we found multiple providers that had recently either been convicted of prescription malpractice or come under investigation. This does not mean that all outliers are in question, but this analysis and application could be a useful tool in curbing the opioid epidemic.

Opioid Malpractice Case Identified by Nopioid Model

Prerequisites

All engineering, analytics, and modeling are performed in Python 3. Main database used to store and index data is SQLite. Python package requirements are listed below. Accessing the CMS API requires you to generate your own key and token to stream in the 2017 Provider Summary Data File. Final project dashboard is available through Tableau Server. You can download Tableau locally and visualize some proof of concept templates here.

pandas
sodapy # access the Socrata API directly
scipy.stats
numpy
matplotlib.pyplot
seaborn
shap
sqlite3
vtreat
scikit-learn
skopt
pickle

Installing

All packages can be installed in a standard Python venv or conda environment.

pip install <required package>
conda install <required package>

Execution

Order of scripts to reproduce results from our models:

- .data.socrata_api.py                  # grab data from the CMS API, store locally to upload to sqlite
- .data.sqlite_script.txt               # format data into types and index
- .model.opioid_model.py                # train rate model
- .model.opioid_days_model.py           # train days model
- .model.opioid_predictions.py          # append predictions to original
- .model.opioid_days_predictions.py     # append predictions to original

Training and appending predictions to the original dataset can take a significant amount of time. Google Colab notebooks are provided to append predictions more quickly: intro to google colab notebooks.

Both will require mounting of the database to your Google Drive: how to mount data to google colab notebooks

- .model.opioid_predictions.ipynb       # append predictions to original dataset
- .model.opioid_days_model.ipynb        # append predictions to original dataset

To reproduce the clustering results on the outliers:

- .model.clustering_analysis.py         # analysis and append to original dataset

The result is a final dataset that is uploaded to Tableau to visualize the output. The final project output lives in a Tableau dashboard hosted on Tableau Server.

Official insights dashboard: Nopioid (may be deprecated, please see presentations for details)

The insights dashboard contains the following analytics:

    - Geographical location and search of outliers (including provider meta-data)
    - Model residual summary and outlier search (including provider meta-data)
    - SHAP value summaries of Days Supply Model by individual provider
    - SHAP value summaries of Prescription Rate Model by individual provider
    - Global clustering analysis of all outlier providers