This project analyzes patient satisfaction data from various healthcare facilities across the United States using HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems) scores. The data is visualized through an interactive dashboard, displaying hospital ratings and comparisons of various factors such as communication with nurses, patient survey star ratings, and survey response rates.
The project also attempted to apply linear regression to understand relationships between different HCAHPS measures and overall patient satisfaction, but ran into challenges due to data limitations.
- Interactive Map of Hospital Ratings: A choropleth map displays hospital ratings by state, allowing users to see which areas have the highest-rated facilities based on HCAHPS Linear Mean Values.
- Interactive Scatter Plot: This scatter plot visualizes the relationship between communication with nurses and overall hospital satisfaction, enabling exploration of how specific measures impact patient perceptions.
- Top-Performing Hospitals: The top hospitals by HCAHPS Linear Mean Value are highlighted, providing insight into facilities with the best patient feedback.
- Linear Regression Analysis: Attempts were made to apply linear regression to predict hospital satisfaction ratings, but the results were inconclusive due to data quality issues (see details below).
- Data Import and Cleanup: The dataset is cleaned by removing rows with missing values in key columns such as `HCAHPS Linear Mean Value` and `HCAHPS Answer Percent`.
- Aggregation: Hospitals are aggregated by `Facility Name` to ensure that multiple records from the same hospital are summarized together.
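The cleanup and aggregation steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the sample rows are made up, the `State` column and the `clean_and_aggregate` helper are hypothetical, and averaging is only one possible way to summarize multiple records per hospital.

```python
import pandas as pd

# Made-up sample rows; the real CMS export has many more columns and rows.
raw = pd.DataFrame({
    "Facility Name": ["Alpha Hospital", "Alpha Hospital", "Beta Clinic", "Gamma Medical"],
    "State": ["TX", "TX", "CA", "NY"],
    "HCAHPS Linear Mean Value": ["90", "88", "Not Applicable", "85"],
    "HCAHPS Answer Percent": ["80", "78", "75", "70"],
})

KEY_COLUMNS = ["HCAHPS Linear Mean Value", "HCAHPS Answer Percent"]


def clean_and_aggregate(df):
    """Coerce key columns to numbers, drop incomplete rows, summarize per hospital."""
    out = df.copy()
    for col in KEY_COLUMNS:
        # Placeholders such as 'Not Applicable' become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=KEY_COLUMNS)
    # Assumption: records from the same hospital are summarized by their mean.
    return out.groupby("Facility Name", as_index=False).agg({
        "State": "first",
        "HCAHPS Linear Mean Value": "mean",
        "HCAHPS Answer Percent": "mean",
    })


hospitals = clean_and_aggregate(raw)
print(hospitals)
```

Coercing with `errors="coerce"` before dropping rows means text placeholders and true blanks are handled the same way.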
- Choropleth Map: Displays HCAHPS Linear Mean Value across the U.S., allowing users to hover over hospitals to see their details such as facility name and state.
- Scatter Plot: Displays the correlation between communication scores and overall patient satisfaction using the `HCAHPS Answer Percent` and `HCAHPS Linear Mean Value` columns.
- Hospital Information Lookup: A custom function allows users to input a hospital number to retrieve detailed information, such as the hospital's address, rating, number of surveys completed, and survey response rate. Footnotes provide further context on the ratings and results, helping users draw conclusions.
The goal was to fit a linear regression model to predict the `HCAHPS Linear Mean Value` based on factors such as `HCAHPS Answer Percent`, `Survey Response Rate Percent`, and other patient satisfaction metrics.
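A minimal sketch of such a model with scikit-learn is shown below. It runs on synthetic data generated purely for illustration, so the coefficients and score say nothing about the real HCAHPS dataset; the train/test split and the choice of two predictors are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; not drawn from the CMS dataset.
rng = np.random.default_rng(0)
n = 200
answer_pct = rng.uniform(60, 95, n)
response_rate = rng.uniform(10, 40, n)
linear_mean = 50 + 0.4 * answer_pct + 0.1 * response_rate + rng.normal(0, 1, n)

df = pd.DataFrame({
    "HCAHPS Answer Percent": answer_pct,
    "Survey Response Rate Percent": response_rate,
    "HCAHPS Linear Mean Value": linear_mean,
})

X = df[["HCAHPS Answer Percent", "Survey Response Rate Percent"]]
y = df["HCAHPS Linear Mean Value"]

# Hold out a quarter of the rows to check how well the fit generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the held-out rows

print("coefficients:", dict(zip(X.columns, model.coef_)))
print("held-out R^2:", round(r2, 3))
```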
- Data Inconsistency: Many of the key columns, such as `HCAHPS Linear Mean Value` and `HCAHPS Answer Percent`, contained a large number of missing or non-numeric values (e.g., `'Not Applicable'`). Cleaning the dataset by converting or removing these values left it significantly smaller.
- Insufficient Data for Training: After cleaning, the number of valid samples that could be used to train the model was very small. Linear regression models rely on a sufficiently large dataset to accurately predict relationships between variables. With too few data points, the model cannot generalize well or find meaningful patterns.
- Low Variability: The data that remained after cleaning showed very little variability in certain columns. For instance, many hospitals had very similar `HCAHPS Linear Mean Value` scores and `HCAHPS Answer Percent` values, which limited the potential for the model to differentiate between different hospitals or accurately predict outcomes.
- Collinearity Issues: There was potential for multicollinearity in the dataset, where some predictor variables were highly correlated with each other. For example, the survey response rate might be related to other patient satisfaction metrics, making it difficult for the model to isolate the individual effects of each predictor.
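The issues listed above can be checked before fitting a model. The sketch below is one possible diagnostic, assuming a hypothetical `diagnose` helper and made-up sample rows: it counts how many rows survive cleaning, reports the spread of each column, and finds the strongest correlation between predictors.

```python
import numpy as np
import pandas as pd

TARGET = "HCAHPS Linear Mean Value"
PREDICTORS = ["HCAHPS Answer Percent", "Survey Response Rate Percent"]

# Made-up rows that mimic the problems described above.
raw = pd.DataFrame({
    "HCAHPS Linear Mean Value": ["88", "89", "Not Applicable", "88", "89", "Not Applicable"],
    "HCAHPS Answer Percent": ["78", "80", "75", "Not Applicable", "79", "81"],
    "Survey Response Rate Percent": ["20", "22", "18", "25", "21", "Not Applicable"],
})


def diagnose(df, target, predictors):
    """Summarize usable sample size, column spread, and predictor correlation."""
    columns = [target] + predictors
    numeric = df[columns].apply(pd.to_numeric, errors="coerce")
    usable = numeric.dropna()  # rows where every modeled column is numeric
    corr = usable[predictors].corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)  # ignore each predictor's correlation with itself
    return {
        "rows_total": len(df),
        "rows_usable": len(usable),
        "std": usable.std().to_dict(),
        "max_predictor_corr": float(np.nanmax(corr)),
    }


report = diagnose(raw, TARGET, PREDICTORS)
print(report)
```

A small `rows_usable`, a standard deviation near zero, or a predictor correlation close to 1 each suggest that a plain linear regression is unlikely to produce stable coefficients.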
Due to the large amount of missing data, low variability in key columns, and potential multicollinearity, the linear regression model was not able to find meaningful patterns or fit the data well. In future work, additional data preprocessing, collection of more diverse samples, or the use of more advanced models (such as regularized regression or machine learning algorithms) might help improve predictive accuracy.
This analysis highlights the importance of having high-quality, complete data when building statistical models and emphasizes the need for careful data preprocessing and cleaning.
- Run the code in a Jupyter Notebook or Python environment.
- Follow prompts to input a hospital number to retrieve detailed analysis or explore the interactive visualizations.
- Use the map and scatter plot to visually analyze trends in patient satisfaction scores across the U.S.
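The hospital lookup mentioned in the steps above might look something like this sketch. The column names, the sample rows, and the `lookup_hospital` helper are hypothetical, and it assumes the "hospital number" is a facility ID stored as text.

```python
import pandas as pd

# Made-up records; column names are assumptions for illustration.
hospitals = pd.DataFrame({
    "Facility ID": ["010001", "010005"],
    "Facility Name": ["Example General Hospital", "Example Community Hospital"],
    "Address": ["1 Example St", "2 Sample Ave"],
    "Patient Survey Star Rating": [4, 3],
    "Number of Completed Surveys": [512, 230],
    "Survey Response Rate Percent": [24, 19],
})


def lookup_hospital(df, facility_id):
    """Return one hospital's details as a dict, or None if the ID is unknown."""
    wanted = str(facility_id).strip()
    match = df[df["Facility ID"].astype(str) == wanted]
    if match.empty:
        return None
    return match.iloc[0].to_dict()


info = lookup_hospital(hospitals, " 010005 ")
print(info)

# In an interactive session the ID could come from a prompt instead:
# info = lookup_hospital(hospitals, input("Hospital number: "))
```

Keeping IDs as strings preserves leading zeros, which would be lost if the column were parsed as integers.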
- Pandas: For data manipulation and cleaning.
- Plotly Express: For creating interactive visualizations like the choropleth map and scatter plot.
- Scikit-Learn: Used for regression modeling.
- NumPy: For numerical operations and handling arrays.
- Matplotlib: Used for visualizing data and creating static plots.
- Seaborn: Used for creating enhanced, aesthetically pleasing statistical plots.
- Additional Metrics: Incorporate more HCAHPS measures or external data sources to create a more comprehensive view of hospital performance.
- Advanced Machine Learning Models: Implement regression or clustering algorithms to further explore relationships in the data.
- Python 3.x
- Pandas
- Plotly
- Scikit-learn
- NumPy
- Matplotlib
- Seaborn
- Clone the repository or download the Jupyter notebook.
- Download data from: https://data.cms.gov/provider-data/dataset/dgck-syfz#data-table
- Install the required dependencies using:
pip install -r requirements.txt
- Run the notebook or script to explore the data and visualize hospital performance across the United States.