This is a tool for working with the Community Health Status Indicators dataset.
Most of the functionality is in data_handler.py, which defines the CSHIDataHandler
class. The class is initialized with the following parameters:
- data_dir: the path to the CHSI dataset
- dependent: the name of the dependent variable
- exclude_cols: a list of columns to be excluded
- threshold: the proportion of values which must be present to include a predictor
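
To make the roles of dependent, exclude_cols, and threshold concrete, here is a minimal sketch of that kind of column selection on toy data. It is not the code in data_handler.py: the helper select_predictors, the column names, and the choice of "at least threshold" as the cutoff are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def select_predictors(df, dependent, exclude_cols=(), threshold=0.9):
    """Hypothetical helper: split df into predictors X and dependent Y.

    Drops the columns in exclude_cols, then keeps only those predictors
    whose proportion of non-missing values is at least `threshold`
    (the exact comparison used by the real class is an assumption).
    """
    y = df[dependent]
    X = df.drop(columns=[dependent, *exclude_cols])
    present = X.notna().mean()  # proportion of present values per column
    return X.loc[:, present >= threshold], y


# Toy data standing in for county-level indicators (made-up names and values).
toy = pd.DataFrame({
    "health_status": [12.0, 15.5, 9.8, 20.1],
    "obesity": [24.0, np.nan, 22.5, 30.2],      # 75% present
    "smoking": [np.nan, np.nan, np.nan, 25.0],  # 25% present
    "county_name": ["A", "B", "C", "D"],
})

X, Y = select_predictors(
    toy, "health_status", exclude_cols=["county_name"], threshold=0.7
)
print(list(X.columns))
```

With a threshold of 0.7, the sparsely populated smoking column is dropped while obesity is kept.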
Once initialized, the training_data() method returns a tuple X, Y: a Pandas DataFrame holding the predictors and a Series holding the dependent variable, with various cleaning already completed.
An attempt is made to ensure that entries within columns are comparable.
This means converting absolute counts to per-capita rates, with similar adjustments for time period and land area where applicable.
Missing values are also imputed.
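
The two cleaning steps described above can be sketched on toy data as follows. The column names, the per-100,000 scaling, and median imputation are assumptions chosen for illustration. The actual adjustments and imputation strategy are whatever data_handler.py implements.

```python
import numpy as np
import pandas as pd

# Made-up counties: one absolute count with a missing value.
counties = pd.DataFrame({
    "population": [10_000, 50_000, 200_000],
    "deaths": [120.0, 450.0, np.nan],
})

# Convert the absolute count to a rate per 100,000 residents so that
# small and large counties become comparable.
counties["deaths_per_100k"] = (
    counties["deaths"] / counties["population"] * 100_000
)

# Impute the missing rate with the column median (assumed strategy).
median_rate = counties["deaths_per_100k"].median()
counties["deaths_per_100k"] = counties["deaths_per_100k"].fillna(median_rate)
```

Imputing after the per-capita conversion, rather than before it, avoids filling a large county with a count that is typical of a small one.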
Other useful methods include:
- data_element: looks up an indicator in the DATAELEMENTDESCRIPTION.csv file.
- export_data: exports indicator data to a CSV file. Extra columns (e.g. predicted values of the dependent variable) can be included by specifying a Pandas DataFrame for the extra_columns parameter.
- all_county_data: returns all county-level data in a single DataFrame.
- state_us_averages: computes population-weighted averages for a list of columns, on the state and national levels.
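
As an illustration of what a population-weighted average means here, the sketch below computes state-level and national averages for a toy table. It is not the implementation of state_us_averages: the function name, the column names "state" and "population", and the assumption that no values are missing are all hypothetical.

```python
import pandas as pd


def population_weighted_averages(df, columns, weight_col="population",
                                 group_col="state"):
    """Hypothetical helper returning (state_averages, national_averages).

    Each average is sum(value * population) / sum(population).
    Assumes the selected columns contain no missing values.
    """
    weighted = df[columns].mul(df[weight_col], axis=0)
    state_totals = weighted.groupby(df[group_col]).sum()
    state_pop = df.groupby(group_col)[weight_col].sum()
    state_avg = state_totals.div(state_pop, axis=0)
    national_avg = weighted.sum() / df[weight_col].sum()
    return state_avg, national_avg


# Made-up counties in two made-up states.
toy = pd.DataFrame({
    "state": ["AA", "AA", "BB"],
    "population": [100, 300, 600],
    "obesity": [20.0, 30.0, 25.0],
})

state_avg, national_avg = population_weighted_averages(toy, ["obesity"])
print(state_avg)
print(national_avg)
```

In state AA the larger county dominates, so the weighted average (27.5) sits closer to 30 than the unweighted mean of 25 would.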
There are also shorthand methods for retrieving data from a given "page" (CSV file), e.g. mbd for "MEASURESOFBIRTHANDDEATH".
The notebook weighted_ridge_regression.ipynb models self-reported health status using scikit-learn's implementation of ridge regression.
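
For readers unfamiliar with the technique, here is a minimal sketch of ridge regression with per-sample weights in scikit-learn. It uses synthetic data rather than the CHSI dataset. Weighting each row by a population-like value is an assumption suggested by the notebook's name, not a description of what the notebook actually does.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "counties", 3 predictors, known coefficients.
n_samples = 200
true_coef = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n_samples, 3))
y = X @ true_coef + rng.normal(scale=0.1, size=n_samples)

# Hypothetical weights, e.g. county populations.
weights = rng.integers(1_000, 1_000_000, size=n_samples)

# Ridge adds an L2 penalty (strength alpha) to least squares;
# sample_weight makes heavily weighted rows count more in the fit.
model = Ridge(alpha=1.0)
model.fit(X, y, sample_weight=weights)

# Weighted R^2 on the training data.
r2 = model.score(X, y, sample_weight=weights)
print(model.coef_, r2)
```

In practice alpha would be chosen by cross-validation, and predictors on different scales would usually be standardized first, since the penalty treats all coefficients alike.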
You can also explore the dataset visually with my map.