[Feature request]: automated SMH submission formatting for emcee #430

Open
MacdonaldJoshuaCaleb opened this issue Dec 17, 2024 · 0 comments
Labels
enhancement: Request for improvement or addition of new feature(s).
gempyor: Concerns the Python core.
medium priority: Medium priority.
operations: Refers to specific epi modeling objectives or scenario modeling goals.
post-processing: Concern the post-processing.

Collaborator

MacdonaldJoshuaCaleb commented Dec 17, 2024

Label

enhancement

Priority Label

medium priority

Is your feature request related to a problem? Please describe.

Add functionality for formatting output for submission to SMH. Some work is still needed to generalize the inputs that are hard-coded at the start of the code. The code performs two aggregations: it sums all individual states up to a US-level simulation, and it sums all daily values within a given week to go from daily to weekly incidence, as per SMH submission requirements.
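
As a minimal sketch of the two aggregations described above (not the code proposed in this issue), the toy example below uses made-up daily values for two hypothetical states. The column name `incidH` and the frames `state_a` and `state_b` are assumptions for illustration only. Note that the full example in this issue additionally drops week 0 before summing.

```python
import pandas as pd

# Hypothetical daily incident hospitalizations for two states over 14 days
dates = pd.date_range("2024-07-28", periods=14, freq="D")
state_a = pd.DataFrame({"incidH": [1.0] * 14}, index=dates)
state_b = pd.DataFrame({"incidH": [2.0] * 14}, index=dates)

# Aggregation 1: element-wise sum across states gives a national daily series
us_daily = state_a + state_b

# Aggregation 2: assign each day to a 7-day week counted from the first date,
# then sum the daily values within each week
week = ((us_daily.index - us_daily.index.min()).days // 7).to_numpy()
us_weekly = us_daily.groupby(week).sum()

print(us_weekly)
```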

Is your feature request related to a new application, scenario round, pathogen? Please describe.

This takes care of formatting gempyor_simulate_proposal outputs for SMH submission.

Describe the solution you'd like

This comment provides a working example, but the code should be generalized. Note that this code assumes you've stored your output data as outlined in #416 (comment)

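import dill
import numpy as np
import pandas as pd

# get_seasons_keep is not defined in this snippet; it is presumably provided by
# the workflow referenced in #416. Only the list of FIPS codes is needed here.
_, _, fips = get_seasons_keep('H1N1','01000')
results = {}
scenarios = ['HiVax', 'MedVax', 'LowVax']
strains = ['H1', 'H3']
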
for strain in strains:
    for scenario in scenarios:
        for fip in fips:
            # only the state name is needed; avoid rebinding fips inside the loop over it
            _, state_name, _ = get_seasons_keep('H1N1', fip)
            path = f'./scenario_output/all_results_{strain}_{scenario}_{state_name}.pkl'
            with open(path, 'rb') as f:
                data = dill.load(f)
            results[f'{state_name}_{scenario}_{strain}'] = data
        keys = [key for key in results.keys() if scenario in key and strain in key]
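        # sum the j-th simulation across all states to build a US-level simulation
        US = []
        for j in range(100):  # assumes 100 simulations were stored per state
            # copy so the first state's stored frame is never modified in place
            temp = results[keys[0]][j].copy()
            for key in keys[1:]:
                temp = temp + results[key][j]
            temp['subpop'] = 'US'
            US.append(temp)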
        results[f'USA_{scenario}_{strain}'] = US

def get_week_number(date_obj, start_date):
    return (date_obj - start_date).days // 7

def get_sim_df(results_list, scenario, group):
    start_date = results_list[0].index.min()
    dates = [get_week_number(date, start_date) for date in results_list[0].index]
    location, vax, strain = scenario.split('_')
    scenario_label = {
        ('HiVax', 'H3'): 'A',
        ('HiVax', 'H1'): 'B',
        ('MedVax', 'H3'): 'C',
        ('MedVax', 'H1'): 'D',
        ('LowVax', 'H3'): 'E',
        ('LowVax', 'H1'): 'F'
    }[(vax, strain)]
    scenario_label = f'{scenario_label}-2024-09-20'
    
    sim_data = {
        'origin_date': [],
        'scenario_id': [],
        'location': [],
        'target': [],
        'horizon': [],
        'age_group': [],
        'value': [],
        'run_grouping': [],
        'stochastic_run': []
    }
    
    for i, result in enumerate(results_list):
        grouped = result.filter(regex='incidH.*_age').T.groupby(lambda x: x.split('_')[-1]).sum().T
        grouped['age0to130'] = grouped.sum(axis=1)
        grouped.columns = [col.replace('age0to4', '0-4').replace('age5to17', '5-17').replace('age18to49', '18-49')
                           .replace('age50to64', '50-64').replace('age65to100', '65-130').replace('age0to130', '0-130')
                           for col in grouped.columns]
        for age_group in grouped.columns:
            sim_data['origin_date'].extend(['2024-07-28'] * len(dates))
            sim_data['scenario_id'].extend([scenario_label] * len(dates))
            sim_data['location'].extend([result['subpop'].values[0][:2]] * len(dates))
            sim_data['target'].extend(['inc hosp'] * len(dates))
            sim_data['horizon'].extend(dates)
            sim_data['age_group'].extend([age_group] * len(dates))
            sim_data['value'].extend(grouped[age_group].values)
            sim_data['run_grouping'].extend([i + 1] * len(dates))
            sim_data['stochastic_run'].extend([1] * len(dates))
    return pd.DataFrame(sim_data)

def compile_sim_df(results_dict):
    sim_dfs = []
    group = 1
    for key, results_list in results_dict.items():
        sim_dfs.append(get_sim_df(results_list, key, group % 51 or 51))
        group += 1
    return pd.concat(sim_dfs, ignore_index=True)

def aggregate_sim_df(results_dict):
    res_df = compile_sim_df(results_dict)
    res_df = res_df[res_df['horizon'] > 0].reset_index(drop=True)
    grouped = res_df.groupby(['origin_date', 'scenario_id', 'location', 'target', 'horizon',
                              'age_group', 'run_grouping', 'stochastic_run'], as_index=False)['value'].sum()
    return grouped

#############################
# usage 

formatted = aggregate_sim_df(results)
# no quantiles, just simulations
formatted['output_type'] = 'sample'
formatted['output_type_id'] = np.nan  # np.NaN was removed in NumPy 2.0

# get columns in correct order 
formatted = formatted[['origin_date', 'scenario_id', 'location', 'target', 'horizon',
       'age_group', 'run_grouping', 'stochastic_run','output_type',
       'output_type_id', 'value']]

# save output to appropriate format 
from fastparquet import write
write('2024-08-11-ACCIDDA-FlepiMop-sample.gz.parquet', formatted, compression='GZIP')
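
One possible direction for the generalization requested above is to collect the values the example hard-codes (origin date, scenario date, the scenario letter mapping, and the number of simulations) into a single object that the formatting functions could accept. The sketch below is only an illustration: `SubmissionConfig` and its field names are hypothetical and are not part of gempyor; the values are copied from the example in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmissionConfig:
    """Hypothetical container for values the example in this issue hard-codes."""

    origin_date: str
    scenario_date: str
    scenario_letters: dict  # maps (vax, strain) to a scenario letter
    n_sims: int = 100

    def scenario_id(self, vax, strain):
        # Build the scenario id the same way the example does: "<letter>-<date>"
        return f"{self.scenario_letters[(vax, strain)]}-{self.scenario_date}"


config = SubmissionConfig(
    origin_date="2024-07-28",
    scenario_date="2024-09-20",
    scenario_letters={
        ("HiVax", "H3"): "A",
        ("HiVax", "H1"): "B",
        ("MedVax", "H3"): "C",
        ("MedVax", "H1"): "D",
        ("LowVax", "H3"): "E",
        ("LowVax", "H1"): "F",
    },
)

print(config.scenario_id("MedVax", "H1"))
```

Passing an explicit mapping, rather than deriving letters from list order, keeps the scenario ids independent of how the scenario and strain lists happen to be ordered.
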

MacdonaldJoshuaCaleb added the enhancement, gempyor, post-processing, and operations labels on Dec 17, 2024
TimothyWillard added the medium priority label on Dec 17, 2024