[Feature request]: automated SMH submission formatting for emcee #430

MacdonaldJoshuaCaleb · 2024-12-17T19:18:10Z

Label

enhancement

Priority Label

medium priority

Is your feature request related to a problem? Please describe.

This request adds functionality for formatting output for submission to SMH. Some work is still needed to generalize the inputs that are hard-coded at the start of the code. The code performs two aggregations: it sums all individual states up to a US-level simulation, and it sums all values within a given week to go from daily to weekly incidence, as SMH submission requirements specify.
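To make the two aggregations concrete, here is a minimal sketch on toy data. The two state frames, the `incidH` column name, and the dates are made up for illustration and are not the real gempyor outputs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 14 days of daily incidence for two made-up states.
dates = pd.date_range("2024-07-28", periods=14, freq="D")
state_a = pd.DataFrame({"incidH": np.arange(14, dtype=float)}, index=dates)
state_b = pd.DataFrame({"incidH": np.ones(14)}, index=dates)

# Aggregation 1: sum the date-aligned state frames to get a national series.
us_daily = state_a + state_b

# Aggregation 2: label each day with its whole-week offset from the first
# date, then sum within each week to go from daily to weekly incidence.
week = np.asarray((us_daily.index - us_daily.index.min()).days) // 7
us_weekly = us_daily.groupby(week).sum()

print(us_weekly)
```

Note that the code posted in this issue also drops week 0 (it keeps only `horizon > 0`) before summing, which this sketch does not do.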

Is your feature request related to a new application, scenario round, pathogen? Please describe.

Takes care of formatting gempyor_simulate_proposal outputs for SMH submission.

Describe the solution you'd like

This comment provides a working example, but the code should be generalized. Note that this code assumes you've stored your output data as outlined in #416 (comment).

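import dill
import numpy as np
import pandas as pd

# NOTE: get_seasons_keep is not defined in this snippet; it must already be in scope.
# This first call is only used to obtain the list of FIPS codes to loop over.
_, _, fips = get_seasons_keep('H1N1', '01000')
results = {}
scenarios = ['HiVax', 'MedVax', 'LowVax']
strains = ['H1', 'H3']
n_sims = 100  # number of simulations stored per state

for strain in strains:
    for scenario in scenarios:
        # load the per-state list of simulations for this strain/scenario
        for fip in fips:
            # do not rebind `fips` here; it is the list being iterated over
            _, state_name, _ = get_seasons_keep('H1N1', fip)
            path = f'./scenario_output/all_results_{strain}_{scenario}_{state_name}.pkl'
            with open(path, 'rb') as f:
                data = dill.load(f)
            results[f'{state_name}_{scenario}_{strain}'] = data
        # aggregate all states up to a US-level simulation, one simulation at a time
        keys = [key for key in results if key.endswith(f'_{scenario}_{strain}')]
        US = []
        for j in range(n_sims):
            # copy so that setting 'subpop' below never modifies the state-level data
            temp = results[keys[0]][j].copy()
            for key in keys[1:]:
                temp = temp + results[key][j]
            temp['subpop'] = 'US'
            US.append(temp)
        results[f'USA_{scenario}_{strain}'] = US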

def get_week_number(date_obj, start_date):
    return (date_obj - start_date).days // 7

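def get_sim_df(results_list, scenario, group):
    # NOTE: `group` is passed in by compile_sim_df but is not used in this function
    start_date = results_list[0].index.min()
    # horizon = number of whole weeks elapsed since the first simulated date
    dates = [get_week_number(date, start_date) for date in results_list[0].index]
    # keys look like '<state_name>_<vax>_<strain>'; split from the right so that
    # an underscore in the state name does not break the unpacking
    location, vax, strain = scenario.rsplit('_', 2)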
    scenario_label = {
        ('HiVax', 'H3'): 'A',
        ('HiVax', 'H1'): 'B',
        ('MedVax', 'H3'): 'C',
        ('MedVax', 'H1'): 'D',
        ('LowVax', 'H3'): 'E',
        ('LowVax', 'H1'): 'F'
    }[(vax, strain)]
    scenario_label = f'{scenario_label}-2024-09-20'
    
    sim_data = {
        'origin_date': [],
        'scenario_id': [],
        'location': [],
        'target': [],
        'horizon': [],
        'age_group': [],
        'value': [],
        'run_grouping': [],
        'stochastic_run': []
    }
    
    for i, result in enumerate(results_list):
        grouped = result.filter(regex='incidH.*_age').T.groupby(lambda x: x.split('_')[-1]).sum().T
        grouped['age0to130'] = grouped.sum(axis=1)
        grouped.columns = [col.replace('age0to4', '0-4').replace('age5to17', '5-17').replace('age18to49', '18-49')
                           .replace('age50to64', '50-64').replace('age65to100', '65-130').replace('age0to130', '0-130')
                           for col in grouped.columns]
        for age_group in grouped.columns:
            sim_data['origin_date'].extend(['2024-07-28'] * len(dates))
            sim_data['scenario_id'].extend([scenario_label] * len(dates))
            sim_data['location'].extend([result['subpop'].values[0][:2]] * len(dates))
            sim_data['target'].extend(['inc hosp'] * len(dates))
            sim_data['horizon'].extend(dates)
            sim_data['age_group'].extend([age_group] * len(dates))
            sim_data['value'].extend(grouped[age_group].values)
            sim_data['run_grouping'].extend([i + 1] * len(dates))
            sim_data['stochastic_run'].extend([1] * len(dates))
    return pd.DataFrame(sim_data)

def compile_sim_df(results_dict):
    sim_dfs = []
    group = 1
    for key, results_list in results_dict.items():
        sim_dfs.append(get_sim_df(results_list, key, group % 51 or 51))
        group += 1
    return pd.concat(sim_dfs, ignore_index=True)

def aggregate_sim_df(results_dict):
    res_df = compile_sim_df(results_dict)
    # drop week 0 and keep only horizons of one week or more
    res_df = res_df[res_df['horizon'] > 0].reset_index(drop=True)
    # sum the daily values that share a horizon to obtain weekly incidence
    grouped = res_df.groupby(['origin_date', 'scenario_id', 'location', 'target', 'horizon',
                              'age_group', 'run_grouping', 'stochastic_run'], as_index=False)['value'].sum()
    return grouped

#############################
# usage 

formatted = aggregate_sim_df(results)
# no quantiles, just simulations
formatted['output_type'] = 'sample'
# np.NaN was removed in NumPy 2.0; np.nan works in all versions
formatted['output_type_id'] = np.nan

# get columns in correct order 
formatted = formatted[['origin_date', 'scenario_id', 'location', 'target', 'horizon',
       'age_group', 'run_grouping', 'stochastic_run','output_type',
       'output_type_id', 'value']]

# save output to appropriate format 
from fastparquet import write
write('2024-08-11-ACCIDDA-FlepiMop-sample.gz.parquet', formatted, compression='GZIP')
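As one possible starting point for the generalization requested above, the hard-coded values could be collected into a single configuration object. This is only a sketch: `SubmissionConfig` and its field names are hypothetical and do not exist in gempyor. The defaults simply mirror the literals used in the example.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionConfig:
    """Hypothetical container for values the example above hard-codes."""

    origin_date: str = "2024-07-28"
    scenario_date: str = "2024-09-20"
    target: str = "inc hosp"
    n_sims: int = 100
    # maps (vaccination scenario, strain) to the scenario letter
    scenario_letters: dict = field(
        default_factory=lambda: {
            ("HiVax", "H3"): "A",
            ("HiVax", "H1"): "B",
            ("MedVax", "H3"): "C",
            ("MedVax", "H1"): "D",
            ("LowVax", "H3"): "E",
            ("LowVax", "H1"): "F",
        }
    )

    def scenario_id(self, vax: str, strain: str) -> str:
        # builds labels such as 'A-2024-09-20', as get_sim_df does
        return f"{self.scenario_letters[(vax, strain)]}-{self.scenario_date}"


cfg = SubmissionConfig()
print(cfg.scenario_id("MedVax", "H1"))
```

Functions such as `get_sim_df` could then take a config argument instead of embedding the dates, target name, and scenario mapping, so a new round would only need a new config.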

MacdonaldJoshuaCaleb added the enhancement, gempyor, post-processing, and operations labels Dec 17, 2024

TimothyWillard added the medium priority label Dec 17, 2024

TimothyWillard added this to the Post-Processing And Scenario Analysis Tools milestone Dec 17, 2024

MacdonaldJoshuaCaleb mentioned this issue Dec 18, 2024

[Feature request]: Add Output Plotting Options Of Panel Figure With Main Scenario Hub Targets #415

Open
