This repository contains source code for paper An analogy based framework for patient-stay identification in healthcare submitted to the ICCBR-ATA 2022 Workshop and all the work carried out in the M2 thesis titled Exploring Analogy based Applications in Healthcare. In this thesis, we focus on two tasks: (1) patient-stay identification, i.e., does a hospital stay belong to a patient or not?, using our first setting, and (2) disease prognosis, i.e., will a certain disease develop in the same way in two distinct patients?, using our second and third settings. We propose a prototypical architecture that combines patient-stay representation learning and the analogical reasoning framework. We train a neural model to detect patient-stay analogies. Our models are implemented using PyTorch.
MIMIC-III database analyzed in the study is available on PhysioNet repository. Here are some steps to prepare for the dataset:
- To request access to MIMIC-III, please follow https://mimic.physionet.org/gettingstarted/access/. Make sure to place
.csv files
underdata/mimic
. - With access to MIMIC-III, to build the MIMIC-III dataset locally using Postgres, follow the instructions at https://github.com/MIT-LCP/mimic-code/tree/master/buildmimic/postgres.
- Run SQL queries to generate necessary views, please follow https://github.com/onlyzdd/clinical-fusion/tree/master/query.
Install Anaconda (or miniconda to save storage space).
Then, create a conda environement (for example stay-analogy) and install the dependencies, using the following commands:
$ conda create --name stay-analogy python=3.9
$ conda activate stay-analogy
$ conda install -y pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch -c=conda-forge
$ conda install -y numpy scipy pandas scikit-learn
$ conda install -y tqdm gensim nltk
For the Identity setting (T1), run the following codes:
$ python3 T1_dataset.py # build the dataset of triples for the _Identity setting_
$ python3 preprocess01.py # define patient cohort, collect labels, extract temporal signals, and extract clinical notes
$ python3 preprocess02.py # run full preprocessing to obtain dictionaries
$ python3 doc2vec.py --phase train # train doc2vec model
$ python3 doc2vec.py --phase infer # infer doc2vec vectors
To train and evaluate the classification and the corresponding embedding model on structured and unstructured data, run:
$ python3 Identity/train_cnn_both.py # train classifier model together with the embedding model
$ python3 Identity/evaluate_cnn_both.py # evaluate a classifier with the corresponding embedding model
To train and evaluate the classification and the corresponding embedding model on only unstructured data, run:
$ python3 Identity/train_cnn_con.py # train classifier model together with the embedding model
$ python3 Identity/evaluate_cnn_con.py # evaluate a classifier with the corresponding embedding model
To train and evaluate the classification and the corresponding embedding model on only structured data
$ python3 Identity/train_cnn_demo.py # train classifier model together with the embedding model
$ python3 Identity/evaluate_cnn_demo.py # evaluate a classifier with the corresponding embedding model
For the Identity + Sequent setting (T2), run the following codes:
$ python3 T2&T3_dataset.py # build the dataset for the _Identity + Sequent setting_
# Before using the next four commands, make sure to change the file path in the script depending on the diagnosis level you are exploring, _e.g._, processed_icd_T2, processed_cat_T2, or processed_blk_T2. Do the same for the mimic file path, _e.g._, mimic_icd_T2, mimic_cat_T2, or mimic_blk_T2.
$ python3 preprocess01.py
$ python3 preprocess02.py # run full preprocessing to obtain dictionaries
$ python3 doc2vec.py --phase train # train doc2vec model
$ python3 doc2vec.py --phase infer # infer doc2vec vectors
To train and evaluate the classification and the corresponding embedding model on structured and unstructured data
For level 4 code, run the following:
$ python3 Identity+Sequent/T2_train_both_icd.py # train classifier model together with the embedding model
$ python3 Identity+Sequent/T2_evaluate_both_icd.py # evaluate a classifier with the corresponding embedding model
For level 3 code, run the following:
$ python3 Identity+Sequent/T2_train_both_cat.py # train classifier model together with the embedding model
$ python3 Identity+Sequent/T2_evaluate_both_cat.py # evaluate a classifier with the corresponding embedding model
For level 2 code, run the following:
$ python3 Identity+Sequent/T2_train_both_blk.py # train classifier model together with the embedding model
$ python3 Identity+Sequent/T2_evaluate_both_blk.py # evaluate a classifier with the corresponding embedding model
To train and evaluate the classification and the corresponding embedding model on only unstructured data
For level 4 code, run the following:
$ python3 Identity+Sequent/T2_train_con_icd.py # train classifier model together with the embedding model
$ python3 Identity+Sequent/T2_evaluate_con_icd.py # evaluate a classifier with the corresponding embedding model
For level 3 code, run the following:
$ python3 Identity+Sequent/T2_train_con_cat.py # train classifier model together with the embedding model
$ python3 Identity+Sequent/T2_evaluate_con_cat.py # evaluate a classifier with the corresponding embedding model
For level 2 code, run the following:
$ python3 Identity+Sequent/T2_train_con_blk.py # train classifier model together with the embedding model
$ python3 Identity+Sequent/T2_evaluate_con_blk.py # evaluate a classifier with the corresponding embedding model
For the Identity + Directly Sequent setting (T3), run the following codes:
$ python3 T2&T3_dataset.py # build the dataset for the _Identity + Directly Sequent setting_
# Before using the next four commands, make sure to change the file path in the script depending on the diagnosis level you are exploring, _e.g._, processed_icd_T3, processed_cat_T3, or processed_blk_T3. Do the same for the mimic file path, _e.g._, mimic_icd_T3, mimic_cat_T3, or mimic_blk_T3.
$ python3 preprocess01.py
$ python3 preprocess02.py # run full preprocessing to obtain dictionaries
$ python3 doc2vec.py --phase train # train doc2vec model
$ python3 doc2vec.py --phase infer # infer doc2vec vectors
To train and evaluate the classification and the corresponding embedding model on structured and unstructured data
For level 4 code, run the following:
$ python3 Identity+DSequent/T3_train_both_icd.py # train classifier model together with the embedding model
$ python3 Identity+DSequent/T3_evaluate_both_icd.py # evaluate a classifier with the corresponding embedding model
For level 3 code, run the following:
$ python3 Identity+DSequent/T3_train_both_cat.py # train classifier model together with the embedding model
$ python3 Identity+DSequent/T3_evaluate_both_cat.py # evaluate a classifier with the corresponding embedding model
For level 2 code, run the following:
$ python3 Identity+DSequent/T3_train_both_blk.py # train classifier model together with the embedding model
$ python3 Identity+DSequent/T3_evaluate_both_blk.py # evaluate a classifier with the corresponding embedding model
To train and evaluate the classification and the corresponding embedding model on only unstructured data
For level 4 code, run the following:
$ python3 Identity+DSequent/T3_train_con_icd.py # train classifier model together with the embedding model
$ python3 Identity+DSequent/T3_evaluate_con_icd.py # evaluate a classifier with the corresponding embedding model
For level 3 code, run the following:
$ python3 Identity+DSequent/T3_train_con_cat.py # train classifier model together with the embedding model
$ python3 Identity+DSequent/T3_evaluate_con_cat.py # evaluate a classifier with the corresponding embedding model
For level 2 code, run the following:
$ python3 Identity+DSequent/T3_train_con_blk.py # train classifier model together with the embedding model
$ python3 Identity+DSequent/T3_evaluate_con_blk.py # evaluate a classifier with the corresponding embedding model
data.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity settingT2_data_blk.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity + Sequent setting in the level 2 codeT2_data_cat.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity + Sequent setting in the level 3 codeT2_data_icd.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity + Sequent setting in the level 4 codeT3_data_blk.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity + Directly Sequent setting in the level 2 codeT3_data_cat.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity + Directly Sequent setting in the level 3 codeT3_data_icd.py
: tools to load the dataset, contains the main dataset classTask1Dataset
and the data augmentation functions for the Identity + Directly Sequent setting in the level 4 codeanalogy_classif_both.py
: neural network to classify analogies for structured and unstructured dataanalogy_classif_con.py
: neural network to classify analogies for unstructured dataanalogy_classif_demo.py
: neural network to classify analogies for structured datacnn_con.py
: neural network to embed clinical notescnn_dem.py
: neural network to embed static informationcnn_both.py
: neural network to embed static information and clinical notesutils.py
: tools for the different codesIdentity
folder: contains all training and evaluating codes for Identity settingIdentity+Sequent
folder: contains all training and evaluating codes for Identity + Sequent settingIdentity+DSequent
folder: contains all training and evaluating codes for Identity + Directly Sequent setting