How to use the embedding on new categorical data #24
Comments
Hi isaac2lord, this is a very interesting topic! I have little knowledge of medicine, but I will try my best to answer your question, so please correct me if I make some wrong assumptions. To learn the embeddings of patients and ICD10 codes from the 2-column data you have, you need to give the network a meaningful problem to answer. One problem I can think of is: "If the patient has diagnoses A, B and C, what is the probability that he also has diagnosis D?" (I assume you don't have the dates when the diagnoses were made, so there is no order among A, B, C, D. If you do have the dates, together with other information such as age/gender etc., that will make the learned embeddings more meaningful.) I would solve the problem with the following steps:
Now you can just feed in the data you have and let the network try to guess the remaining diagnosis. If the patient has fewer than 4 diagnoses, you can use a special code to represent the missing ones; for example, instead of using [A, B, C] as input you use [A, nd, nd]. One point where your data differs from the NLP context is that there is no order, as I assumed above, so you can permute your data to generate more training samples. For example, a patient with diagnoses A, B and C yields training data points like [A, B, nd] -> C.
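A minimal sketch of that sample-generation step (the function name and the fixed input length of 3 are my assumptions, chosen to match the [A, B, nd] -> C example above):

```python
from itertools import permutations

ND = "nd"    # special code standing in for "no diagnosis"
MAX_IN = 3   # input length, matching the [A, B, nd] -> C example

def make_samples(codes):
    """Expand one patient's diagnosis set into (input, target) pairs.

    Each code takes a turn as the target; the remaining codes are
    padded with ND up to MAX_IN and permuted to generate extra
    training points, since the diagnoses carry no natural order.
    """
    samples = []
    for i, target in enumerate(codes):
        rest = codes[:i] + codes[i + 1:]
        rest = rest + [ND] * (MAX_IN - len(rest))
        for perm in set(permutations(rest)):
            samples.append((list(perm), target))
    return samples

# A patient diagnosed with A, B and C:
for x, y in make_samples(["A", "B", "C"]):
    print(x, "->", y)   # e.g. ['A', 'B', 'nd'] -> C
```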
You don't need one-hot encodings. You just map each patient to an N-dimensional vector and map each ICD10 code, plus the extra nd code, to an M-dimensional vector. You can try different N and M to get the best prediction results. I would start with something meaningful: if you wanted to describe a patient/diagnosis, how many factors would you normally need? Of course it also depends on how much data you have; the more data you have, the higher dimensions you can afford to try. I would start with something like N=20 and M=20.
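For illustration, a hypothetical Keras sketch of this two-embedding setup; the layer names, the hidden-layer size, and the softmax "guess the missing diagnosis" head are my assumptions, not code from this repo:

```python
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from keras.models import Model

n_patients = 1_000_000   # unique patient ids
n_codes = 10_000 + 1     # ICD10 categories plus the special nd code
N, M = 20, 20            # embedding sizes to tune
MAX_IN = 3               # diagnosis codes fed in per sample

patient_in = Input(shape=(1,), name="patient")
codes_in = Input(shape=(MAX_IN,), name="diagnoses")

# One embedding table per entity type; integer ids index the tables
# directly, so no one-hot encoding is needed.
patient_vec = Flatten()(Embedding(n_patients, N)(patient_in))
code_vecs = Flatten()(Embedding(n_codes, M)(codes_in))

x = Concatenate()([patient_vec, code_vecs])
x = Dense(128, activation="relu")(x)
out = Dense(n_codes, activation="softmax")(x)  # which diagnosis is missing?

model = Model(inputs=[patient_in, codes_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

After training, the weight rows of the two Embedding layers are the learned patient and diagnosis vectors.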
Thanks much @entron for sharing your thoughts. I should say the data contains Diag_Date as well, but not a lot of patient info that we can really leverage (we are not allowed to use any member demographic fields in our modeling efforts). Here is a view of the data. There are about 80K patients with a max of 23 diagnosis codes. Thanks for pointing me to the paper, and yes, I am familiar with word2vec and its many variants. So assuming I reformat my data based on the above steps (with ordering), would the entity embedding take an approach similar to generative NLP models, trying to predict the next word in a sentence?
If the dates are not really usable, I think it is fine not to use them and still get meaningful embeddings.
Hello,
I have a general rudimentary question (sorry in advance).
I have reviewed (not fully) many parts of the code here. I'd like to test the proposed embedding on new data, but am not sure where to begin.
I have simple 2-column data: the first column is a patient id (assume 1M unique patients) and the second is an ICD10 diagnosis code (assume 10K categories). There are repeated measurements in the data, meaning that diagnoses can repeat within a given patient and across many patients.
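For concreteness, here is a toy version of that layout (patient ids, column names and codes are made up):

```python
import pandas as pd

# Toy version of the 2-column data: one row per (patient, diagnosis)
# event, with repeats both within and across patients.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 1],
    "icd10":      ["E11", "I10", "E11", "I10", "J45", "N18"],
})

# Collapse repeats to the set of unique codes per patient.
codes_per_patient = df.groupby("patient_id")["icd10"].apply(
    lambda s: sorted(set(s)))
print(codes_per_patient)
# patient_id
# 1    [E11, I10, N18]
# 2         [I10, J45]
```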
I tested Multiple Correspondence Analysis on categorical data from this link, but the results are not very useful.
Similar to the German States example in the repo, my goal is to perform (unsupervised) dimensionality reduction (such as what you'd see in a denoising AE that minimizes reconstruction error).
Appreciate any words of wisdom you may be able to share.