
preprocessing not working correctly #3

Open
AndreLamurias opened this issue Jul 30, 2019 · 6 comments
@AndreLamurias (Member)

At the moment the preprocessing step is not generating the correct output, and the trained model obtains low performance. In the meantime, I have uploaded the dditrain and dditest files, which you can move to the temp/ directory to train the model: https://drive.google.com/drive/folders/1wKfdeLGm9x4PbmfkYj9Iz8S7jZZz8PUJ?usp=sharing
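For example, something like this after downloading (a minimal sketch; the glob patterns assume the files keep their dditrain/dditest prefixes and were saved to the repository root):

```python
import glob
import shutil

# Move the downloaded preprocessed files into temp/, where train_rnn.py expects them.
for path in glob.glob("dditrain*") + glob.glob("dditest*"):
    shutil.move(path, "temp/")
```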

@mjlorenzo305

Thanks Andre @AndreLamurias
I was able to get the preprocessing step to complete. I previously thought it was hanging, but it turns out it simply takes a very long time. I set logging to INFO level to make it visible that progress was occurring (slowly) during the `get_ddi_sdp_instances` steps.

Once preprocessing completed, I verified that the dditrain numpy arrays contained data, which I then used to train the model. As you noted above, training the full model produced low performance (the model converges at around .45 F1 on the test set after 40 epochs).

I'll try out your preprocessed dataset above and retrain. I'll let you know.
Thanks, Mario

@mjlorenzo305

@AndreLamurias

I tried the provided preprocessed files by placing them in temp/ (and moving my previously generated files aside). After invoking the train process, it fails as follows:

```
Traceback (most recent call last):
  File "src/train_rnn.py", line 832, in <module>
    main()
  File "src/train_rnn.py", line 570, in main
    train(sys.argv[3], sys.argv[4:], train_inputs, id_to_index)
  File "src/train_rnn.py", line 397, in train
    inputs, w2v_layer, wn_index = prepare_inputs(channels, train_inputs, list_order, id_to_index)
  File "src/train_rnn.py", line 349, in prepare_inputs
    X_ids_left = preprocess_ids(X_subpaths_train[0], id_to_index, max_ancestors_length)
  File "src/train_rnn.py", line 204, in preprocess_ids
    idxs = [id_to_index[d.replace("_", ":")] for d in seq if d and d.startswith("CHEBI")]
  File "src/train_rnn.py", line 204, in <listcomp>
    idxs = [id_to_index[d.replace("_", ":")] for d in seq if d and d.startswith("CHEBI")]
KeyError: 'CHEBI:32134'
```

I ran it using the following command:

```
python src/train_rnn.py train temp/dditrain full_model words wordnet common_ancestors concat_ancestors
```

@AndreLamurias (Member, Author)

This is due to different versions of the ChEBI ontology: the ID of that compound was updated after we generated those files. I will open another issue so that the "alt_id" field is also considered.

For future reference, we used this version of the chebi ontology: ftp://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel158/
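For reference, handling "alt_id" could look roughly like this when building the ID index from the OBO file (a minimal sketch with simplified stanza parsing; the function name is illustrative and this is not the project's actual indexing code):

```python
def build_id_to_index(obo_path):
    """Map each ChEBI ID, including "alt_id" entries, onto a shared term index.

    Alternative IDs get the same index as their primary ID, so renamed
    compounds such as CHEBI:32134 still resolve.
    """
    id_to_index = {}
    index = 0
    in_term = False
    current_ids = []
    with open(obo_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("["):          # start of a new stanza
                if in_term and current_ids:   # close the previous [Term]
                    for chebi_id in current_ids:
                        id_to_index[chebi_id] = index
                    index += 1
                in_term = line == "[Term]"
                current_ids = []
            elif in_term and line.startswith("id: "):
                current_ids.append(line[len("id: "):])
            elif in_term and line.startswith("alt_id: "):
                current_ids.append(line[len("alt_id: "):])
    if in_term and current_ids:               # flush the final stanza
        for chebi_id in current_ids:
            id_to_index[chebi_id] = index
    return id_to_index
```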

@mjlorenzo305

mjlorenzo305 commented Aug 6, 2019

Thanks @AndreLamurias
I was able to complete the training using the above-mentioned version of the ChEBI OBO file along with the provided set of preprocessed data (numpy arrays).

I observed some improvement in model performance, with val_f1 at .60, but still not as high as expected after 100 epochs. Convergence occurs at around 30 epochs.

Any thoughts or ideas on what other parameter tuning is required?

Thanks, Mario

Here is the summary for the 100th epoch:

```
Epoch 100/100
244s - loss: 0.0268 - acc: 0.9902 - precision: 0.9745 - recall: 0.9702 - f1: 0.9714 - val_loss: 0.6955 - val_acc: 0.8616 - val_precision: 0.6170 - val_recall: 0.5524 - val_f1: 0.5732
predicted not false: 1372/1537
[[5945  133  180   74    9]
 [ 214  268   27    8    0]
 [ 212   15  383   27    2]
 [ 118    8   30  158    1]
 [  17    0    6    1   42]]
VAL_f1: 0.604 VAL_p: 0.653 VAL_r 0.564
Epoch 00100: val_loss did not improve from 0.39073
```
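As a sanity check on how the VAL scores relate to the confusion matrix: they appear to be macro-averages over the four interaction classes, ignoring the "no interaction" class (row/column 0). A minimal sketch that reproduces the printed values (the class ordering and averaging scheme are my inference, not taken from the evaluation code):

```python
import numpy as np

# Confusion matrix from the epoch summary: rows = true class, columns = predicted;
# class 0 is assumed to be "no interaction", classes 1-4 the four DDI types.
cm = np.array([
    [5945, 133, 180,  74,   9],
    [ 214, 268,  27,   8,   0],
    [ 212,  15, 383,  27,   2],
    [ 118,   8,  30, 158,   1],
    [  17,   0,   6,   1,  42],
])

tp = np.diag(cm)[1:]                     # true positives of the 4 DDI classes
precision = tp / cm[:, 1:].sum(axis=0)   # column sums: everything predicted as class k
recall = tp / cm[1:, :].sum(axis=1)      # row sums: everything truly of class k
f1 = 2 * precision * recall / (precision + recall)

print(f"VAL_f1: {f1.mean():.3f} VAL_p: {precision.mean():.3f} VAL_r {recall.mean():.3f}")
# prints: VAL_f1: 0.604 VAL_p: 0.653 VAL_r 0.564
```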

@mjlorenzo305

Following up on my last comment: it looks like I confused the DDI detection task with the DDI classification task. The model I trained above was for DDI classification, and therefore the val_f1 matches (or is slightly better than) the performance reported in the BOLSTM paper. (Correct me if I am mistaken.)

@AndreLamurias (Member, Author)

@mjlorenzo305 Yes, those scores are for the DDI classification task.
