Training stopped abruptly due to testing??? #41
Did some digging. I ran test_model.py (which was throwing the error) and this is the last log line before the failure: I0624 21:20:23.632242 3334 net.cpp:241] conv2_1 does not need backward computation. Does anyone have an idea what is happening?
@martinkersner In the py_img_seg_eval/eval_segm.py file, the function extract_masks() is called once on the predicted segmentation and once on the ground-truth label. The shapes of these two differ, which is the reason for the above-mentioned error: the ground-truth label is 3-dimensional while the prediction is 2-dimensional. Could you explain the reason for this? I think it is somehow related to the completely black output I am getting. How can one test a model with it?
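For reference, a minimal sketch of how the shape mismatch could be worked around before calling the evaluation functions, assuming the 3-channel ground-truth label stores the same class index in every channel and that eval_segm exposes pixel_accuracy and mean_IU as in py_img_seg_eval (file names below are hypothetical):

```python
# Sketch only: collapse a 3-channel ground-truth label to the 2-D class-index
# map that eval_segm expects, assuming all three channels are identical.
import numpy as np
from PIL import Image

import eval_segm  # py_img_seg_eval/eval_segm.py

pred = np.array(Image.open('prediction.png'))    # shape (H, W)
gt = np.array(Image.open('ground_truth.png'))    # shape (H, W, 3) in this scenario

if gt.ndim == 3:
    # Verify the channels really carry the same class index before dropping two of them.
    assert (gt[..., 0] == gt[..., 1]).all() and (gt[..., 0] == gt[..., 2]).all()
    gt = gt[..., 0]

print('shapes:', pred.shape, gt.shape)           # both should now be (H, W)
print('pixel accuracy:', eval_segm.pixel_accuracy(pred, gt))
print('mean IU:', eval_segm.mean_IU(pred, gt))
```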
@prakarshupmanyu I am not totally sure, but maybe you didn't prepare the labels correctly. Labels should be 1-channel images, not 3-channel images. I would also recommend using https://github.com/martinkersner/train-DeepLab, since I got better results with that network.
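If the labels are indeed 3-channel, a quick conversion pass before building the label LMDB could look like this sketch (directory names are hypothetical, and it assumes every channel stores the same class index):

```python
# Sketch only: convert 3-channel label PNGs to 1-channel class-index images.
import glob
import os
import numpy as np
from PIL import Image

os.makedirs('labels_1ch', exist_ok=True)
for path in glob.glob('labels_rgb/*.png'):
    label = np.array(Image.open(path))
    if label.ndim == 3:
        label = label[..., 0]  # keep one channel; assumes all channels are identical
    out = os.path.join('labels_1ch', os.path.basename(path))
    Image.fromarray(label.astype(np.uint8), mode='L').save(out)
```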
Thanks for the prompt reply. I followed the steps mentioned in the README, and the directory "train_labels_3_lmdb/" was created as per the documentation. I'll look into it further, try train-DeepLab as well, and get back to you.
@martinkersner I tried the same thing. I had to remove the testing part from solver.prototxt to make it work, but I was still getting completely black output. I think it has something to do with my training data, which are grayscale MRI images with a black background.
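One possible way to strip the test phase from the solver, sketched with Caffe's protobuf definitions rather than hand-editing (field names follow caffe.proto; if your solver uses a single "net" with include { phase: TEST } sections, the test layers live in the net prototxt instead, so this alone may not be enough):

```python
# Sketch only: remove the test-related fields from solver.prototxt so the
# solver never runs TestAll(). Output file name is hypothetical.
from caffe.proto import caffe_pb2
from google.protobuf import text_format

solver = caffe_pb2.SolverParameter()
with open('solver.prototxt') as f:
    text_format.Merge(f.read(), solver)

for field in ('test_net', 'test_iter', 'test_interval',
              'test_initialization', 'test_state'):
    solver.ClearField(field)

with open('solver_notest.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))
```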
@prakarshupmanyu Did you fix this problem: F0624 06:29:15.116205 3006 syncedmem.hpp:30] Check failed: error == cudaSuccess (11 vs. 0) invalid argument
@ThienAnh If I remember correctly, my test image dimensions were not compatible with the code given here, so I had to make some changes to the code in order to get it running.
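As an illustration of the kind of change that may be needed (not the exact fix used above), test images can be zero-padded up to the fixed input size declared in deploy.prototxt; the 500x500 value and file name below are placeholders:

```python
# Sketch only: zero-pad an image to the network's fixed input size (no resizing).
import numpy as np
from PIL import Image

INPUT_H, INPUT_W = 500, 500  # placeholders; read the real values from deploy.prototxt

def pad_to_net_input(img_path):
    img = np.array(Image.open(img_path))
    h, w = img.shape[:2]
    assert h <= INPUT_H and w <= INPUT_W, 'image is larger than the network input'
    canvas = np.zeros((INPUT_H, INPUT_W) + img.shape[2:], dtype=img.dtype)
    canvas[:h, :w] = img
    return canvas

print(pad_to_net_input('test_image.png').shape)
```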
I was training my network just as described in the README file. I did encounter some issues, but I was able to solve them by exploring the "Issues" section here. However, I am not able to find the cause of the following problem:
During training for 3 classes, after creating the first snapshot at iteration 1000, training stopped suddenly at iteration 1333. I have attached the train.log file for anyone to take a look.
train.txt
I am talking about this section:
I0624 06:09:45.533581 3006 solver.cpp:258] Train net output #0: loss-ft = 1.80221e+06 (* 1 = 1.80221e+06 loss)
I0624 06:09:45.533587 3006 solver.cpp:571] Iteration 900, lr = 1e-13
I0624 06:12:00.298842 3006 solver.cpp:242] Iteration 950, loss = 1.58426e+06
I0624 06:12:00.298898 3006 solver.cpp:258] Train net output #0: loss-ft = 1.58426e+06 (* 1 = 1.58426e+06 loss)
I0624 06:12:00.298905 3006 solver.cpp:571] Iteration 950, lr = 1e-13
I0624 06:14:12.962877 3006 solver.cpp:449] Snapshotting to binary proto file models/train_iter_1000.caffemodel
I0624 06:14:19.216183 3006 solver.cpp:734] Snapshotting solver state to binary proto filemodels/train_iter_1000.solverstate
I0624 06:14:25.575865 3006 solver.cpp:242] Iteration 1000, loss = 893369
I0624 06:14:25.575922 3006 solver.cpp:258] Train net output #0: loss-ft = 893369 (* 1 = 893369 loss)
I0624 06:14:25.575929 3006 solver.cpp:571] Iteration 1000, lr = 1e-13
I0624 06:16:41.886140 3006 solver.cpp:242] Iteration 1050, loss = 4.73511e+06
I0624 06:16:41.886204 3006 solver.cpp:258] Train net output #0: loss-ft = 4.73511e+06 (* 1 = 4.73511e+06 loss)
I0624 06:16:41.886211 3006 solver.cpp:571] Iteration 1050, lr = 1e-13
I0624 06:18:51.861776 3006 solver.cpp:242] Iteration 1100, loss = 730141
I0624 06:18:51.861830 3006 solver.cpp:258] Train net output #0: loss-ft = 730141 (* 1 = 730141 loss)
I0624 06:18:51.861836 3006 solver.cpp:571] Iteration 1100, lr = 1e-13
I0624 06:21:15.362941 3006 solver.cpp:242] Iteration 1150, loss = 464482
I0624 06:21:15.363018 3006 solver.cpp:258] Train net output #0: loss-ft = 464482 (* 1 = 464482 loss)
I0624 06:21:15.363026 3006 solver.cpp:571] Iteration 1150, lr = 1e-13
I0624 06:23:37.966629 3006 solver.cpp:242] Iteration 1200, loss = 1.86339e+06
I0624 06:23:37.966707 3006 solver.cpp:258] Train net output #0: loss-ft = 1.86339e+06 (* 1 = 1.86339e+06 loss)
I0624 06:23:37.966717 3006 solver.cpp:571] Iteration 1200, lr = 1e-13
I0624 06:25:49.701017 3006 solver.cpp:242] Iteration 1250, loss = 1.27626e+06
I0624 06:25:49.701078 3006 solver.cpp:258] Train net output #0: loss-ft = 1.27626e+06 (* 1 = 1.27626e+06 loss)
I0624 06:25:49.701086 3006 solver.cpp:571] Iteration 1250, lr = 1e-13
I0624 06:27:53.459908 3006 solver.cpp:242] Iteration 1300, loss = 729330
I0624 06:27:53.459959 3006 solver.cpp:258] Train net output #0: loss-ft = 729330 (* 1 = 729330 loss)
I0624 06:27:53.459964 3006 solver.cpp:571] Iteration 1300, lr = 1e-13
I0624 06:29:15.116145 3006 solver.cpp:346] Iteration 1333, Testing net (#0)
F0624 06:29:15.116205 3006 syncedmem.hpp:30] Check failed: error == cudaSuccess (11 vs. 0) invalid argument
*** Check failure stack trace: ***
It says "Testing net (#0)" and then the check fails. What could be the reason for this? Where can I find the failure stack trace?
Another issue: I used the snapshot "train_iter_1000.caffemodel" to segment a few images, but all I got was a completely black image. I understand that a good level of segmentation might require more training, but getting a completely black image puzzled me. What could be the reason for this?
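A small debugging sketch for the all-black output: check whether the network truly predicts a single class everywhere, or whether the class-index image only looks black when viewed directly (indices 0..2 are nearly invisible as grayscale). The file names are hypothetical:

```python
# Sketch only: inspect the predicted class histogram and rescale for viewing.
import numpy as np
from PIL import Image

seg = np.array(Image.open('segmentation_output.png'))
classes, counts = np.unique(seg, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))

# If more than one class is present, the problem is visualization, not training:
# map the small class indices to a visible gray range.
if len(classes) > 1:
    vis = (seg.astype(np.float32) / seg.max() * 255).astype(np.uint8)
    Image.fromarray(vis).save('segmentation_vis.png')
```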