Training stopped abruptly due to testing??? #41

Open · prakarshupmanyu opened this issue Jun 24, 2017 · 7 comments
@prakarshupmanyu commented Jun 24, 2017

I was training my networks just as described in the README file. I did encounter some issues, but I was able to solve them by exploring the "Issues" section here. However, I cannot find the reason for the following one:
During training for 3 classes, after the first snapshot was created at iteration 1000, training stopped suddenly at iteration 1333. I have attached the train.log file for anyone to have a look.

train.txt

I am talking about this section:

I0624 06:09:45.533581 3006 solver.cpp:258] Train net output #0: loss-ft = 1.80221e+06 (* 1 = 1.80221e+06 loss)
I0624 06:09:45.533587 3006 solver.cpp:571] Iteration 900, lr = 1e-13
I0624 06:12:00.298842 3006 solver.cpp:242] Iteration 950, loss = 1.58426e+06
I0624 06:12:00.298898 3006 solver.cpp:258] Train net output #0: loss-ft = 1.58426e+06 (* 1 = 1.58426e+06 loss)
I0624 06:12:00.298905 3006 solver.cpp:571] Iteration 950, lr = 1e-13
I0624 06:14:12.962877 3006 solver.cpp:449] Snapshotting to binary proto file models/train_iter_1000.caffemodel
I0624 06:14:19.216183 3006 solver.cpp:734] Snapshotting solver state to binary proto filemodels/train_iter_1000.solverstate
I0624 06:14:25.575865 3006 solver.cpp:242] Iteration 1000, loss = 893369
I0624 06:14:25.575922 3006 solver.cpp:258] Train net output #0: loss-ft = 893369 (* 1 = 893369 loss)
I0624 06:14:25.575929 3006 solver.cpp:571] Iteration 1000, lr = 1e-13
I0624 06:16:41.886140 3006 solver.cpp:242] Iteration 1050, loss = 4.73511e+06
I0624 06:16:41.886204 3006 solver.cpp:258] Train net output #0: loss-ft = 4.73511e+06 (* 1 = 4.73511e+06 loss)
I0624 06:16:41.886211 3006 solver.cpp:571] Iteration 1050, lr = 1e-13
I0624 06:18:51.861776 3006 solver.cpp:242] Iteration 1100, loss = 730141
I0624 06:18:51.861830 3006 solver.cpp:258] Train net output #0: loss-ft = 730141 (* 1 = 730141 loss)
I0624 06:18:51.861836 3006 solver.cpp:571] Iteration 1100, lr = 1e-13
I0624 06:21:15.362941 3006 solver.cpp:242] Iteration 1150, loss = 464482
I0624 06:21:15.363018 3006 solver.cpp:258] Train net output #0: loss-ft = 464482 (* 1 = 464482 loss)
I0624 06:21:15.363026 3006 solver.cpp:571] Iteration 1150, lr = 1e-13
I0624 06:23:37.966629 3006 solver.cpp:242] Iteration 1200, loss = 1.86339e+06
I0624 06:23:37.966707 3006 solver.cpp:258] Train net output #0: loss-ft = 1.86339e+06 (* 1 = 1.86339e+06 loss)
I0624 06:23:37.966717 3006 solver.cpp:571] Iteration 1200, lr = 1e-13
I0624 06:25:49.701017 3006 solver.cpp:242] Iteration 1250, loss = 1.27626e+06
I0624 06:25:49.701078 3006 solver.cpp:258] Train net output #0: loss-ft = 1.27626e+06 (* 1 = 1.27626e+06 loss)
I0624 06:25:49.701086 3006 solver.cpp:571] Iteration 1250, lr = 1e-13
I0624 06:27:53.459908 3006 solver.cpp:242] Iteration 1300, loss = 729330
I0624 06:27:53.459959 3006 solver.cpp:258] Train net output #0: loss-ft = 729330 (* 1 = 729330 loss)
I0624 06:27:53.459964 3006 solver.cpp:571] Iteration 1300, lr = 1e-13
I0624 06:29:15.116145 3006 solver.cpp:346] Iteration 1333, Testing net (#0)
F0624 06:29:15.116205 3006 syncedmem.hpp:30] Check failed: error == cudaSuccess (11 vs. 0) invalid argument
*** Check failure stack trace: ***

It says "Testing net (#0)" and then the check fails. What could be the reason for this? And where can I find the failure stack trace?

Another issue: I used the snapshot "train_iter_1000.caffemodel" to segment a few images, but all I got was a completely black image. I understand that good segmentation might require more training, but getting a completely black image puzzled me. What could be the reason for this?
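
For anyone debugging the same thing, here is a minimal sketch of how one could check whether the snapshot is simply predicting the background class for every pixel, which would render as an all-black image. This is my own sketch, not the repo's code: the prototxt and image paths are placeholders, mean subtraction and padding are omitted, and the 'data' / 'pred-ft' blob names are assumptions to adjust to your own deploy prototxt.

import numpy as np
import caffe
from PIL import Image

# Placeholder paths -- adjust to your own deploy prototxt and test image.
net = caffe.Net('crfrnn_deploy.prototxt',
                'models/train_iter_1000.caffemodel', caffe.TEST)

img = np.array(Image.open('test.png').convert('RGB'), dtype=np.float32)
img = img[:, :, ::-1].transpose(2, 0, 1)      # RGB -> BGR, HWC -> CHW

net.blobs['data'].reshape(1, *img.shape)
net.blobs['data'].data[...] = img
out = net.forward()

scores = out['pred-ft'][0]                    # (num_classes, H, W) class scores
pred = scores.argmax(axis=0)                  # per-pixel predicted class index

# If every pixel comes out as class 0 (background), the colorized result is all black.
classes, counts = np.unique(pred, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))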

@prakarshupmanyu (Author)

Did some digging. I ran test_model.py (which was throwing the error) and this is the stack trace:

I0624 21:20:23.632242 3334 net.cpp:241] conv2_1 does not need backward computation.
I0624 21:20:23.632249 3334 net.cpp:241] pool1 does not need backward computation.
I0624 21:20:23.632256 3334 net.cpp:241] relu1_2 does not need backward computation.
I0624 21:20:23.632261 3334 net.cpp:241] conv1_2 does not need backward computation.
I0624 21:20:23.632266 3334 net.cpp:241] relu1_1 does not need backward computation.
I0624 21:20:23.632272 3334 net.cpp:241] conv1_1 does not need backward computation.
I0624 21:20:23.632279 3334 net.cpp:241] data_input_0_split does not need backward computation.
I0624 21:20:23.632298 3334 net.cpp:284] This network produces output pred-ft
I0624 21:20:23.632330 3334 net.cpp:298] Network initialization done.
I0624 21:20:23.632336 3334 net.cpp:299] Memory required for data: 1194961088
[libprotobuf INFO google/protobuf/io/coded_stream.cc:610] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 537143453
Traceback (most recent call last):
File "test_model.py", line 106, in
main()
File "test_model.py", line 25, in main
test_net(prototxt, caffemodel.format(iteration_num), images, labels, lut)
File "test_model.py", line 75, in test_net
pa = pixel_accuracy(pred, label)
File "/home/prakarsh_upmanyu23/train-CRF-RNN/py_img_seg_eval/eval_segm.py", line 21, in pixel_accuracy
eval_mask, gt_mask = extract_both_masks(eval_segm, gt_segm, cl, n_cl)
File "/home/prakarsh_upmanyu23/train-CRF-RNN/py_img_seg_eval/eval_segm.py", line 132, in extract_both_masks
gt_mask = extract_masks(gt_segm, cl, n_cl)
File "/home/prakarsh_upmanyu23/train-CRF-RNN/py_img_seg_eval/eval_segm.py", line 156, in extract_masks
masks[i, :, :] = segm == c
ValueError: could not broadcast input array from shape (333,500,3) into shape (333,500)
F0624 21:20:25.663760 3334 syncedmem.hpp:30] Check failed: error == cudaSuccess (11 vs. 0) invalid argument
*** Check failure stack trace: ***
Aborted (core dumped)

Does anyone have any idea what is happening?

@prakarshupmanyu (Author)

@martinkersner In the py_img_seg_eval/eval_segm.py file, the function extract_masks() is called once on the predicted segmentation and once on the ground-truth label. But the shapes of these two are different, which is the reason for the error above: the ground-truth label is 3-dimensional while the prediction is 2-dimensional. Could you explain the reason for this?

I think this is somehow related to the completely black output that I am getting.

How can one test their model with this code?
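
As a stopgap, something like this could collapse the ground-truth label to the 2-D class-index map that eval_segm.py expects. This is just a sketch; it assumes the label's three channels are identical copies of a grayscale mask, and if the labels instead use distinct RGB colors they would need to be mapped to class ids through the LUT.

import numpy as np

def to_2d_label(label):
    # Collapse an (H, W, 3) label whose channels are identical copies of a
    # grayscale mask down to the (H, W) map that eval_segm.py expects.
    if label.ndim == 3:
        assert (label[:, :, 0] == label[:, :, 1]).all() and \
               (label[:, :, 1] == label[:, :, 2]).all(), \
            'label uses distinct RGB colors; map them to class ids via the LUT instead'
        label = label[:, :, 0]
    return label

# e.g. in test_model.py, right before the failing call:
# pa = pixel_accuracy(pred, to_2d_label(label))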

@martinkersner (Owner)

@prakarshupmanyu I am not totally sure, but maybe you didn't prepare the labels correctly. Labels should be 1-channel images, not 3-channel images.

Also, I would recommend you use https://github.com/martinkersner/train-DeepLab since I got better results with that network.
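
A quick way to check is something like this (just a sketch; labels/*.png is a placeholder for wherever your label images live):

import glob
from PIL import Image

for path in glob.glob('labels/*.png'):        # placeholder label directory
    img = Image.open(path)
    if img.mode not in ('L', 'P'):
        print('%s is mode %s (%d channels), expected a 1-channel label image'
              % (path, img.mode, len(img.getbands())))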

@prakarshupmanyu (Author)

Thanks for the prompt reply.

I followed the steps mentioned in the README. The directory "train_labels_3_lmdb/" was created as per the documentation.
The prediction by the network is 2D and runs through the code in test_model.py, but while processing the ground-truth label file it throws an error because the label is 3D. Can you think of anything that might have gone wrong?

I'll look into it further, also try train-DeepLab, and get back to you.

@prakarshupmanyu (Author)

@martinkersner I tried the same thing. I had to remove the testing part from solver.prototxt to make it work, but I was still getting a "completely black" output. I think it has something to do with my training data, which are grayscale MRI images with a black background.
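
To see whether the black output is just the net collapsing to the dominant background class, here is a rough sketch that counts class frequencies in the label LMDB. It assumes train_labels_3_lmdb/ stores the labels as Caffe Datum records, which is my understanding of the README's conversion step.

import lmdb
import numpy as np
import caffe
from caffe.proto import caffe_pb2

counts = {}
env = lmdb.open('train_labels_3_lmdb', readonly=True)
with env.begin() as txn:
    for _, value in txn.cursor():
        datum = caffe_pb2.Datum()
        datum.ParseFromString(value)
        label = caffe.io.datum_to_array(datum)    # shape (1, H, W)
        ids, n = np.unique(label, return_counts=True)
        for i, c in zip(ids, n):
            counts[int(i)] = counts.get(int(i), 0) + int(c)

# If class 0 overwhelmingly dominates, an undertrained net can end up
# predicting background everywhere, which renders as a black image.
print(counts)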

@ThienAnh

@prakarshupmanyu Did you fix the problem: F0624 06:29:15.116205 3006 syncedmem.hpp:30] Check failed: error == cudaSuccess (11 vs. 0) invalid argument?

@prakarshupmanyu (Author)

@ThienAnh If I remember correctly, my test image dimensions were not compatible with the code given here, so I had to make some changes to the code in order to get it running.
But I ended up using my own program because I was still getting a "black output".
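
For anyone facing the same dimension issue, a sketch of one way to fit an arbitrary test image into a fixed input size. The 500x500 size is an assumption based on the original CRF-RNN demo, so check the input dimensions in your own deploy prototxt.

import numpy as np
from PIL import Image

def fit_to_input(path, size=500):
    # Shrink so the longer side fits inside size x size, then zero-pad the rest.
    img = Image.open(path).convert('RGB')
    img.thumbnail((size, size), Image.ANTIALIAS)
    w, h = img.size
    padded = np.zeros((size, size, 3), dtype=np.uint8)
    padded[:h, :w, :] = np.array(img)
    # Return the original extent so the prediction can be cropped back afterwards.
    return padded, (h, w)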
