
The loss value is really big and drops slowly, is this normal? #4

Open
colorblank opened this issue Sep 17, 2020 · 7 comments

Comments

@colorblank

GPU: RTX 2080 Ti
PyTorch: 1.5
CUDA: 10.2
dataset: AMOT

loading configure: config_gpu4_amot.json========
{
"dataset_name": "AMOTD",
"dataset_path": "~/data/omot_partial_dataset",
"phase": "train",
"frame_max_input_num": 16,
"frame_sample_scale": 2,
"parameter_frame_scale": 0.25,
"random_select_frame": false,
"min_valid_node_rate": 0.15,
"num_classes": 2,
"cuda": true,
"frame_size": 168,
"pixel_mean": [57, 52, 50],
"num_motion_model_param": 12,
"video_fps": 30.0,
"image_width": 1920,
"image_height": 1080,
"label_map": {
"vehicle": 1
},
"replace_map": {
"vehicle": 1
},
"test": {
"resume": "./test_logs/weights/ssdt67650.pth",
"dataset_type": "train",
"batch_size": 1,
"num_workers": 1,
"lr_decay_per_epoch": [1, 30, 45, 50],
"base_net_weights": null,
"log_save_folder": "./logs/test_logs/logs",
"image_save_folder": "./logs/test_logs/images",
"weights_save_folder": "./logs/test_logs/weights",
"sequence_list": "./dataset/amot/sequence_list_town02_train_part.txt",
"save_weight_per_epoch": 5,
"start_epoch": 0,
"end_epoch": 55,
"tensorboard": true,
"port": 6006,
"momentum": 0.9,
"weight_decay": 5e-4,
"gamma": 0.1,
"send_images": true,
"log_iters": true,
"run_mode": "debug",
"debug_save_image": false,
"debug_save_feature_map": false,
"save_track_data": true,
"contrast_lower": 0.5,
"contrast_upper": 1.5,
"saturation_lower": 0.5,
"saturation_upper": 1.5,
"hue_delta": 18.0,
"brightness_delta": 32,
"max_expand_ratio": 1.1,
"detect_bkg_label": 0,
"detect_top_k": 300,
"detect_conf_thresh": 0.3,
"detect_nms_thresh": 0.3,
"detect_exist_thresh": 0.5,
"tracker_min_iou_thresh": 0.001,
"tracker_min_visibility": 0.4
},
"train": {
"resume": null,
"batch_size": 8,
"num_workers": 0,
"learning_rate": 1e-3,
"lr_decay_per_epoch": [30, 50, 70, 90],
"base_net_weights": "./weights/resnext-101-64f-kinetics.pth",
"log_save_folder": "./logs/train_logs/log",
"image_save_folder": "./logs/train_logs/image",
"weights_save_folder": "./logs/train_logs/weights",
"sequence_list": "./dataset/amot/sequence_list_town02_train_part.txt",
"save_weight_per_epoch": 0.2,
"start_epoch": 0,
"end_epoch": 200,
"tensorboard": true,
"port": 6006,
"momentum": 0.9,
"weight_decay": 5e-4,
"gamma": 0.1,
"send_images": true,
"log_iters": true,
"run_mode": "release",
"debug_save_image": false,
"debug_save_feature_map": false,
"contrast_lower": 0.5,
"contrast_upper": 1.5,
"saturation_lower": 0.5,
"saturation_upper": 1.5,
"hue_delta": 18.0,
"brightness_delta": 32,
"max_expand_ratio": 1.1,
"static_possiblity": 0.05,
"loss_overlap_thresh": 0.5,
"loss_background_label": 0,
"dataset_overlap_thresh": 0.75
},
"frame_work":{
"temporal_dims": [8, 4, 2, 1, 1, 1],
"channel_dims": [256, 512, 1024, 2048, 2048, 2048],
"feature_maps": [42, 21, 11, 6, 3, 2],
"steps": [4, 8, 16, 28, 56, 84],
"min_sizes": [4, 16, 32, 64, 108, 146],
"max_sizes": [16, 32, 64, 108, 146, 176],
"aspect_ratios": [[1.5, 2], [2, 3], [2, 3], [2], [2], [2]],
"boxes_scales": [[0.8333, 0.6667, 0.5, 0.4], [0.8333, 0.5], [0.8333, 0.5], [0.5], [], []],
"variance": [0.1, 0.2],
"branch_cnn": 3,
"clip": true
},
"base_net":{
"mode": "feature",
"model_name": "resnext",
"model_depth": 101,
"resnet_shortcut": "B",
"arch": "resnext-101"
}
}

reading: ~/data/omot_partial_dataset/train/Town02/Clear/50/Easy_Camera_8.avi: 80%|███████████████████████▏ | 4/5 [00:11<00:02, 3.00s/it]

Loading base network...
Loading base net weights into state dict...
Finish
Timer: 8.9696 sec.
iter 0, 3078 || epoch: 0.0000 || Loss: 1485.7682 || Saving weights, iter: 0
Timer: 1.1191 sec.
iter 10, 3078 || epoch: 0.0032 || Loss: 747.8365 || Timer: 1.3092 sec.
iter 20, 3078 || epoch: 0.0065 || Loss: 543.7720 || Timer: 0.9543 sec.
iter 30, 3078 || epoch: 0.0097 || Loss: 483.3675 || Timer: 1.3921 sec.
iter 40, 3078 || epoch: 0.0130 || Loss: 365.0275 || Timer: 0.8044 sec.
iter 50, 3078 || epoch: 0.0162 || Loss: 326.6543 || Timer: 0.9312 sec.
iter 60, 3078 || epoch: 0.0195 || Loss: 366.3070 || Timer: 1.2947 sec.
iter 70, 3078 || epoch: 0.0227 || Loss: 369.2639 || Timer: 1.1285 sec.
iter 80, 3078 || epoch: 0.0260 || Loss: 350.6907 || Timer: 0.8387 sec.
iter 90, 3078 || epoch: 0.0292 || Loss: 279.8749 || Timer: 0.9261 sec.
iter 100, 3078 || epoch: 0.0325 || Loss: 373.5823 || Timer: 0.9241 sec.
iter 110, 3078 || epoch: 0.0357 || Loss: 280.2119 || Timer: 1.1693 sec.
iter 120, 3078 || epoch: 0.0390 || Loss: 273.2498 || Timer: 1.5808 sec.
iter 130, 3078 || epoch: 0.0422 || Loss: 292.5098 || Timer: 0.8698 sec.
iter 140, 3078 || epoch: 0.0455 || Loss: 318.0992 || Timer: 1.0063 sec.
iter 150, 3078 || epoch: 0.0487 || Loss: 252.3025 || Timer: 0.9630 sec.
iter 160, 3078 || epoch: 0.0520 || Loss: 265.3047 || Timer: 1.4840 sec.
iter 170, 3078 || epoch: 0.0552 || Loss: 302.5854 || Timer: 0.9548 sec.
iter 180, 3078 || epoch: 0.0585 || Loss: 285.4070 || Timer: 0.9854 sec.
iter 190, 3078 || epoch: 0.0617 || Loss: 364.6956 || Timer: 4.4852 sec.
iter 200, 3078 || epoch: 0.0650 || Loss: 330.2629 || Timer: 1.0476 sec.
iter 210, 3078 || epoch: 0.0682 || Loss: 230.9167 || Timer: 0.9508 sec.
iter 220, 3078 || epoch: 0.0715 || Loss: 244.9536 || Timer: 0.8630 sec.
iter 230, 3078 || epoch: 0.0747 || Loss: 251.5319 || Timer: 0.8497 sec.
iter 240, 3078 || epoch: 0.0780 || Loss: 263.2910 || Timer: 0.8375 sec.
iter 250, 3078 || epoch: 0.0812 || Loss: 249.8669 || Timer: 0.9519 sec.
iter 260, 3078 || epoch: 0.0845 || Loss: 240.7322 || Timer: 1.1921 sec.
iter 270, 3078 || epoch: 0.0877 || Loss: 263.4745 || Timer: 1.1276 sec.
iter 280, 3078 || epoch: 0.0910 || Loss: 231.0480 || Timer: 1.2120 sec.
iter 290, 3078 || epoch: 0.0942 || Loss: 319.0433 || Timer: 0.7737 sec.
iter 300, 3078 || epoch: 0.0975 || Loss: 253.2475 || Timer: 1.1360 sec.
iter 310, 3078 || epoch: 0.1007 || Loss: 197.9579 || Timer: 0.9226 sec.
iter 320, 3078 || epoch: 0.1040 || Loss: 269.4134 || Timer: 0.9375 sec.

@shijieS
Owner

shijieS commented Sep 17, 2020

According to your log, training hasn't completed a single epoch yet, so this is normal.

@colorblank
Author

Thank you. I also find that data loading takes about 20 minutes per iteration, so the GPU is always waiting for the CPU to decode frames from the .avi files. Is there any way to speed up data loading other than adding more worker threads (the CPU is already running at almost full load)?

@shijieS
Owner

shijieS commented Sep 18, 2020

Yes, this is a problem. We use cv2.VideoCapture to load each frame, which takes a lot of time. One way to make it faster is to convert the videos to images in advance and load each frame with cv2.imread, but you will need to write some code yourself. The VideoCapture-based get_frame looks like this:

import cv2

@staticmethod
def get_frame(src, index):
    # Open the video, seek to the requested frame index, and read a single frame.
    vc = cv2.VideoCapture(src)
    vc.set(cv2.CAP_PROP_POS_FRAMES, index)
    ret, frame = vc.read()
    vc.release()
    del vc
    return ret, frame
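
As a quick usage sketch (assuming get_frame is copied out as a plain function and Easy_Camera_8.avi is a local copy of one AMOT clip, both placeholders rather than repository code):

# Hypothetical usage: read frame 120 from one clip
ret, frame = get_frame("Easy_Camera_8.avi", 120)
if ret:
    print(frame.shape)  # e.g. (1080, 1920, 3) for the 1920x1080 AMOT videos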

@colorblank
Author

OK. Let me try.

@colorblank
Author

I used ffmpeg to extract each frame of the AMOT set. The data is saved as follows:
├── Easy_Camera_0
│ ├── 00001.jpg
│ ├── 00002.jpg
......
│ ├── 04999.jpg
│ └── 05000.jpg
├── Easy_Camera_0.avi
├── Easy_Camera_1
│ ├── 00001.jpg
│ ├── 00002.jpg
......
│ ├── 04999.jpg
│ └── 05000.jpg
├── Easy_Camera_1.avi
├── Easy_Camera_5
│ ├── 00001.jpg
│ ├── 00002.jpg
......
│ ├── 04999.jpg
│ └── 05000.jpg
├── Easy_Camera_5.avi
├── Easy_Camera_7
│ ├── 00001.jpg
│ ├── 00002.jpg
......
│ ├── 04999.jpg
│ └── 05000.jpg
├── Easy_Camera_7.avi
├── Easy_Camera_8
│ ├── 00001.jpg
│ ├── 00002.jpg
......
│ ├── 04999.jpg
│ └── 05000.jpg
└── Easy_Camera_8.avi

The new version of get_frame looks like this:

[screenshot of the modified get_frame, which reads frames from the extracted JPEG directories]
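
Since the screenshot itself is not preserved here, a rough sketch of what such an image-based reader might look like, given the directory layout above (the exact code in the screenshot may differ, and the ffmpeg command in the comment is only an example):

import os
import cv2

@staticmethod
def get_frame(src, index):
    # 'src' is the original video path, e.g. Easy_Camera_8.avi; frames are assumed to
    # have been extracted next to it beforehand, for example with a command like
    # "ffmpeg -i Easy_Camera_8.avi Easy_Camera_8/%05d.jpg" (hypothetical).
    frame_dir = os.path.splitext(src)[0]
    frame_path = os.path.join(frame_dir, "{:05d}.jpg".format(index + 1))  # files start at 00001.jpg
    frame = cv2.imread(frame_path)
    return frame is not None, frame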

I ran this version with two RTX 2080 Ti GPUs, batch_size=16, num_workers=0. Each iteration takes about 70 seconds, but the GPUs still spend a long time waiting for the CPU.

Any further advice?

@shijieS
Owner

shijieS commented Sep 18, 2020

With your configuration, each iteration will load 128 frames. How about trying the following settings:

num_workers = 16,
batch_size = 4
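
For what it's worth, a minimal sketch of how those two values plug into a standard PyTorch DataLoader; the dummy TensorDataset below only stands in for the repository's actual AMOT dataset object (an assumption), shaped loosely after the config (16-frame clips at 168x168):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset of 32 dummy clips, only so the snippet runs on its own.
train_dataset = TensorDataset(torch.zeros(32, 3, 16, 168, 168))

train_loader = DataLoader(
    train_dataset,
    batch_size=4,     # fewer clips per iteration, so far fewer frames to decode at once
    num_workers=16,   # more worker processes preparing batches in parallel
    shuffle=True,
    pin_memory=True,  # optional: faster host-to-GPU copies
)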

@colorblank
Author

I think the data loading time is affected by other processes on the CPU. There are two models training on different GPUs, each with num_workers > 0, which may cause the long waiting time. With your settings, the iteration time varies from 5 seconds to 256 seconds.
