How long does training usually take? #20
Comments
You're running 500 epochs... At 50 epochs an hour, that's about a minute per epoch. Is that really so long?
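For reference, the arithmetic behind that estimate can be checked in a few lines. The figures (roughly 10 hours, 500 epochs, 321 images, batch size 8) come from the original post; the variable names are only illustrative, and the step count assumes all 321 images are used for training with the last partial batch kept.

```python
import math

total_hours = 10   # approximate training time reported in the issue
epochs = 500
num_images = 321   # assumption: every image is in the training split
batch_size = 8

epochs_per_hour = epochs / total_hours
seconds_per_epoch = total_hours * 3600 / epochs
steps_per_epoch = math.ceil(num_images / batch_size)  # last partial batch kept
seconds_per_step = seconds_per_epoch / steps_per_epoch

print(f"{epochs_per_hour:.0f} epochs/hour, {seconds_per_epoch:.0f} s/epoch")
print(f"{steps_per_epoch} steps/epoch, {seconds_per_step:.2f} s/step")
```

Under these assumptions each epoch is only a few dozen steps, so per-step overhead can account for a large share of the total time.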
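It's mainly that I saw someone else train on a single GPU, also for 500 epochs, and they finished in 5 hours, which made me nervous. On top of that, I can't open any other software while this is running or I get out of memory, and the loss keeps going up and down. Is there a way to save checkpoints during training, so I can stop at a point where the loss is relatively low?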
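One way to approach "save while running, keep the lowest loss" is to track the best loss seen so far and save only when it improves. The sketch below is framework-agnostic and not taken from this repository: `BestCheckpoint` and `save_fn` are hypothetical names, and the loss values are made up. If the training script is built on Keras, the built-in `ModelCheckpoint` callback with `save_best_only=True` serves the same purpose.

```python
import math


class BestCheckpoint:
    """Call save_fn whenever the monitored loss reaches a new minimum."""

    def __init__(self, save_fn, min_delta=0.0):
        self.save_fn = save_fn      # hypothetical callable: save_fn(epoch, loss)
        self.min_delta = min_delta  # improvement required before saving again
        self.best_loss = math.inf
        self.best_epoch = None

    def update(self, epoch, loss):
        """Return True if this epoch was saved as the new best."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.best_epoch = epoch
            self.save_fn(epoch, loss)
            return True
        return False


# Usage with made-up loss values that rise and fall between epochs.
saved = []
ckpt = BestCheckpoint(save_fn=lambda epoch, loss: saved.append((epoch, loss)))
for epoch, loss in enumerate([2.1, 1.4, 1.9, 1.2, 1.6, 1.3], start=1):
    ckpt.update(epoch, loss)

print(saved)            # epochs at which a checkpoint would have been written
print(ckpt.best_epoch)  # epoch with the lowest loss so far
```

In a real run, `save_fn` would write the model weights to disk. With this in place, an interrupted job still leaves the best checkpoint behind, so training does not have to restart from scratch.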
1. More GPUs are not necessarily faster than fewer GPUs.
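A related point: the device log quoted in the issue only shows that TensorFlow created a device on each of the four GPUs. By default TensorFlow claims memory on every GPU it can see, which by itself does not mean the model is being trained across all of them. On a shared workstation, one option is to expose a single GPU to the process with the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch; the index `"0"` is only an example, and whether it suits this repository's training script is an assumption.

```python
import os

# Must run before TensorFlow (or any other CUDA library) is imported,
# because the variable is read when the CUDA runtime initialises.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example index; pick whichever GPU is free

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"GPUs exposed to this process: {visible}")
```

This leaves the other GPUs free for other users, which may also reduce the out-of-memory interruptions described above.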
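Thanks for the reply. Training has finished and the model was saved. (The earlier problem was that the workstation isn't mine alone: whenever someone else ran a job, my code would report insufficient GPU memory and stop, so I kept having to start over.) But at test time not a single bounding box is output, orz. What usually causes this? (I have already fixed the paths.)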
OK, thanks, I'll go check.
I'm training on my own dataset: 321 images in total, epoch=500, batch_size=8 (10 gives out of memory).
2020-09-30 09:26:59.030397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9484 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:07:00.0, compute capability: 6.1)
2020-09-30 09:26:59.033584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9484 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-09-30 09:26:59.036737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 9484 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:89:00.0, compute capability: 6.1)
2020-09-30 09:26:59.039634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 9484 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:8a:00.0, compute capability: 6.1)
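This log should mean all four GPUs are in use, right? So why does it take me around 10 hours to finish training?
Also, the loss rises and falls periodically during training.
What could be causing this? Looking forward to your reply, thanks!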