- main1
- LoopTest1
- LoopTest2
- single_threading1
Their logic is as follows:
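- main1 task:
  - Clear (or create) the work folder under /shared/ on HDFS.
  - Create new_itr.txt in the work folder and write the current loop count to it. If the run has already finished, write the string 'over' instead and exit.
  - Wait for report.txt, which contains the number of loops that have been completed.
  - Add the loop count read in the previous step to the number of loops already completed. If the total is greater than tmax, mark the run as finished. Return to the second step.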
{ "jobName": "test1-main-zmy-001_runable_fdasf", "image": "registry.cn-hangzhou.aliyuncs.com/lgy_sustech/olmp:v2", "virtualCluster": "default", "taskRoles": [ { "name": "main", "taskNumber": 1, "cpuNumber": 2, "memoryMB": 8192, "gpuNumber": 0, "command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python main1.py " } ] }
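The main1 steps above can be sketched as follows. This is a minimal illustration, not the code in main1.py: it uses a local directory as a stand-in for the /shared/work folder on HDFS, and the names `publish`, `wait_for_file`, and `run_main`, the polling interval, and the choice to delete each file after reading it are all assumptions.

```python
import os
import shutil
import time


def publish(path, text):
    """Write `text` to `path` atomically so a poller never reads a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)


def wait_for_file(path, poll_seconds=1.0):
    """Poll until `path` exists, then return its contents and delete it."""
    while not os.path.exists(path):
        time.sleep(poll_seconds)
    with open(path) as f:
        content = f.read().strip()
    os.remove(path)  # assumption: each file is a one-shot message
    return content


def run_main(work_dir, tmax, poll_seconds=1.0):
    """Drive the outer loop until more than `tmax` loops are done; return the total."""
    # Clear (or create) the work folder.
    shutil.rmtree(work_dir, ignore_errors=True)
    os.makedirs(work_dir)

    new_itr = os.path.join(work_dir, "new_itr.txt")
    report = os.path.join(work_dir, "report.txt")
    done = 0
    while True:
        if done > tmax:
            publish(new_itr, "over")  # tell the workers to stop
            return done
        publish(new_itr, str(done))
        # report.txt holds the number of loops the worker completed.
        done += int(wait_for_file(report, poll_seconds))
```

Writing to a temporary name and renaming it keeps a polling reader from seeing a half-written file; whether the real scripts guard against this is not stated in these notes.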
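- LoopTest1 task:
  - Train the neural network to obtain the original network data and accuracy, generate the initial solution, and write it to the work folder.
  - Wait for new_itr.txt to appear in the work folder.
  - If the file contains 'over', exit; otherwise continue.
  - Create ncs_start.txt in the work folder to invoke the inner loop, which computes the pruning solution for the neural network, then compress the network with that solution.
  - Write the number of loops completed to report.txt, then return to the second step.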
{ "jobName": "test1-LoopTest1-zmy-002_runable_fdafd", "image": "registry.cn-hangzhou.aliyuncs.com/lgy_sustech/olmp:v2", "virtualCluster": "default", "taskRoles": [ { "name": "main", "taskNumber": 1, "cpuNumber": 4, "memoryMB": 8192, "gpuNumber": 1, "command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python LoopTest1.py " } ] }
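The handshake LoopTest1 performs with main1 and with the inner loop can be sketched as below. This is an illustration under assumptions, not the contents of LoopTest1.py: a local directory stands in for the HDFS work folder, the initial training step is omitted, `compress` is a hypothetical callable that applies the solution and returns the number of loops run, and the sketch assumes LoopTest1 waits for the ncs_over.txt file that LoopTest2 creates, which the steps imply but do not state.

```python
import os
import time


def publish(path, text=""):
    """Create `path` atomically with the given text."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)


def consume(path, poll_seconds=1.0):
    """Poll until `path` exists, then return its contents and delete it."""
    while not os.path.exists(path):
        time.sleep(poll_seconds)
    with open(path) as f:
        text = f.read().strip()
    os.remove(path)
    return text


def run_outer_loop(work_dir, compress, poll_seconds=1.0):
    """Serve iterations until new_itr.txt says 'over'; return iterations served."""

    def p(name):
        return os.path.join(work_dir, name)

    served = 0
    while consume(p("new_itr.txt"), poll_seconds) != "over":
        publish(p("ncs_start.txt"))             # wake the inner loop
        consume(p("ncs_over.txt"), poll_seconds)  # assumed completion signal
        loops = compress()                       # hypothetical compression step
        publish(p("report.txt"), str(loops))     # report back to main1
        served += 1
    return served
```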
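- LoopTest2 task:
  - This loop starts working once it obtains ncs_start.txt.
  - Fetch the required data from the work folder, then initialize NCS and begin training.
  - Split the initialized solutions into n individual solutions and store each one in the work folder on HDFS as solution_XXXXXX.npy.
  - Wait for the results in the n files named fit_solutions_XXXXXXX.npy, merge them, and run NCS training.
  - Write the result to crates_list.npy, create ncs_over.txt, and return to the first step.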
{ "jobName": "test1-LoopTest2-zmy-003_runable_dfafda", "image": "registry.cn-hangzhou.aliyuncs.com/lgy_sustech/olmp:v2", "virtualCluster": "default", "taskRoles": [ { "name": "main", "taskNumber": 1, "cpuNumber": 4, "memoryMB": 8192, "gpuNumber": 0, "command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python LoopTest2.py " } ] }
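The split-and-merge part of LoopTest2 might look like the sketch below. It is not taken from LoopTest2.py: a local directory stands in for the HDFS work folder, `scatter` and `gather` are hypothetical names, the zero-padded six-digit index is only a guess at what XXXXXX stands for, and the result prefix is a parameter because these notes spell it both fit_solutions_ and fit_solution_. The `save` and `load` callables are injected; `numpy.save` and `numpy.load` fit these call shapes.

```python
import os
import time


def scatter(work_dir, solutions, save):
    """Write each solution to its own file; return the indices written."""
    for i, solution in enumerate(solutions):
        final = os.path.join(work_dir, f"solution_{i:06d}.npy")
        # Write under a name the workers do not match, then rename atomically.
        tmp = os.path.join(work_dir, f"tmp_solution_{i:06d}.npy")
        save(tmp, solution)
        os.replace(tmp, final)
    return list(range(len(solutions)))


def gather(work_dir, indices, load, prefix="fit_solutions_", poll_seconds=1.0):
    """Wait for one result file per index; return the results in index order."""
    results = []
    for i in indices:
        path = os.path.join(work_dir, f"{prefix}{i:06d}.npy")
        while not os.path.exists(path):
            time.sleep(poll_seconds)
        results.append(load(path))
        os.remove(path)  # assumption: results are consumed once merged
    return results
```

Collecting results by index rather than by arrival order keeps the merged fitness list aligned with the solutions that produced it.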
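- single_threading1 task:
  - Wait for a solution_XXXXXX.npy file.
  - Once such a file is found, compute its fitness value, write the result to fit_solution_XXXXXXX.npy, and delete the original solution file.
  - Return to the first step.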
{ "jobName": "test1-single_evaluate-zmy-004_runable", "image": "registry.cn-hangzhou.aliyuncs.com/lgy_sustech/olmp:v2", "virtualCluster": "default", "taskRoles": [ { "name": "main", "taskNumber": 1, "cpuNumber": 4, "memoryMB": 8192, "gpuNumber": 1, "command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python single_threading1.py " } ] }
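One pass of the evaluation worker can be sketched as follows. This is an assumption-laden illustration rather than the code in single_threading1.py: a local directory replaces the HDFS work folder, `evaluate_pending` is a hypothetical name, `fitness` is whatever function scores a solution, and the output file reuses the numeric suffix of the input file. As with the other sketches, `load` and `save` are injected so that `numpy.load` and `numpy.save` can be passed in.

```python
import glob
import os


def evaluate_pending(work_dir, load, save, fitness, prefix="fit_solution_"):
    """Evaluate every waiting solution file once; return how many were handled."""
    handled = 0
    pattern = os.path.join(work_dir, "solution_*.npy")
    for path in sorted(glob.glob(pattern)):
        suffix = os.path.basename(path)[len("solution_"):]
        out = os.path.join(work_dir, prefix + suffix)
        tmp = os.path.join(work_dir, "tmp_" + prefix + suffix)
        save(tmp, fitness(load(path)))  # write under a temporary name first
        os.replace(tmp, out)            # then publish the result atomically
        os.remove(path)                 # delete the original solution file
        handled += 1
    return handled
```

Calling this function in a loop with a short sleep between passes reproduces the "return to the first step" behavior. Running several workers against the same folder would need some form of file claiming, which this sketch does not attempt.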
{
"jobName": "demo_test_16af8e48",
"image": "registry.cn-hangzhou.aliyuncs.com/lgy_sustech/olmp:v2",
"virtualCluster": "default",
"taskRoles": [
{
"name": "main",
"taskNumber": 1,
"cpuNumber": 2,
"memoryMB": 8192,
"gpuNumber": 0,
"command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python main1.py "
},
{
"name": "looptest1",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 8192,
"gpuNumber": 1,
"command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python LoopTest1.py "
},
{
"name": "looptest2",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 8192,
"gpuNumber": 0,
"command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python LoopTest2.py "
},
{
"name": "single_threading",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 8192,
"gpuNumber": 1,
"command": "git clone https://github.com/mythezone/OpenPAI_zmy.git && cd OpenPAI_zmy&&chmod 777 * && python single_threading1.py "
}
]
}
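The four task roles in the combined job above differ only in name, script, and CPU/GPU counts, so the definition can be generated instead of edited by hand. The sketch below copies its values from that JSON; `task_role` and `build_job` are hypothetical helper names, and the spacing around `&&` in the command is normalized.

```python
import json

REPO = "https://github.com/mythezone/OpenPAI_zmy.git"
IMAGE = "registry.cn-hangzhou.aliyuncs.com/lgy_sustech/olmp:v2"


def task_role(name, script, cpu, gpu, memory_mb=8192):
    """Build one taskRoles entry that clones the repo and runs `script`."""
    command = (
        f"git clone {REPO} && cd OpenPAI_zmy && chmod 777 * && python {script}"
    )
    return {
        "name": name,
        "taskNumber": 1,
        "cpuNumber": cpu,
        "memoryMB": memory_mb,
        "gpuNumber": gpu,
        "command": command,
    }


def build_job(job_name):
    """Assemble the four-role job definition as a plain dict."""
    return {
        "jobName": job_name,
        "image": IMAGE,
        "virtualCluster": "default",
        "taskRoles": [
            task_role("main", "main1.py", cpu=2, gpu=0),
            task_role("looptest1", "LoopTest1.py", cpu=4, gpu=1),
            task_role("looptest2", "LoopTest2.py", cpu=4, gpu=0),
            task_role("single_threading", "single_threading1.py", cpu=4, gpu=1),
        ],
    }
```

`json.dumps(build_job("demo_test_16af8e48"), indent=2)` then yields text that can be pasted into the job submission form.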
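- Except for the main1 job, none of the jobs stop automatically once the task completes; they keep waiting and can start computing as soon as the next task arrives. If new data or a new model is to be loaded, this behavior may need to be adjusted.
- After the task completes, tmp_model.caffemodel, found both in the cloned folder and in the /shared/work folder on HDFS, is the final generated model.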