optimizer.backward: with gradient accumulation, GPU memory usage grows in proportion to the number of accumulation steps, making it impossible to train GPT2-XL #602

Open
peter-ni-noob opened this issue Nov 21, 2024 · 0 comments


Describe the bug

When gradient accumulation is used with optimizer.backward, GPU memory usage grows in proportion to the number of accumulation steps, so device memory runs out and GPT2-XL cannot be trained.
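A distilled sketch of the accumulation pattern, using the same names as the train() function in the reproduce script further below (shown here only to illustrate what triggers the growth):

# Sketch of the accumulation loop from the reproduce script below:
# every micro-batch calls optimizer.backward(loss) to accumulate gradients,
# and optimizer.step() runs once every acc_step micro-batches.
for step, (inputs, label) in enumerate(train_loader):
    output = model(inputs)
    loss = loss_function(output.view(-1, output.size(-1)), label.view(-1)) / acc_step
    optimizer.backward(loss)          # expected: memory stays flat across micro-batches
    if (step + 1) % acc_step == 0:
        optimizer.clip_grad_norm(1.0)
        optimizer.step()              # observed: device memory grows with acc_step until OOM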

Full Log

[i 1121 22:05:25.126489 72 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:25.131258 72 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:25.131395 72 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:25.137071 72 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:25.184163 72 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:25.189004 72 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:25.794673 72 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:26.114295 72 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:26.219275 72 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:26.658102 72 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:26.747346 72 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:26.822027 60 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:26.826967 60 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:26.827097 60 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:26.833472 60 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:26.878797 60 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:26.884155 60 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:27.047531 60 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:27.362661 60 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:27.466261 60 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:27.653196 60 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:27.743503 60 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:27.817547 44 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:27.822749 44 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:27.822873 44 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:27.829518 44 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:27.876104 44 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:27.881783 44 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:28.054322 44 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:28.371426 44 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:28.475162 44 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:28.646488 44 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:28.744322 44 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:28.828598 24 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:28.834321 24 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:28.834480 24 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:28.841741 24 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:28.888582 24 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:28.894028 24 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:29.070631 24 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:29.386292 24 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:29.490768 24 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:29.679001 24 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:29.769026 24 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:29.841810 80 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:29.846868 80 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:29.847024 80 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:29.853689 80 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:29.900384 80 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:29.905803 80 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:30.086972 80 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:30.397061 80 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:30.499719 80 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:30.702439 80 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:30.792454 80 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:30.868212 04 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:30.872909 04 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:30.873038 04 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:30.879150 04 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:30.924232 04 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:30.929277 04 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:31.077949 04 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:31.391125 04 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:31.494917 04 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:31.692292 04 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:31.782711 04 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:31.862101 00 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:31.867603 00 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:31.867744 00 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:31.874701 00 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:31.922792 00 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:31.928394 00 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:32.100872 00 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:32.417552 00 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:32.522106 00 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:32.713764 00 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:32.802922 00 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[i 1121 22:05:32.876889 88 compiler.py:956] Jittor(1.3.9.12) src: /opt/miniconda3/envs/jittor/lib/python3.9/site-packages/jittor
[i 1121 22:05:32.882065 88 compiler.py:957] g++ at /usr/bin/g++(9.4.0)
[i 1121 22:05:32.882195 88 compiler.py:958] cache_path: /root/.cache/jittor/jt1.3.9/g++9.4.0/py3.9.20/Linux-5.15.0-6x3c/AMDEPYC776364-xec/8775/default
[i 1121 22:05:32.887932 88 init.py:412] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 1121 22:05:32.940282 88 init.py:412] Found gdb(20.04.2) at /usr/bin/gdb.
[i 1121 22:05:32.945858 88 init.py:412] Found addr2line(2.34) at /usr/bin/addr2line.
[i 1121 22:05:33.094107 88 compiler.py:1013] cuda key:cu11.7.99_sm_80
[i 1121 22:05:33.406601 88 init.py:227] Total mem: 1007.73GB, using 16 procs for compiling.
[i 1121 22:05:33.508741 88 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 1121 22:05:33.721979 88 init.cc:63] Found cuda archs: [80,]
[i 1121 22:05:33.811356 88 init.py:412] Found mpicc(4.0.0) at /usr/local/bin/mpicc.
[w 1121 22:05:36.626897 72 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:36.716219 44 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:37.031659 88 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:37.119814 00 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:37.162955 60 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:37.208066 04 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:37.253027 80 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[w 1121 22:05:37.299054 24 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH, This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[i 1121 22:05:38.515927 72 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:38.784805 44 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:38.999090 88 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:39.105256 80 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:39.108989 00 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:39.109386 04 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:39.109614 60 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:39.137403 24 cuda_flags.cc:49] CUDA enabled.
[i 1121 22:05:42.979075 00 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 372MB(0.463%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.41GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(28.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.006535 04 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 372MB(0.463%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.41GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(28.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.042414 60 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 372MB(0.463%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.41GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(28.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.044648 00 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.044784 00 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.044890 00 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.4GB
[i 1121 22:05:43.055991 88 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61799 lived_ops: 55684
name: sfrl is_device: 1 used: 78.14GB(99.5%) unused: 371MB(0.462%) ULB: 13.75MB ULBO: 20MB total: 78.5GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.5GB gpu: 78.5GB cpu: 4MB
free: cpu(739.2GB) gpu(24.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.060909 72 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 48379 lived_ops: 43318
name: sfrl is_device: 1 used: 75.23GB(99.5%) unused: 367.1MB(0.474%) ULB: 12.5MB ULBO: 100MB total: 75.59GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 2.417MB(80.6%) unused: 596.5KB(19.4%) total: 3MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 75.6GB gpu: 75.59GB cpu: 3MB
free: cpu(739.2GB) gpu(77.06MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.077086 04 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.077168 04 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.077206 04 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.4GB
[i 1121 22:05:43.083065 80 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 372MB(0.463%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.41GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(28.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.085173 72 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 48379 lived_ops: 43318
name: sfrl is_device: 1 used: 75.23GB(99.5%) unused: 365.1MB(0.472%) ULB: 12.5MB ULBO: 100MB total: 75.59GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 2.417MB(80.6%) unused: 596.5KB(19.4%) total: 3MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 75.59GB gpu: 75.59GB cpu: 3MB
free: cpu(739.2GB) gpu(79.06MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.085253 72 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 48379 lived_ops: 43318
name: sfrl is_device: 1 used: 75.23GB(99.5%) unused: 365.1MB(0.472%) ULB: 12.5MB ULBO: 100MB total: 75.59GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 2.417MB(80.6%) unused: 596.5KB(19.4%) total: 3MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 75.59GB gpu: 75.59GB cpu: 3MB
free: cpu(739.2GB) gpu(79.06MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.085287 72 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 75.59GB
[i 1121 22:05:43.093305 24 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 372MB(0.463%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.41GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(28.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.114380 60 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.114482 60 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.114551 60 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.4GB
[i 1121 22:05:43.117766 44 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 372MB(0.463%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.41GB gpu: 78.4GB cpu: 4MB
free: cpu(739.2GB) gpu(28.69MB)
swap: total( 0 B) last( 0 B)

[i 1121 22:05:43.125038 88 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61799 lived_ops: 55684
name: sfrl is_device: 1 used: 78.14GB(99.5%) unused: 370MB(0.46%) ULB: 13.75MB ULBO: 20MB total: 78.5GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.5GB gpu: 78.5GB cpu: 4MB
free: cpu(739.3GB) gpu(24.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.125116 88 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61799 lived_ops: 55684
name: sfrl is_device: 1 used: 78.14GB(99.5%) unused: 370MB(0.46%) ULB: 13.75MB ULBO: 20MB total: 78.5GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.5GB gpu: 78.5GB cpu: 4MB
free: cpu(739.3GB) gpu(24.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.125153 88 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.5GB
[i 1121 22:05:43.163479 80 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.4GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.163680 80 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.4GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.163780 80 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.4GB
[i 1121 22:05:43.176540 24 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.4GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.176716 24 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.4GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.176779 24 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.4GB
[i 1121 22:05:43.198657 44 cuda_device_allocator.cc:30]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.6GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

[e 1121 22:05:43.198745 44 executor.cc:682]
=== display_memory_info ===
total_cpu_ram: 1008GB total_device_ram: 79.33GB
hold_vars: 2564 lived_vars: 61801 lived_ops: 55684
name: sfrl is_device: 1 used: 78.04GB(99.5%) unused: 370MB(0.461%) ULB: 13.75MB ULBO: 20MB total: 78.4GB
name: sfrl is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_device: 0 used: 3.152MB(78.8%) unused: 868.5KB(21.2%) total: 4MB
name: sfrl is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 78.4GB gpu: 78.4GB cpu: 4MB
free: cpu(739.6GB) gpu(30.69MB)
swap: total( 0 B) last( 0 B)

Traceback (most recent call last):
File "/root/workspace/Jittor/model_dis.py", line 479, in <module>
train(model,dataloader,loss_function,optimizer,acc_step)
File "/root/workspace/Jittor/model_dis.py", line 459, in train
loss_acc+=loss.item()*acc_step
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.item)).

Types of your inputs are:
self = Var,
args = (),

The function declarations are:
ItemData item()

Failed reason: [f 1121 22:05:43.198783 44 mem_info.cc:272]


GPU memory is overflow, please reduce your batch_size or data size!
Total: 79.33GB Used: 78.4GB

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[19811,1],6]
Exit code: 1


Minimal Reproduce

import math
import os
from typing import Optional, Tuple, Union, List, Set, OrderedDict, Dict, Any
import warnings
import jittor as jt
from jittor import nn
from dataclasses import dataclass
import numpy as np
from dataset import genDataloader,genDataloader_dis
import lr_scheduler

jt.flags.use_cuda = 1

@dataclass
class GPT2config:
    n_positions:int =1024
    hidden_size:int =1600
    num_attention_heads:int =25
    vocab_size:int =50257
    max_position_embeddings:int =1024
    embd_pdrop:float =0.0
    num_hidden_layers:int =48
    activation_function:str="gelu_new"

    scale_attn_weights:bool=True
    scale_attn_by_inverse_layer_idx:bool=False
    reorder_and_upcast_attn:bool=False
    attn_pdrop:float=0.0
    resid_pdrop:float=0.0
    n_inner:object=None
    initializer_range:float=0.02
    layer_norm_epsilon:float=1e-5
    use_cache:bool=False
    add_cross_attention:bool=False

@dataclass
class OptimizerConfig:
    lr:float=1.5e-4
    wd:float=0.1

class Conv1D(nn.Module):
    def __init__(self, nf, nx):
        super().__init__()
        self.nf = nf
        self.weight = jt.normal(0,0.02,size=(nx, nf))
        self.bias = jt.zeros(nf)

    def execute(self, x):
        shape_out = x.shape[:-1] + (self.nf,)
        x = jt.matmul(x.reshape(-1, x.shape[-1]), self.weight) + self.bias
        x = x.reshape(shape_out)
        return x

class GPT2Attention(jt.nn.Module):
    def __init__(self, config, is_cross_attention=False, layer_idx=None) -> None:
        super().__init__()

        max_positions = config.n_positions
        self.bias = jt.tril(jt.ones((max_positions, max_positions), dtype=jt.uint8)).reshape(
            (1,1,max_positions,max_positions)
            ).stop_grad()  # lower triangle matrix
        # self.masked_bias = jt.array(-1e4)

        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads
        self.split_size = self.embed_dim
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"`embed_dim` must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                f" {self.num_heads})."
            )

        self.scale_attn_weights = config.scale_attn_weights
        self.is_cross_attention = is_cross_attention

        # Layer-wise attention scaling, reordering, and upcasting
        self.scale_attn_by_inverse_layer_idx = config.scale_attn_by_inverse_layer_idx
        self.layer_idx = layer_idx
        self.reorder_and_upcast_attn = config.reorder_and_upcast_attn

        if self.is_cross_attention:
            self.c_attn = Conv1D(2 * self.embed_dim, self.embed_dim)
            self.q_attn = Conv1D(self.embed_dim, self.embed_dim)
        else:
            self.c_attn = Conv1D(3 * self.embed_dim, self.embed_dim)
        self.c_proj = Conv1D(self.embed_dim, self.embed_dim)

        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)

        self.pruned_heads = set()

    def _attn(self, query, key, value, attention_mask=None, head_mask=None):
        attn_weights = jt.matmul(query, key.transpose(-1, -2))

        if self.scale_attn_weights:
            attn_weights = attn_weights / jt.full(
                [], value.shape[-1] ** 0.5, dtype=attn_weights.dtype
            )

        # Layer-wise attention scaling
        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            batch_size, num_heads = query.size(0), query.size(1)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
            causal_mask = causal_mask.repeat(batch_size, num_heads, 1 , 1)
            mask_value = -3.4028234663852886e+38

            mask_value = jt.full([], mask_value, dtype=attn_weights.dtype)
            attn_weights = jt.where(causal_mask, attn_weights, mask_value)

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.softmax(attn_weights, dim=-1)

        attn_weights = self.attn_dropout(attn_weights)

        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask

        attn_output = jt.matmul(attn_weights, value)

        return attn_output, attn_weights

    def _upcast_and_reordered_attn(self, query, key, value, attention_mask=None, head_mask=None):
        bsz, num_heads, q_seq_len, dk = query.size()
        _, _, k_seq_len, _ = key.size()

        attn_weights = jt.empty(bsz * num_heads, q_seq_len, k_seq_len, dtype=jt.float32)

        # Compute Scale Factor
        scale_factor = 1.0
        if self.scale_attn_weights:
            scale_factor /= float(value.size(-1)) ** 0.5

        if self.scale_attn_by_inverse_layer_idx:
            scale_factor /= float(self.layer_idx + 1)

        # Upcast (turn off autocast) and reorder (Scale K by 1 / root(dk))

        q, k = query.reshape(-1, q_seq_len, dk), key.transpose(-1, -2).reshape(-1, dk, k_seq_len)
        attn_weights = scale_factor * jt.bmm(q.float(),k.float())

        attn_weights = attn_weights.reshape(bsz, num_heads, q_seq_len, k_seq_len)

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
            mask_value = 1e-32  # torch.finfo(attn_weights.dtype).min

            mask_value = jt.array(mask_value, dtype=attn_weights.dtype)
            attn_weights = jt.where(causal_mask, attn_weights, mask_value)

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        if attn_weights.dtype != jt.float32:
            raise RuntimeError("Error with upcasting, attn_weights does not have dtype jt.float32")
        attn_weights = attn_weights.type(value.dtype)
        attn_weights = self.attn_dropout(attn_weights)

        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask

        attn_output = jt.matmul(attn_weights, value)

        return attn_output, attn_weights

    def _split_heads(self, tensor, num_heads, attn_head_size):
        """
        Splits hidden_size dim into attn_head_size and num_heads
        """
        new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
        tensor = tensor.view(new_shape)
        return tensor.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)

    def _merge_heads(self, tensor, num_heads, attn_head_size):
        """
        Merges attn_head_size dim and num_attn_heads dim into hidden_size
        """
        tensor = tensor.permute(0, 2, 1, 3).contiguous()
        new_shape = tensor.size()[:-2] + (num_heads * attn_head_size,)
        return tensor.view(new_shape)

    def execute(
        self,
        hidden_states: Optional[Tuple[jt.array]],  # [batch_size, seq_len, embedding_size]
        layer_past: Optional[Tuple[jt.array]] = None,
        attention_mask: Optional[jt.array] = None,
        head_mask: Optional[jt.array] = None,
        encoder_hidden_states: Optional[jt.array] = None,
        encoder_attention_mask: Optional[jt.array] = None,
        use_cache: Optional[bool] = False,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[Union[jt.array, Tuple[jt.array]], ...]:
        if encoder_hidden_states is not None:
            if not hasattr(self, "q_attn"):
                raise ValueError(
                    "If class is used as cross attention, the weights `q_attn` have to be defined. "
                    "Please make sure to instantiate class with `GPT2Attention(..., is_cross_attention=True)`."
                )

            query = self.q_attn(hidden_states)
            print(self.c_attn(encoder_hidden_states).shape)
            key, value = self.c_attn(encoder_hidden_states).split(self.split_size, dim=2)
            attention_mask = encoder_attention_mask
        else:
            query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)

        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)

        if layer_past is not None:
            past_key, past_value = layer_past
            key = jt.concat((past_key, key), dim=-2)
            value = jt.concat((past_value, value), dim=-2)

        if use_cache is True:
            present = (key, value)
        else:
            present = None

        if self.reorder_and_upcast_attn:
            attn_output, attn_weights = self._upcast_and_reordered_attn(query, key, value, attention_mask, head_mask)
        else:
            attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)

        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
        attn_output = self.c_proj(attn_output)
        attn_output = self.resid_dropout(attn_output)

        outputs = (attn_output, present)
        if output_attentions:
            outputs += (attn_weights,)

        return outputs  # a, present, (attentions)

class ClassInstantier(OrderedDict):
    def __getitem__(self, key):
        content = super().__getitem__(key)
        cls, kwargs = content if isinstance(content, tuple) else (content, {})
        return cls(**kwargs)

# Problematic: it makes the output become NaN
class NewGELUActivation(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
    the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
    """
    def __init__(self, *args, **kw) -> None:
        super().__init__(*args, **kw)
    def execute(self, input: jt.array) -> jt.array:
        return 0.5 * input * (1.0 + jt.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * jt.pow(input, 3.0))))

ACT2CLS = {
    "gelu_new": NewGELUActivation,
    "relu": nn.ReLU,
    "relu6": nn.ReLU6,
    "sigmoid": nn.Sigmoid,
    "tanh": nn.Tanh,
}
ACT2FN = ClassInstantier(ACT2CLS)

# %%

class GPT2MLP(nn.Module):
    def __init__(self, intermediate_size, config):
        super().__init__()
        embed_dim = config.hidden_size
        self.c_fc = Conv1D(intermediate_size, embed_dim)
        self.c_proj = Conv1D(embed_dim, intermediate_size)
        self.act=nn.GELU()
        # self.act = ACT2FN[config.activation_function]
        self.dropout = nn.Dropout(config.resid_pdrop)

    def execute(self, hidden_states: Optional[Tuple[jt.array]]) -> jt.array:
        hidden_states = self.c_fc(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.c_proj(hidden_states)
        hidden_states = self.dropout(hidden_states)
        return hidden_states

class GPT2Block(nn.Module):
    def __init__(self, config, layer_idx=None):
        super().__init__()
        hidden_size = config.hidden_size
        inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size  # intermediate

        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.attn = GPT2Attention(config, layer_idx=layer_idx)
        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)

        if config.add_cross_attention:
            self.crossattention = GPT2Attention(config, is_cross_attention=True, layer_idx=layer_idx)
            self.ln_cross_attn = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)

        self.mlp = GPT2MLP(inner_dim, config)

    def execute(
        self,
        hidden_states: Optional[Tuple[jt.array]],
        layer_past: Optional[Tuple[jt.array]] = None,
        attention_mask: Optional[jt.array] = None,
        head_mask: Optional[jt.array] = None,
        encoder_hidden_states: Optional[jt.array] = None,
        encoder_attention_mask: Optional[jt.array] = None,
        use_cache: Optional[bool] = False,
        output_attentions: Optional[bool] = False,
    ) -> Union[Tuple[jt.array], Optional[Tuple[jt.array, Tuple[jt.array, ...]]]]:
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        # ttt1 = time.time()
        attn_outputs = self.attn(
            hidden_states,
            layer_past=layer_past,
            attention_mask=attention_mask,
            head_mask=head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        # ttt2 = time.time()
        # print('small attn ',ttt2-ttt1)
        attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
        outputs = attn_outputs[1:]
        # residual connection
        hidden_states = attn_output + residual

        if encoder_hidden_states is not None:
            # add one self-attention block for cross-attention
            if not hasattr(self, "crossattention"):
                raise ValueError(
                    f"If `encoder_hidden_states` are passed, {self} has to be instantiated with "
                    "cross-attention layers by setting `config.add_cross_attention=True`"
                )
            residual = hidden_states
            hidden_states = self.ln_cross_attn(hidden_states)
            cross_attn_outputs = self.crossattention(
                hidden_states,
                attention_mask=attention_mask,
                head_mask=head_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
            )
            attn_output = cross_attn_outputs[0]
            # residual connection
            hidden_states = residual + attn_output
            outputs = outputs + cross_attn_outputs[2:]  # add cross attentions if we output attention weights

        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        # residual connection
        hidden_states = residual + feed_forward_hidden_states

        if use_cache:
            outputs = (hidden_states,) + outputs
        else:
            outputs = (hidden_states,) + outputs[1:]

        return outputs  # hidden_states, present, (attentions, cross_attentions)

class GPT2(nn.Module):
    def __init__(self, config:GPT2config):
        super().__init__()
        self.config=config
        self.embed_dim = config.hidden_size

        self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)

        self.drop = nn.Dropout(config.embd_pdrop)
        self.h = nn.ModuleList([GPT2Block(config, layer_idx=i) for i in range(config.num_hidden_layers)])
        self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
        # the mask is already added inside the attention module
        # self.attention_mask=jt.array(np.load("./attention_mask.npy"))

        # Model parallel
        self.model_parallel = False
        self.device_map = None
        self.gradient_checkpointing = False

    def execute(self,input_ids):
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        batch_size = input_ids.shape[0]
        position_ids = jt.arange(0, self.config.max_position_embeddings, dtype=jt.int64)
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
        pos_emb=self.wpe(position_ids)
        tok_emb=self.wte(input_ids)
        x=pos_emb+tok_emb
        for block in self.h:
            x=block(x)[0]
        x=self.ln_f(x)
        logits=jt.matmul(x,jt.transpose(self.wte.weight,[1,0]))

        return logits

dummy_input=jt.arange(0,1024,dtype=jt.int64)

config=GPT2config()
model=GPT2(config)

y=model(dummy_input)

dataloader=genDataloader_dis(1*8)  # adjust as needed; note that the batch size set in the Dataset class is the sum over all nodes, i.e. the total batch size, not the batch size a single node receives.
loss_function=nn.CrossEntropyLoss()
configOpt=OptimizerConfig()
optimizer=nn.AdamW(model.parameters(),lr=configOpt.lr,weight_decay=configOpt.wd)
lr_scheduler=lr_scheduler.Anneals_Cosine_LR_Scheduler()
acc_step=8  # adjust as needed

def train(model,train_loader,loss_function,optimizer,acc_step):
    model.train()
    loss_acc=0.0
    real_step=0
    for step,(inputs,label) in enumerate(train_loader):
        # if jt.rank==0:
        #     print(inputs.shape)
        output=model(inputs)
        loss=loss_function(output.view(-1,output.size(-1)),label.view(-1))/acc_step
        loss_acc+=loss.item()*acc_step
        # loss_acc+=0

        optimizer.backward(loss)
        if (step+1)%acc_step==0:
            avg_loss=loss_acc/acc_step
            # for param_group in optimizer.param_groups:
            #     lr=lr_scheduler.step_lr()
            #     param_group["lr"]=lr
            optimizer.clip_grad_norm(1.0)
            optimizer.step()

            if jt.rank==0:
                print(avg_loss,flush=True)
            loss_acc=0.0
            real_step+=1
            if(real_step==1001):
                print("done!",flush=True)
                break

train(model,dataloader,loss_function,optimizer,acc_step)
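One way to quantify the growth described in the title is to dump the allocator state after every micro-batch. This is a sketch, not part of the original script; it assumes Jittor's jt.display_memory_info() and jt.sync_all() helpers (the former is what produced the display_memory_info dumps in the log above):

# Hypothetical instrumentation of the accumulation loop: print allocator state
# after every optimizer.backward() so per-micro-batch memory growth becomes visible.
for step, (inputs, label) in enumerate(dataloader):
    output = model(inputs)
    loss = loss_function(output.view(-1, output.size(-1)), label.view(-1)) / acc_step
    optimizer.backward(loss)
    if jt.rank == 0:
        jt.sync_all()              # finish queued ops before reading memory stats
        jt.display_memory_info()   # same report format as the dumps in the log
    if (step + 1) % acc_step == 0:
        optimizer.clip_grad_norm(1.0)
        optimizer.step()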


Expected behavior

GPU memory usage should stay roughly constant regardless of the number of gradient accumulation steps, so that GPT2-XL can be trained with gradient accumulation.
