Summary:
This PR adds an nsys report analyzer that provides the following metrics:

```python
nsys_metrics_to_reports = {
    # the sum of kernel execution time
    "nsys_gpu_kernel_sum": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the overhead of kernel launch
    "nsys_launch_overhead": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the names of kernels
    "nsys_kernel_names": ["cuda_gpu_kern_sum"],
    # the durations of kernels
    "nsys_kernel_durations": ["cuda_gpu_kern_sum"],
    # the duration of nvtx range
    "nsys_nvtx_range_duration": ["nvtx_sum"],
    # the number of kernels
    "nsys_num_of_kernels": ["cuda_gpu_kern_sum"],
}
```

`nsys_gpu_kernel_sum` is the total GPU kernel execution time, `nsys_nvtx_range_duration` is the total execution time of the operator, and `nsys_launch_overhead` is their difference, which indicates the launch overhead. This is one of the ways to measure execution time mentioned in #50.

Fix #67

Pull Request resolved: #65

Test Plan:
```
% python run.py --op rope --num-inputs 1 --metrics nsys_gpu_kernel_sum,nsys_launch_overhead,nsys_kernel_names,nsys_kernel_durations,nsys_nvtx_range_duration,nsys_num_of_kernels --csv --dump-csv
0%| | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
0%| | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-531e.qdstrm'
[1/1] [0% ] nsys_output.nsys-rep
Processing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated: /tmp/tritonbench/rope/nsys_traces/apply_rotary_pos_emb_0/nsys_output.nsys-rep
0%| | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-39ea.qdstrm'
[1/1] [0% ] nsys_output.nsys-rep
Processing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated: /tmp/tritonbench/rope/nsys_traces/liger_rotary_pos_emb_0/nsys_output.nsys-rep
0%| | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-e8bf.qdstrm'
[1/1] [0% ] nsys_output.nsys-rep
Processing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated: /tmp/tritonbench/rope/nsys_traces/inductor_rotary_pos_emb_full_op_0/nsys_output.nsys-rep
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.40s/it]
(H, T);apply_rotary_pos_emb-nsys_kernel_names;apply_rotary_pos_emb-nsys_kernel_durations;apply_rotary_pos_emb-nsys_gpu_kernel_sum;apply_rotary_pos_emb-nsys_num_of_kernels;apply_rotary_pos_emb-nsys_launch_overhead;apply_rotary_pos_emb-nsys_nvtx_range_duration;liger_rotary_pos_emb-nsys_kernel_names;liger_rotary_pos_emb-nsys_kernel_durations;liger_rotary_pos_emb-nsys_gpu_kernel_sum;liger_rotary_pos_emb-nsys_num_of_kernels;liger_rotary_pos_emb-nsys_launch_overhead;liger_rotary_pos_emb-nsys_nvtx_range_duration;inductor_rotary_pos_emb_full_op-nsys_kernel_names;inductor_rotary_pos_emb_full_op-nsys_kernel_durations;inductor_rotary_pos_emb_full_op-nsys_gpu_kernel_sum;inductor_rotary_pos_emb_full_op-nsys_num_of_kernels;inductor_rotary_pos_emb_full_op-nsys_launch_overhead;inductor_rotary_pos_emb_full_op-nsys_nvtx_range_duration
(8192, 1024);['void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::MulFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>, unsigned int, (int)4, (int)64, (int)64>(T1 *, at::native::<unnamed>::CatArrInputTensorMetadata<T1, T2, T4, T5>, at::native::<unnamed>::TensorSizeStride<T2, (unsigned int)4>, int, T2)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::CUDAFunctor_add<float>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::neg_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)'];0.090065;0.351364;4;0.4534;0.804764;['_triton_rope'];0.049281;0.049281;1;0.176437;0.225718;['triton_poi_fused_add_cat_mul_0', 'triton_poi_fused_add_cat_mul_1'];0.0266885;0.053377;2;0.444969;0.498346
[TritonBench] Dumped csv to /tmp/tritonbench/op_rope__z_yqmrz.csv
```

Reviewed By: xuzhao9

Differential Revision: D66311127

Pulled By: FindHao

fbshipit-source-id: 085454e34a3e9aadb360309cc69885684a8a1758
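
As a sanity check on the relationship described in the summary (launch overhead is the NVTX range duration minus the GPU kernel sum), here is a minimal sketch that recomputes `nsys_launch_overhead` from the values in the CSV row above. The helper `launch_overhead` and the `reported` dictionary are hypothetical names used only for illustration; they are not part of the analyzer added in this PR.

```python
import math


def launch_overhead(nvtx_range_duration: float, gpu_kernel_sum: float) -> float:
    """Hypothetical helper: time inside the NVTX range not spent in GPU kernels."""
    return nvtx_range_duration - gpu_kernel_sum


# (nsys_nvtx_range_duration, nsys_gpu_kernel_sum, nsys_launch_overhead),
# copied from the CSV row in the test plan for input (8192, 1024).
reported = {
    "apply_rotary_pos_emb": (0.804764, 0.351364, 0.4534),
    "liger_rotary_pos_emb": (0.225718, 0.049281, 0.176437),
    "inductor_rotary_pos_emb_full_op": (0.498346, 0.053377, 0.444969),
}

for backend, (nvtx, kernels, expected) in reported.items():
    overhead = launch_overhead(nvtx, kernels)
    # The recomputed difference should match the reported metric up to rounding.
    assert math.isclose(overhead, expected, abs_tol=1e-6)
    print(f"{backend}: launch overhead = {overhead:.6f}")
```

For all three backends the recomputed difference matches the reported `nsys_launch_overhead` column, which is consistent with the definition given in the summary.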