Commit
ci: run 1.10 tests only on Lux and LuxLib
Showing 14 changed files with 6 additions and 37 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,7 +23,6 @@ steps: | |
matrix: | ||
setup: | ||
julia: | ||
- "1.10" | ||
- "1" | ||
|
||
env: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,6 @@ steps: | |
matrix: | ||
setup: | ||
julia: | ||
- "1.10" | ||
- "1" | ||
|
||
env: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,7 +26,6 @@ steps: | |
matrix: | ||
setup: | ||
julia: | ||
- "1.10" | ||
- "1" | ||
|
||
- group: ":julia: (WeightInitializers) AMD GPU" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,7 +23,6 @@ jobs: | |
fail-fast: false | ||
matrix: | ||
version: | ||
- "1.10" | ||
- "1" | ||
steps: | ||
- uses: actions/checkout@v4 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,6 @@ jobs: | |
fail-fast: false | ||
matrix: | ||
version: | ||
- "1.10" | ||
- "1" | ||
os: | ||
- ubuntu-latest | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,7 +24,6 @@ jobs: | |
fail-fast: false | ||
matrix: | ||
version: | ||
- "1.10" | ||
- "1" | ||
os: | ||
- ubuntu-latest | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,7 +24,6 @@ jobs: | |
fail-fast: false | ||
matrix: | ||
version: | ||
- "1.10" | ||
- "1" | ||
os: | ||
- ubuntu-latest | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,7 +24,6 @@ jobs: | |
fail-fast: false | ||
matrix: | ||
version: | ||
- "1.10" | ||
- "1" | ||
os: | ||
- ubuntu-latest | ||
|
900c21c
Lux Benchmarks
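Each benchmark entry in this comment carries two timings in nanoseconds followed by a ratio. The ratio matches the first timing divided by the second (for example 4270.5 / 4709 ≈ 0.91). A minimal sketch of that arithmetic follows, assuming the first timing belongs to this commit and the second to the baseline it is compared against. The function names and the 1.3 threshold are hypothetical and are not taken from the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """First timing divided by the second, rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def classify(current_ns: float, previous_ns: float, threshold: float = 1.3) -> str:
    """Label a timing pair; `threshold` is an illustrative cut-off, not a tool default."""
    r = current_ns / previous_ns
    if r >= threshold:
        return "slower"
    if r <= 1 / threshold:
        return "faster"
    return "unchanged"


# Values taken from entries in this comment.
print(ratio(4270.5, 4709))        # 0.91
print(classify(6125, 4125))       # slower (ratio 1.48)
print(classify(204458, 318542))   # faster (ratio 0.64)
```

Most sub-microsecond CPU entries swing by 10 to 20 percent in both directions between runs, so a single ratio near 1.1 or 0.9 is weak evidence of a real change.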
Benchmark | Current | Previous | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4270.5 ns | 4709 ns | 0.91 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4000 ns | 4792 ns | 0.83 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5166 ns | 1.14 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4895.5 ns | 4416 ns | 1.11 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 59833 ns | 60862 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 10416 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9958 ns | 9875 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10792 ns | 11417 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10542 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 422438 ns | 426730.5 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1000 ns | 1333 ns | 0.75 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1417 ns | 1291 ns | 1.10 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1125 ns | 1395.5 ns | 0.81 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18109 ns | 18565 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4166 ns | 4209 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4125 ns | 4042 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4187.5 ns | 4167 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4042 ns | 4000 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 109209 ns | 111556 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57645.5 ns | 56375 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47000 ns | 46916 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38125 ns | 46167 ns | 0.83 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82084 ns | 80959 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37455 ns | 37697 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1973687 ns | 2046500 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2089416 ns | 2089354 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2085625 ns | 2048708.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1985813 ns | 1993834 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195917 ns | 199690 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 146416.5 ns | 147104.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 147020.5 ns | 144104.5 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145667 ns | 148584 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145604.5 ns | 144583.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166391 ns | 165605 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1129209 ns | 1131291 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1126375 ns | 1119584 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1147667 ns | 1111791.5 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1104209 ns | 1118209 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 521058.5 ns | 531488 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3416.5 ns | 3500 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3333 ns | 3709 ns | 0.90 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6333 ns | 5520.5 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3250 ns | 3375 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 66594 ns | 71213 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8792 ns | 9084 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 9625 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 10167 ns | 0.91 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 8584 ns | 1.08 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 493812 ns | 497375 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14750 ns | 15458.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15458 ns | 15250 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19167 ns | 19146 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16437.5 ns | 14604 ns | 1.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 53833 ns | 55040 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215416.5 ns | 213833 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213208.5 ns | 213292 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214271 ns | 215292 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227104 ns | 217500 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 271460 ns | 277020.5 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 583 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 792 ns | 709 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17470 ns | 17919 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1750 ns | 1542 ns | 1.13 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1709 ns | 1916 ns | 0.89 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1645.5 ns | 1375 ns | 1.20 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 101826.5 ns | 104816 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7250 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5916 ns | 5875 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5292 ns | 5916 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 9875 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23857.5 ns | 24078 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226895.5 ns | 229750 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230375 ns | 228583 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231584 ns | 230292 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258625 ns | 213917 ns | 1.21 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 167659 ns | 172648 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3833 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3916 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3833 ns | 3875 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23468 ns | 23922 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16750 ns | 16458 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 17042 ns | 16583 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17000 ns | 16958 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16625 ns | 16750 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 160597 ns | 166168.5 ns | 0.97 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 572166 ns | 579542 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 575000 ns | 576458 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 587458 ns | 578750 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 578334 ns | 574667 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113397 ns | 113828 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1421708 ns | 1424688 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1420125 ns | 1421083 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1430083 ns | 1423208.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1413292 ns | 1419500 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 209669.5 ns | 215564 ns | 0.97 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1074458 ns | 1071229.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 958250.5 ns | 961417 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1334396 ns | 1343000 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1310875 ns | 1300000.5 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 269120.5 ns | 277770.5 ns | 0.97 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5769437 ns | 5955916 ns | 0.97 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4470625 ns | 4519500 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4941021 ns | 4916354.5 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5552042 ns | 5726333 ns | 0.97 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1066489 ns | 1105672 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23585 ns | 24042 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2084 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2250 ns | 2208 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2125 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 169900 ns | 173326.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4084 ns | 4000 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6250 ns | 4584 ns | 1.36 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7209 ns | 7083 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6125 ns | 4125 ns | 1.48 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 64199 ns | 65959 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11083 ns | 11084 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11625 ns | 11000 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12000 ns | 12292 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10917 ns | 10791 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 446167.5 ns | 456125.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6042 ns | 7000 ns | 0.86 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7042 ns | 6458 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8833 ns | 8500 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7250 ns | 6292 ns | 1.15 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 51074.5 ns | 54186 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17292 ns | 16708 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18334 ns | 17875 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18083 ns | 18750 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17229.5 ns | 16875 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 299895.5 ns | 308312 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 459 ns | 500 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32630 ns | 33294 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 8708 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9041 ns | 9208 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9166 ns | 9458 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8459 ns | 8292 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 158907 ns | 162415.5 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64625 ns | 64625 ns | 1 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64250 ns | 64667 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 65000 ns | 64666 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64667 ns | 64625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111460 ns | 112234 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 289667 ns | 284395.5 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 279750 ns | 286937.5 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 289625 ns | 285291 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 281250 ns | 277917 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 184453.5 ns | 188885.5 ns | 0.98 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3347125 ns | 3237000 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3015520.5 ns | 3046417 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 2792979 ns | 3014917 ns | 0.93 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4064520.5 ns | 3953541.5 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 588037 ns | 577323 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7500166 ns | 7569937.5 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7470229.5 ns | 7460791.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7393937.5 ns | 7457666.5 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8209000 ns | 8209666 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1331630 ns | 1380365.5 ns | 0.96 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 19529541 ns | 18994750 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19142959 ns | 19146458 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19022708 ns | 19185583 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15703750 ns | 15773833 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23617083 ns | 24040875 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33598208 ns | 33769833 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 41100666 ns | 37025062.5 ns | 1.11 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35022333 ns | 34849833 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1855178.5 ns | 1855448 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 189352250 ns | 192176500 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 163568208 ns | 165400792 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 158452896 ns | 153088459 ns | 1.04 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 438607167 ns | 439540208 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13925600.5 ns | 13926820 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 287704167 ns | 292222499.5 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 337952937.5 ns | 338088333 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 291466708 ns | 298393250 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 395696000 ns | 394164437.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21334 ns | 23395.5 ns | 0.91 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24375 ns | 23000 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25771 ns | 26479.5 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23584 ns | 22271 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95861 ns | 96215.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103625 ns | 103541.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 103708 ns | 104375 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104625 ns | 105000 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103479.5 ns | 106291 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 510517.5 ns | 499410 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 7125 ns | 0.81 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7208 ns | 6542 ns | 1.10 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7666.5 ns | 7916 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7166 ns | 5875 ns | 1.22 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68604 ns | 67753 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14708 ns | 15250 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15916 ns | 15500 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16666 ns | 16666 ns | 1 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14667 ns | 14667 ns | 1 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 483804.5 ns | 471687 ns | 1.03 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 2876500 ns | 3030208.5 ns | 0.95 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2063833 ns | 2057020.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2288208 ns | 2271375 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4870416 ns | 4518521 ns | 1.08 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 587700 ns | 585712 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23421375 ns | 23780833 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 17990750 ns | 17907042 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18312792 ns | 16907896 ns | 1.08 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35646292 ns | 34889792 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3104605 ns | 3222471 ns | 0.96 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33240625 ns | 33703875 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27662417 ns | 27577959 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27837459 ns | 27463958 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41788833 ns | 41773187 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72083 ns | 73687.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 78729 ns | 73292 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75729.5 ns | 83417 ns | 0.91 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72459 ns | 74667 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 100762.5 ns | 101830 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 204458 ns | 318542 ns | 0.64 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219041 ns | 216770.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 320458 ns | 219750 ns | 1.46 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205312.5 ns | 297396 ns | 0.69 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 541454.5 ns | 550055 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11333 ns | 11937.5 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12416 ns | 11958 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13834 ns | 13395.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 13125 ns | 11584 ns | 1.13 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 69856.5 ns | 71500 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26520.5 ns | 26666 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27458 ns | 26875 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28291 ns | 27792 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26500 ns | 26500 ns | 1 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 473341 ns | 478647.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11833 ns | 12458 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12750 ns | 12750 ns | 1 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14333 ns | 14042 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 13375 ns | 12042 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 51587 ns | 54279 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26375 ns | 25792 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26583 ns | 25791 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26666 ns | 26584 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26417 ns | 25833.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 302777.5 ns | 307846.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 178666.5 ns | 180187.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 180292 ns | 179750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184416.5 ns | 183375 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179709 ns | 179041 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55677 ns | 57080 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 591146.5 ns | 584708.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 588583 ns | 587833 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 593062 ns | 595750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582708.5 ns | 587000 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 285027 ns | 286439 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5667 ns | 6541.5 ns | 0.87 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7167 ns | 6708 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7895.5 ns | 7500 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7291 ns | 5750 ns | 1.27 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 69657.5 ns | 70275 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14167 ns | 13937.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14958 ns | 14708 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15854.5 ns | 15583 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14583 ns | 13500 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 460443 ns | 465284 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1194208.5 ns | 1198000 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1216792 ns | 1218958 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1262604 ns | 1268562.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1318166.5 ns | 1315416 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301559 ns | 302635 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4098416 ns | 4311792 ns | 0.95 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4352937.5 ns | 4360354 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4631875 ns | 4524583 ns | 1.02 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4436562.5 ns | 4481833 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1042661.5 ns | 1039337 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1750 ns | 1833 ns | 0.95 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1875 ns | 1833 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23523 ns | 23819 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4792 ns | 4834 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4875 ns | 4875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4916 ns | 5000 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4875 ns | 4875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 187370 ns | 189325 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5500 ns | 6250 ns | 0.88 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6334 ns | 6084 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8604 ns | 8291 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7292 ns | 5750 ns | 1.27 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 54466 ns | 56699 ns | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10958 ns | 11125 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11792 ns | 12083 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11708.5 ns | 11875 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11166 ns | 11125 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 330839 ns | 333470 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22873.5 ns | 23140 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2708 ns | 2834 ns | 0.96 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2959 ns | 2709 ns | 1.09 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3042 ns | 3042 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2750 ns | 2750 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 157537.5 ns | 160474 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10750 ns | 11833 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 13708 ns | 12500 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 14958 ns | 15000 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14583 ns | 11667 ns | 1.25 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 55574.5 ns | 57479 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25209 ns | 24667 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25250 ns | 25000 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25375 ns | 25583 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24979.5 ns | 25125 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 292656 ns | 294701.5 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4167 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4208 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4208 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4167 ns | 4125 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24774 ns | 25243 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16333 ns | 15959 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16125 ns | 16167 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16125 ns | 16500 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16084 ns | 16125 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 195031.5 ns | 196657.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 5667 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5708 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5750 ns | 5709 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5709 ns | 5708 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33326 ns | 34103 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21125 ns | 20375 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20875 ns | 21166 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21583 ns | 21500 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21500 ns | 21083 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 175195.5 ns | 178406.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 415708 ns | 380541 ns | 1.09 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 376667 ns | 375333 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 471499.5 ns | 487875 ns | 0.97 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 523500 ns | 532687 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66680.5 ns | 67192 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 924750.5 ns | 993167 ns | 0.93 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 849291 ns | 884334 ns | 0.96 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1217521 ns | 1238562.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1302292 ns | 1412624.5 ns | 0.92 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 189339 ns | 189581 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 79792 ns | 86875 ns | 0.92 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82667 ns | 80583 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84208 ns | 85875 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82833 ns | 80791.5 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193132 ns | 192886.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1917625.5 ns | 1924208 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1915292 ns | 1916917 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1940917 ns | 1920541 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1896541 ns | 1907750 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 395963 ns | 398152 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21798 ns | 22307 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1791 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1792 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 167505 ns | 170162 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5834 ns | 6792 ns | 0.86 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7500 ns | 7458.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9958 ns | 9604.5 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6875 ns | 6458.5 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 58244.5 ns | 60140 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 8875 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9333 ns | 9208 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9354.5 ns | 9250 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9625 ns | 9208 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 302935 ns | 308605.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119443416.5 ns | 156095333.5 ns | 0.77 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173896250 ns | 174294250 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 155811625 ns | 147908167 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 108054541 ns | 105395375 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5469386 ns | 5479498 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 616746166.5 ns | 674867041 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555745625 ns | 555334333 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 468855125 ns | 454020333.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 760571396 ns | 758003104 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34956216 ns | 34951781 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 648663875 ns | 701059834 ns | 0.93 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 664591146 ns | 666716125.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 601178041.5 ns | 580121499.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 746069334 ns | 741952792 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59458 ns | 57708 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47083 ns | 47333 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39166 ns | 47250 ns | 0.83 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83208 ns | 83959 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37582 ns | 37806 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926708 ns | 1934958.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983042 ns | 1972000 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1986937.5 ns | 1976374.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1850250 ns | 1886667 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 173017.5 ns | 174540 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 265187.5 ns | 274833.5 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 267959 ns | 267625 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 276771 ns | 288750 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 266917 ns | 275791.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 128834.5 ns | 127747 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 604083 ns | 588791.5 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 692833.5 ns | 676334 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 705709 ns | 669375.5 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 590291.5 ns | 637708 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 683429 ns | 705367 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2195333 ns | 2201812.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2225625 ns | 2173417 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2230583 ns | 2204166 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2183333 ns | 2175854 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133325.5 ns | 133869 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5480833 ns | 5561000 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5508958 ns | 5485083 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5585895.5 ns | 5500791 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5490125 ns | 5486667 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 766206 ns | 758600 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 646750 ns | 650375 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 660250 ns | 639375 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 642917 ns | 639250 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 647375 ns | 645541 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47306 ns | 46906 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1828875 ns | 1797375 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1721042 ns | 1723000 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1665209 ns | 1729417 ns | 0.96 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2097000 ns | 2102375 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 223896.5 ns | 224012.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58667 ns | 57125 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47750 ns | 46792 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38958 ns | 46792 ns | 0.83 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82750 ns | 83625 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29191 ns | 28934 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029083.5 ns | 2042125 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2091166 ns | 2085750 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2107249.5 ns | 2086104 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994854.5 ns | 1992187.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190986 ns | 192769 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13371291 ns | 13486000 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12436583.5 ns | 12454854 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12675625 ns | 12584062 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15146959 ns | 15166646 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 517535.5 ns | 516981.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47259416 ns | 47757417 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41746209 ns | 41920875 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41384750 ns | 41057895.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58440500 ns | 58660917 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3203835 ns | 3200471 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 73984667 ns | 74173979 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91223791.5 ns | 68296125 ns | 1.34 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90609938 ns | 90853250 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 77234000 ns | 76369500 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59000 ns | 57542 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47417 ns | 47333 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38917 ns | 47208 ns | 0.82 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81125 ns | 83542 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47741 ns | 47283 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1911646 ns | 1917416.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1970541 ns | 1969750 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1976417 ns | 1977666 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1882083 ns | 1891062.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195868.5 ns | 191945 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32615 ns | 32084 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6166 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6417 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6750 ns | 6959 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6334 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 176818 ns | 173427.5 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32102 ns | 31620 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2667 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2875 ns | 2792 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2916 ns | 2959 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2625 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 164236.5 ns | 161588.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 286096229 ns | 322222750 ns | 0.89 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339570541 ns | 341161875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 321242167 ns | 313409520.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 271493208 ns | 272857666 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7111512 ns | 7106282 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 987492667 ns | 1057275812.5 ns | 0.93 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 939040416 ns | 937359791 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 868433209 ns | 852420750 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1162204042 ns | 1161160000 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34040446 ns | 34076180 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1310851000.5 ns | 1357441042 ns | 0.97 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1685402625 ns | 1321006541.5 ns | 1.28 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1648347125 ns | 1604272875 ns | 1.03 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1310788750 ns | 1302899708.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1412625 ns | 1417312.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1412041.5 ns | 1438625 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1424625 ns | 1422375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1408334 ns | 1404187.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128501 ns | 127360 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5028875 ns | 5059667 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5030104 ns | 5032458 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5062042 ns | 5024750 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5014021 ns | 5017709 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 597004.5 ns | 498493.5 ns | 1.20 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 168008834 ns | 172134417 ns | 0.98 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 130299417 ns | 132190854 ns | 0.99 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 148283479 ns | 125671875 ns | 1.18 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 161948354 ns | 162159562.5 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 5052268 ns | 4881912.5 ns | 1.03 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 662817209 ns | 676531000 ns | 0.98 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 492884417 ns | 642244500 ns | 0.77 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 507367709 ns | 502997666 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 678320708 ns | 678617458 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 17294527 ns | 17408311 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8884604 ns | 9098854 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8801959 ns | 8775166.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 8221541.5 ns | 7856833.5 ns | 1.05 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10127167 ns | 10166000 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1611762 ns | 1591045 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36027125 ns | 37558563 ns | 0.96 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36933063 ns | 37073459 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34547750 ns | 33526542 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38824854 ns | 38790125 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6452267 ns | 6476971 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47375 ns | 47333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47250 ns | 47333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47542 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47333 ns | 47125 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19020 ns | 19085 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50312.5 ns | 50333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50500 ns | 52875 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50958.5 ns | 53083 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50333 ns | 50250 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 226580 ns | 184149.5 ns | 1.23 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6542 ns | 7458 ns | 0.88 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7187.5 ns | 7333 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9083 ns | 8667 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8625 ns | 6708 ns | 1.29 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 117383.5 ns | 84192.5 ns | 1.39 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9625 ns | 9917 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10208 ns | 9917 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10333.5 ns | 11041 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10209 ns | 9917 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 723908.5 ns | 493810 ns | 1.47 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6083 ns | 7542 ns | 0.81 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8250 ns | 7667 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9417 ns | 9667 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8375 ns | 5417 ns | 1.55 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 157024.5 ns | 91440.5 ns | 1.72 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13292 ns | 12625 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13792 ns | 13833 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13708 ns | 14000 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12834 ns | 12666 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 618769 ns | 454481 ns | 1.36 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 1000 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32863 ns | 32617 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7875 ns | 7708 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8145.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8208 ns | 8500 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8125 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 246953.5 ns | 196206.5 ns | 1.26 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 25062.5 ns | 23083 ns | 1.09 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23291.5 ns | 23375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23542 ns | 23583 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23250 ns | 23750 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18661 ns | 18627 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52625 ns | 52500 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52833 ns | 52875 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52875 ns | 53417 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52333 ns | 52166 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 364018 ns | 249106 ns | 1.46 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1403750 ns | 1448167 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1451354 ns | 1405000 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1407542 ns | 1405874.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1406458 ns | 1403917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196760 ns | 195637 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5023250 ns | 5038167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5018687.5 ns | 5020646 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5042125 ns | 5017458 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5001750 ns | 5008375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 766930 ns | 558064 ns | 1.37 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3048708 ns | 3065354.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2082646 ns | 2082084 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2300125 ns | 2285291 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4855000 ns | 4897375 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583278 ns | 583035 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24263250 ns | 24715854 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18905459 ns | 18870292 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19193375 ns | 18758208 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36575416 ns | 36783917 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3216229 ns | 3184571 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34013563 ns | 34426125 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28342229 ns | 28319896 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28436750 ns | 28022958.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 43339875 ns | 41761166.5 ns | 1.04 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144288959 ns | 144957333 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 142279583 ns | 142855500 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126469000.5 ns | 124763354 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 168866000 ns | 173311167 ns | 0.97 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22582893 ns | 22559600 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1275599313 ns | 956543708 ns | 1.33 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1058487228.5 ns | 1622781604 ns | 0.65 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 712851209 ns | 1236835833 ns | 0.58 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 668538250 ns | 673901750 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 119108875 ns | 118606884 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 83125 ns | 74208 ns | 1.12 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76208 ns | 74834 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78125 ns | 86875 ns | 0.90 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72729 ns | 73041.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 365097 ns | 204598.5 ns | 1.78 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 189959 ns | 278208.5 ns | 0.68 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 287792 ns | 202666.5 ns | 1.42 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 268875 ns | 288416 ns | 0.93 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 189583.5 ns | 287917 ns | 0.66 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1559670.5 ns | 1117217.5 ns | 1.40 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35476167 ns | 36148959 ns | 0.98 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35447729.5 ns | 35295854 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32304459 ns | 32189834 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40935146 ns | 40944021 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5843273 ns | 5845476 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 147875542 ns | 151293125 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 152751312.5 ns | 152622708.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 139824437 ns | 134152417 ns | 1.04 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287719375 ns | 287902584 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34882914 ns | 34882228 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120880395.5 ns | 155688000 ns | 0.78 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174358791 ns | 174601250 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 155429791 ns | 147696687.5 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106966959 ns | 106151041.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5456342 ns | 5471843 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 470623375 ns | 518343938 ns | 0.91 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466918000 ns | 467330167 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 456589562.5 ns | 438511083.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 742113834 ns | 738327500 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32255425 ns | 32271735 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 706243291.5 ns | 689829417 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 652697541.5 ns | 655962042 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 591007625 ns | 572893458 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 851805375 ns | 850499333 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1320583.5 ns | 1204208 ns | 1.10 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 965875 ns | 909228.5 ns | 1.06 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 736687.5 ns | 975604.5 ns | 0.76 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1944666.5 ns | 2068166 ns | 0.94 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 564187.5 ns | 573967.5 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2971708.5 ns | 2921979 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2620334 ns | 2595937 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2535604 ns | 2601958 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3604083.5 ns | 3701291 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1878347.5 ns | 1629819 ns | 1.15 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6649958 ns | 6735042 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6493042 ns | 6496187.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6437479.5 ns | 6432833.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4435750 ns | 4458667 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7208 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6084 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 6125 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9916 ns | 10000 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25400 ns | 25112 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213645.5 ns | 214479.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221833 ns | 219625 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221250 ns | 221583 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205875 ns | 206125 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 293719.5 ns | 247799 ns | 1.19 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 301604437.5 ns | 312548750 ns | 0.96 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 221356625 ns | 223228250 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 223278083.5 ns | 196993083 ns | 1.13 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 312163250 ns | 310829208 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7672763 ns | 7675013 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1078062604.5 ns | 1097849625.5 ns | 0.98 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 896268771 ns | 906889750 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 880668729 ns | 868243875 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1161143188 ns | 1161595250 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26517571 ns | 26504585 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5500 ns | 5250 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5750 ns | 6520.5 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9437.5 ns | 7375 ns | 1.28 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5125 ns | 1.15 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 201555 ns | 155225.5 ns | 1.30 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 6917 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7458 ns | 7541 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7584 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7041.5 ns | 7250 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 699933.5 ns | 614403 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 584 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 458 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23724.5 ns | 24324 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 9209 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9625 ns | 9333 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9604.5 ns | 9709 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 9083 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 234828.5 ns | 214987 ns | 1.09 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351500 ns | 352000 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 350896 ns | 351167 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 354624.5 ns | 352000 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 351708 ns | 351667 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 20984 ns | 21526 ns | 0.97 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 775417 ns | 822667 ns | 0.94 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 824916 ns | 803791 ns | 1.03 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 830958 ns | 774000 ns | 1.07 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 823958 ns | 819209 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 306663 ns | 271931 ns | 1.13 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 338083 ns | 315625 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 341500 ns | 334062.5 ns | 1.02 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 443667 ns | 448958 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 325667 ns | 335542 ns | 0.97 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17821 ns | 18135.5 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 696042 ns | 693229 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 739416.5 ns | 737125 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1042874.5 ns | 1034583 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 692645.5 ns | 697563 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 273141.5 ns | 240714.5 ns | 1.13 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 358458.5 ns | 329166 ns | 1.09 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 349125 ns | 345354 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 431291.5 ns | 424875 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 370875 ns | 374166 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22357.5 ns | 22796 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 756625 ns | 753187.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 744208.5 ns | 751083 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1073250 ns | 1069042 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 818125.5 ns | 824250 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 221398.5 ns | 214489 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3459 ns | 3458 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3541 ns | 3500 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3792 ns | 3875 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3291 ns | 3292 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17956 ns | 18145 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4417 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4208 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4416 ns | 4333 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4125 ns | 4209 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 275839.5 ns | 237972.5 ns | 1.16 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3792 ns | 6417 ns | 0.59 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 4042 ns | 0.83 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6750 ns | 6542 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6625 ns | 3375 ns | 1.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 205448.5 ns | 174590 ns | 1.18 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8334 ns | 8209 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8459 ns | 8250 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8708 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8541 ns | 8709 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1183984 ns | 1063088 ns | 1.11 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 202625 ns | 203375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210416 ns | 209625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209292 ns | 210958 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200000 ns | 200833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34588 ns | 34926 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 603792 ns | 601916 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 670625 ns | 633750 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 630958 ns | 622208.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631187.5 ns | 586000 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 352652 ns | 307649.5 ns | 1.15 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 967521 ns | 966417 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 927063 ns | 932833 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 964437.5 ns | 945958.5 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1281853.5 ns | 1291166 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207244 ns | 208387 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4451771 ns | 4606250 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4482750 ns | 4489917 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4474208 ns | 4299708 ns | 1.04 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6201166 ns | 6229250 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 945549 ns | 933347.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3604.5 ns | 3875 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3167 ns | 3833 ns | 0.83 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6792 ns | 6167 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3167 ns | 2917 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 233201 ns | 191984.5 ns | 1.21 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7666 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns | 7125 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7291 ns | 7667 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 7208 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1014881 ns | 941897 ns | 1.08 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1602833.5 ns | 1602667 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1187916 ns | 1171416 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1364062 ns | 1364375 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2343729.5 ns | 2512583 ns | 0.93 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212955.5 ns | 215456.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12334792 ns | 12345833 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9602042 ns | 9563708.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9404958 ns | 9248333 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17966833 ns | 18039541.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1949853 ns | 1941766 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17347084 ns | 17410875 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14365000 ns | 14343875 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14512666 ns | 14290187.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21005479.5 ns | 21033375 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 89791 ns | 93146 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 91729.5 ns | 89750 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94291 ns | 92375 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 117416.5 ns | 104667 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126285 ns | 126306.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2023917 ns | 2057146 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2013416.5 ns | 2030833 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2058875 ns | 2027062.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2027875 ns | 2024458 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1031286 ns | 951168 ns | 1.08 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 346791.5 ns | 327771 ns | 1.06 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 343583.5 ns | 344667 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 412250 ns | 393729 ns | 1.05 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 306166 ns | 312667 ns | 0.98 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16010 ns | 16220 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 702291 ns | 703375.5 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 728979.5 ns | 721271 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1025458 ns | 1023666.5 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 639875 ns | 653917 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 193209 ns | 187186 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7083 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 6125 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 5833 ns | 0.91 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 9916 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33620 ns | 34409 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220479.5 ns | 214083 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231958 ns | 222333.5 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 232041 ns | 221187.5 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220500 ns | 206125 ns | 1.07 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 311751 ns | 301322.5 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3625 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22440 ns | 23004 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14500 ns | 14250 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14417 ns | 14333 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14167 ns | 14500 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14291 ns | 14416 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 468658 ns | 460312.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 95166 ns | 92937.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 138021 ns | 133375 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 99167 ns | 96583.5 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 122458 ns | 136958 ns | 0.89 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125691 ns | 125681 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1931875 ns | 1754208.5 ns | 1.10 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1954979 ns | 1922334 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1946854 ns | 1933417 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923729.5 ns | 1927416.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 940251.5 ns | 955943 ns | 0.98 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 880500 ns | 857708 ns | 1.03 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 815125 ns | 817583 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1172292 ns | 1222291.5 ns | 0.96 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 960167 ns | 963416 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 270704 ns | 275885 ns | 0.98 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2803000 ns | 2826354 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2526833 ns | 2472708.5 ns | 1.02 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3361333 ns | 3311750 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3405875 ns | 3417042 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1569154 ns | 1599363 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15146 ns | 15667 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18000 ns | 15541 ns | 1.16 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21666 ns | 18791 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18125 ns | 15042 ns | 1.20 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 141811.5 ns | 143363 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217083 ns | 221562 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229375 ns | 257625 ns | 0.89 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 257396 ns | 216167 ns | 1.19 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215833 ns | 253521 ns | 0.85 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 635765.5 ns | 648580 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 219750 ns | 221958 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221500 ns | 222584 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 226021 ns | 222875 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 223937.5 ns | 219542 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 270450 ns | 287448 ns | 0.94 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 509917 ns | 560521 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 557729 ns | 506729 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 549792 ns | 497875 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 555791 ns | 524917 ns | 1.06 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1308245 ns | 1378195 ns | 0.95 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 333479 ns | 312208.5 ns | 1.07 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 335541.5 ns | 334917 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 437333 ns | 355354.5 ns | 1.23 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 319417 ns | 323229.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16583 ns | 16853 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 715333 ns | 710916 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 730292 ns | 725333.5 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1025458.5 ns | 1020291 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 655792 ns | 666458 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 193313 ns | 196645 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17625 ns | 18292 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17625 ns | 17250 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20437.5 ns | 20250 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18000 ns | 16687 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144711.5 ns | 147801.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216667 ns | 219292 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 224083 ns | 219437.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 226625 ns | 213646 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 223417 ns | 222104.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 903796 ns | 1001312.5 ns | 0.90 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4625 ns | 6458 ns | 0.72 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6750 ns | 4792 ns | 1.41 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7438 ns | 7250 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6625 ns | 4458 ns | 1.49 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 174159.5 ns | 238642 ns | 0.73 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10437.5 ns | 10792 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10750 ns | 10375 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10770.5 ns | 11375 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10833 ns | 10333 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1024421 ns | 1064757 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3646 ns | 6042 ns | 0.60 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3334 ns | 3792 ns | 0.88 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5625 ns | 4750 ns | 1.18 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3500 ns | 3209 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 231660 ns | 236410 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7708 ns | 7250 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7792 ns | 7333 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 8042 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7167 ns | 7584 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1037611.5 ns | 1074231 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23838833 ns | 24130479 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33990646 ns | 38799500 ns | 0.88 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 41585708 ns | 37733750 ns | 1.10 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34896229 ns | 34918167 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1839186 ns | 1843476 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184662833 ns | 186803646 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159634000 ns | 159613166 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 151746084 ns | 146295625 ns | 1.04 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 415075875 ns | 412659125 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16506413 ns | 16523543 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 427351833 ns | 436777542 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 251624521 ns | 253178667 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 233926312.5 ns | 232826083.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 484091542 ns | 484428667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 181666 ns | 183792 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183416.5 ns | 182000 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186125 ns | 185584 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183834 ns | 182354.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 173529.5 ns | 220958.5 ns | 0.79 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 587541 ns | 593000 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 600458 ns | 587187 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632375 ns | 588166 ns | 1.08 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631354 ns | 632000 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1005977 ns | 1068694.5 ns | 0.94 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3816041.5 ns | 3862583.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3637833 ns | 3623187 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3539646 ns | 3513333 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5351396 ns | 5351459 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 554127 ns | 534395 ns | 1.04 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17372333 ns | 17921270.5 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17218458.5 ns | 17168125 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16979478.5 ns | 16586271 ns | 1.02 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22177625 ns | 22125084 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2616933 ns | 2619299 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 459 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32036 ns | 32390 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 9417 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 8875 ns | 1.10 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10125 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9291 ns | 9125 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 260858 ns | 265134.5 ns | 0.98 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 506491042 ns | 505787208 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 428949104 ns | 430827229 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 474815000 ns | 432173291.5 ns | 1.10 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 671461979 ns | 584857000 ns | 1.15 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12484614.5 ns | 12384263 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2043435104.5 ns | 2073799791.5 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1631358667 ns | 1628408167 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1546812271 ns | 1495535812 ns | 1.03 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2216473375.5 ns | 2213815333 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49204869.5 ns | 49261027.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1642542 ns | 1644542 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1194625 ns | 1184062.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1380791 ns | 1367187.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2487084 ns | 2468292 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215546 ns | 217369 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12711687.5 ns | 12780979.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9927625 ns | 9943666 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9788604.5 ns | 9649896 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18464437.5 ns | 18379437 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1995889.5 ns | 2035807.5 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17669166.5 ns | 17754833 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14709437.5 ns | 14655042 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14807645.5 ns | 14543333 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21465708 ns | 21358459 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26583 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26167 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23873 ns | 23360 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66917 ns | 66834 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67333 ns | 67542 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67083 ns | 67083 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66833 ns | 66875 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 382426 ns | 392635.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203834 ns | 203542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209542 ns | 209584 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209584 ns | 209708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199584 ns | 199875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26132 ns | 25945.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 613833.5 ns | 608625 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 636667 ns | 632958.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 671166.5 ns | 622333 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 628229.5 ns | 584541.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 308600 ns | 349189 ns | 0.88 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 671687.5 ns | 653500 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 645937.5 ns | 670875 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 644791.5 ns | 547042 ns | 1.18 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 676334 ns | 675666.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131667 ns | 131441 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2241875 ns | 2289416 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2192250 ns | 2233958 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2297042 ns | 2245708 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2246249.5 ns | 2234188 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1114838 ns | 1153968 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16791 ns | 17583 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17500 ns | 17000 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20958 ns | 21083.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16770.5 ns | 17479 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143001 ns | 142918 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230375 ns | 226645.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231791.5 ns | 230417 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 266208 ns | 220688 ns | 1.21 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 260728.5 ns | 218917 ns | 1.19 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 959584 ns | 981199 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 458 ns | 1.09 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23163 ns | 23217 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9604.5 ns | 10041.5 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10292 ns | 10125 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 10417 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9584 ns | 9250 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 255611 ns | 255034 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5416.5 ns | 5916 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5750 ns | 6229.5 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9458 ns | 8563 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5708 ns | 5500 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 219432 ns | 222902 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7833 ns | 7250 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7709 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7709 ns | 7750 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7000 ns | 6958.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 764584 ns | 767625.5 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1959 ns | 2291 ns | 0.86 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2083 ns | 2250 ns | 0.93 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2417 ns | 2333 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2208 ns | 2333 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17893 ns | 17725 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6875 ns | 6542 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6542 ns | 6667 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6583 ns | 6958 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6291 ns | 6583 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 320459 ns | 317996.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 747709 ns | 748750 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 749833 ns | 747083 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 754999.5 ns | 747042 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 749375 ns | 749125 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21357 ns | 21402 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 774854 ns | 790729 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 792687.5 ns | 790333.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 817042 ns | 773125 ns | 1.06 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 811166 ns | 775458.5 ns | 1.05 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 295013.5 ns | 291072 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7334 ns | 7209 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6042 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5208.5 ns | 6083 ns | 0.86 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10166 ns | 10125 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33519 ns | 32814 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219666 ns | 220166 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 268125 ns | 240583 ns | 1.11 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 252000.5 ns | 228583 ns | 1.10 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213562 ns | 255708 ns | 0.84 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 354278 ns | 355564.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10875 ns | 12541 ns | 0.87 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11833 ns | 10500 ns | 1.13 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12770.5 ns | 13167 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12000 ns | 10125 ns | 1.19 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 238132.5 ns | 239405.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24708 ns | 24791.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24584 ns | 24375 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25292 ns | 25541 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24500 ns | 24812.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1094067.5 ns | 1085912 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106709834 ns | 108107292 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116906583.5 ns | 117455666.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 127036729 ns | 120529584 ns | 1.05 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117807000 ns | 117307042 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2657653 ns | 2652543 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 392558792 ns | 395929750 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 365774917 ns | 367066041 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 431860937.5 ns | 354756333 ns | 1.22 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 483379250 ns | 484413208 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15196086 ns | 15198392 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 758564875.5 ns | 767591687.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 761412666 ns | 579795958 ns | 1.31 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 748747542 ns | 743372729 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 765232583 ns | 765609167 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6625 ns | 7458.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7334 ns | 7479.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9041.5 ns | 8916 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8250 ns | 6708 ns | 1.23 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 231038.5 ns | 232243 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14625 ns | 13917 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14750 ns | 14125 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14292 ns | 15166 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14542 ns | 14458 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1043294.5 ns | 1035695.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 9042 ns | 0.65 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7959 ns | 6833 ns | 1.16 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9167 ns | 9750 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6333 ns | 5500 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 228571 ns | 227355.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12791 ns | 12625 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13167 ns | 12959 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13375 ns | 12917 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12333 ns | 12292 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 779066.5 ns | 753887 ns | 1.03 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 347625 ns | 327750 ns | 1.06 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 342625 ns | 342666.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 416812 ns | 398083 ns | 1.05 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 307083 ns | 317437.5 ns | 0.97 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17023 ns | 16593 ns | 1.03 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 710208.5 ns | 702854.5 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 732125 ns | 720833 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1032542 ns | 1025771 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 653979.5 ns | 661750 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 200196.5 ns | 196204.5 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 334 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 291 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23569 ns | 23062 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6333 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6584 ns | 6584 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6834 ns | 6792 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6042 ns | 6250 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 241926 ns | 236488 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 5833 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5875 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5708 ns | 5667 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24556.5 ns | 24282 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21562.5 ns | 21687 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22000 ns | 21584 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21709 ns | 21750 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21167 ns | 21063 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 265433.5 ns | 260349.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144917 ns | 172458 ns | 0.84 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 191292 ns | 185292 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149333 ns | 148917 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 149250 ns | 186625 ns | 0.80 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167659 ns | 166632 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1319292 ns | 1351354.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1331416 ns | 1310042 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1362958 ns | 1312208 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1326125 ns | 1317292 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1343729.5 ns | 1279433 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22250 ns | 24291 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23791 ns | 22125 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25875 ns | 25958 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23666.5 ns | 21916 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 286115 ns | 277859 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 146125 ns | 127896 ns | 1.14 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118500 ns | 174583 ns | 0.68 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 129833 ns | 118667 ns | 1.09 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 175792 ns | 135125 ns | 1.30 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1461317 ns | 1390180 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23352 ns | 22950 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6334 ns | 6416.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6459 ns | 6625 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6709 ns | 6834 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6250 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 258095.5 ns | 253555 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4625 ns | 6000 ns | 0.77 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4125 ns | 4167 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7625 ns | 7375 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4895.5 ns | 4666 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 256357.5 ns | 241371.5 ns | 1.06 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9959 ns | 10166 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 10042 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10333 ns | 10625 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10333 ns | 10333 ns | 1 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1358318.5 ns | 1304285.5 ns | 1.04 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1584 ns | 1584 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1583 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23389 ns | 22830 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5667 ns | 5708 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5875 ns | 5667 ns | 1.04 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6000 ns | 6042 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5625 ns | 5583 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 275350.5 ns | 270940 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6780125 ns | 6820479 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6371125 ns | 6334041.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6531396 ns | 6486416.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7625875 ns | 7665459 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214804 ns | 213607.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24015354 ns | 24142500 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21285667 ns | 21253833 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21085125 ns | 20999479 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29769250 ns | 29726209 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2112477.5 ns | 2083084.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37264541.5 ns | 37375166.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45538167 ns | 33959583 ns | 1.34 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45665125 ns | 45667583 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38235958 ns | 37873562.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 6979.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5958.5 ns | 6667 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8750 ns | 8104.5 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7500 ns | 5479.5 ns | 1.37 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 236550 ns | 228629.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8750 ns | 8375 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8375 ns | 1 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8584 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 8125 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1063848.5 ns | 1060872.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1554084 ns | 1527229 ns | 1.02 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1262375 ns | 1259812.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1631958.5 ns | 1616208 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2152375 ns | 2147979 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 277465 ns | 271439 ns | 1.02 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7881667 ns | 7973083.5 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6612667 ns | 6586020.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7276167 ns | 7034625 ns | 1.03 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10468062.5 ns | 10461334 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1876576 ns | 1861989 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 346375 ns | 318167 ns | 1.09 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 348937.5 ns | 341959 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 423416.5 ns | 408000 ns | 1.04 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 336687 ns | 345291 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46390 ns | 46596 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 735208 ns | 734812.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 782458 ns | 781000 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1081666.5 ns | 1068667 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 758458.5 ns | 746084 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 311011.5 ns | 299516.5 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397375 ns | 397708 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288250 ns | 288000 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 212583 ns | 288125 ns | 0.74 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 754104.5 ns | 752083 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44494 ns | 44143 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 675959 ns | 633750 ns | 1.07 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 532333 ns | 531000 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 474000 ns | 530834 ns | 0.89 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973417 ns | 973250 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189847 ns | 188258.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 599375 ns | 667374.5 ns | 0.90 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 650333 ns | 643458.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 660375 ns | 545833 ns | 1.21 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 655833.5 ns | 678833.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132321 ns | 131695 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2469395.5 ns | 2403188 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2363959 ns | 2439250 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2519875.5 ns | 2454541 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2465916 ns | 2454542 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1345989 ns | 1200754 ns | 1.12 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 345583 ns | 325000 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 342834 ns | 340500 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 416375 ns | 394250 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 306979.5 ns | 314000 ns | 0.98 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16330 ns | 15982 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 703104 ns | 702813 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 729708 ns | 719125 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1026937.5 ns | 1024146 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 645959 ns | 651667 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 199885.5 ns | 196545 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1460542 ns | 1458417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500583 ns | 1503167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1491791 ns | 1499542 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1441917 ns | 1439209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41671 ns | 40255 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5133500 ns | 5142459 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5293250 ns | 5295000.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5309521 ns | 5017687.5 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4977042 ns | 4991625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 197710 ns | 197920.5 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3666 ns | 3667 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33362 ns | 33701 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15125 ns | 14917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15500 ns | 15333 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15125 ns | 15375 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15083 ns | 15125 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 381216.5 ns | 380032 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71375 ns | 71375 ns | 1 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71208 ns | 71292 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71583 ns | 71250 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71208 ns | 71250 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113946.5 ns | 113118 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 319833 ns | 322292 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 319208 ns | 321459 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 327125 ns | 327292 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318375 ns | 318334 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 195156 ns | 196182.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 959 ns | 1000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23764 ns | 23902 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8084 ns | 7959 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8542 ns | 8083 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8416 ns | 8541 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7833.5 ns | 8125 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 263039 ns | 263222.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 472416 ns | 451021 ns | 1.05 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 468125 ns | 470667 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 549250 ns | 556978.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 550333 ns | 567333 ns | 0.97 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 128804.5 ns | 129930 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1375292 ns | 1413124.5 ns | 0.97 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1372208 ns | 1374375 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1633459 ns | 1599125 ns | 1.02 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1580500 ns | 1589500 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274739 ns | 275820 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 416 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31574 ns | 31985 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6375 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6875 ns | 6375 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6833 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6000 ns | 6291 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 261869 ns | 265480 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1727625 ns | 1723041.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1783958 ns | 1770375 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1730916 ns | 1726791 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1729333 ns | 1769792 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168455 ns | 169107.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4352625 ns | 4370833 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4372937.5 ns | 4358458 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4412458 ns | 4355958 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4358042 ns | 4350000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1234725 ns | 1170977 ns | 1.05 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6709 ns | 6625 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6584 ns | 6750 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7417 ns | 7041 ns | 1.05 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6542 ns | 9000 ns | 0.73 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 19619.5 ns | 21354 ns | 0.92 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51083 ns | 33104.5 ns | 1.54 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 35625 ns | 51458 ns | 0.69 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 49875 ns | 33083 ns | 1.51 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 70208 ns | 51042 ns | 1.38 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 211156 ns | 211403.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 354291 ns | 332479 ns | 1.07 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 347584 ns | 345500 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 432708 ns | 420625 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 319521.5 ns | 326208 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18053 ns | 18610.5 ns | 0.97 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 719104 ns | 719166 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 735979 ns | 732604 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1039063 ns | 1029625 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 672750 ns | 679354 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 343671.5 ns | 345590 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75417 ns | 75167 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75333 ns | 75125 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75708 ns | 75292 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 74709 ns | 74875 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46983 ns | 47792 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324417 ns | 334542 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 327000 ns | 340667 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 334917 ns | 326000 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 324083 ns | 326708 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 207721.5 ns | 213631.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1486334 ns | 1484750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1527500 ns | 1530208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1519000 ns | 1526875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1466541 ns | 1463833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51914 ns | 52711 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5119333.5 ns | 5145375.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5300396 ns | 5286834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5303708 ns | 4997792 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4989375 ns | 4998437.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 201413 ns | 207150 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28167 ns | 28209 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28166 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28333 ns | 28292 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28208 ns | 28209 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24393 ns | 24880 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66542 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66292 ns | 66584 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66542 ns | 66542 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66584 ns | 66541 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 530998 ns | 537867.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1493250 ns | 1339125 ns | 1.12 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1120167 ns | 1143854 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 947625 ns | 1056979.5 ns | 0.90 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2256500 ns | 2227833 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 570331 ns | 577124.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3075542 ns | 3019562 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2732479 ns | 2730250 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2643125 ns | 2578250 ns | 1.03 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3814770.5 ns | 3815792 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2010818 ns | 2002712 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8738917 ns | 8920709 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8777854.5 ns | 8781875 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8781417 ns | 8792854 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6360687.5 ns | 6367541.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81146 ns | 84000 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81708.5 ns | 82083 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83708 ns | 84583 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 87687.5 ns | 80791.5 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192383.5 ns | 192031 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2016791.5 ns | 2015625 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2012708 ns | 2019458.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2041312 ns | 1745917 ns | 1.17 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2015208 ns | 2013895.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 798885.5 ns | 797860.5 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
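
The third number in each row appears to be the first timing divided by the second, rounded to two decimals, so values above 1 mean the first run was slower. The sketch below rechecks a few rows from the table under that assumption. The names `ratio` and `flag_slower` and the 1.3 cutoff are hypothetical choices for illustration, not part of github-action-benchmark.

```python
# Minimal sketch: recompute the ratio column from two timings in nanoseconds.
# Assumption: ratio = first timing / second timing, rounded to 2 decimals.

def ratio(first_ns: float, second_ns: float) -> float:
    """Return first/second rounded to two decimals."""
    return round(first_ns / second_ns, 2)


def flag_slower(rows, cutoff=1.3):
    """Return names of rows whose ratio exceeds a (hypothetical) cutoff."""
    return [name for name, first, second in rows if ratio(first, second) > cutoff]


# A few rows copied from the table above: (name, first timing, second timing).
rows = [
    ("bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)", 6542, 9000),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)", 51083, 33104.5),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)", 2041312, 1745917),
]

for name, first, second in rows:
    print(f"{name}: {ratio(first, second):.2f}")

print(flag_slower(rows))
```

Several of the largest swings sit on microsecond-scale CPU benchmarks (the `bias_activation` rows), where run-to-run noise is likely a large share of the measurement, so a flagged row is a prompt to rerun rather than proof of a regression.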