Use CPU copy with SharedStorage #445
Merged
Metal Benchmarks
Benchmark suite | Current: e9ac0d2 | Previous: ff7c7eb | Ratio |
---|---|---|---|
private array/construct | 27208.333333333332 ns | 26687.5 ns | 1.02 |
private array/broadcast | 455584 ns | 465979.5 ns | 0.98 |
private array/random/randn/Float32 | 1011500 ns | 993270.5 ns | 1.02 |
private array/random/randn!/Float32 | 631583 ns | 632166.5 ns | 1.00 |
private array/random/rand!/Int64 | 577417 ns | 568500 ns | 1.02 |
private array/random/rand!/Float32 | 586000 ns | 583500 ns | 1.00 |
private array/random/rand/Int64 | 877125 ns | 880458 ns | 1.00 |
private array/random/rand/Float32 | 703750 ns | 844333.5 ns | 0.83 |
private array/copyto!/gpu_to_gpu | 622250 ns | 614333 ns | 1.01 |
private array/copyto!/cpu_to_gpu | 692250 ns | 739479 ns | 0.94 |
private array/copyto!/gpu_to_cpu | 594083.5 ns | 599208 ns | 0.99 |
private array/accumulate/1d | 1434083 ns | 1447750.5 ns | 0.99 |
private array/accumulate/2d | 1479500 ns | 1496375 ns | 0.99 |
private array/iteration/findall/int | 2218500 ns | 2263917 ns | 0.98 |
private array/iteration/findall/bool | 2002187.5 ns | 1989875 ns | 1.01 |
private array/iteration/findfirst/int | 1688250 ns | 1678000 ns | 1.01 |
private array/iteration/findfirst/bool | 1650625 ns | 1663625 ns | 0.99 |
private array/iteration/scalar | 2399750 ns | 2393834 ns | 1.00 |
private array/iteration/logical | 3446416 ns | 3431520.5 ns | 1.00 |
private array/iteration/findmin/1d | 1757084 ns | 1794125 ns | 0.98 |
private array/iteration/findmin/2d | 1358875 ns | 1403416 ns | 0.97 |
private array/reductions/reduce/1d | 800917 ns | 805792 ns | 0.99 |
private array/reductions/reduce/2d | 700479.5 ns | 704146 ns | 0.99 |
private array/reductions/mapreduce/1d | 811125 ns | 815812.5 ns | 0.99 |
private array/reductions/mapreduce/2d | 701166.5 ns | 716666.5 ns | 0.98 |
private array/permutedims/4d | 947645.5 ns | 943959 ns | 1.00 |
private array/permutedims/2d | 950791 ns | 938875 ns | 1.01 |
private array/permutedims/3d | 1007916 ns | 1005416.5 ns | 1.00 |
private array/copy | 876354.5 ns | 862875 ns | 1.02 |
latency/precompile | 4414162875 ns | 4407793041 ns | 1.00 |
latency/ttfp | 6916084749.5 ns | 6915521687.5 ns | 1.00 |
latency/import | 726415791.5 ns | 726643917 ns | 1.00 |
integration/metaldevrt | 743792 ns | 749270.5 ns | 0.99 |
integration/byval/slices=1 | 1482750 ns | 1557959 ns | 0.95 |
integration/byval/slices=3 | 8832249.5 ns | 8832020.5 ns | 1.00 |
integration/byval/reference | 1515979 ns | 1611291 ns | 0.94 |
integration/byval/slices=2 | 2747375 ns | 2583750 ns | 1.06 |
kernel/indexing | 469583 ns | 476584 ns | 0.99 |
kernel/indexing_checked | 444083 ns | 441500 ns | 1.01 |
kernel/launch | 11125 ns | 10875 ns | 1.02 |
metal/synchronization/stream | 19292 ns | 19208 ns | 1.00 |
metal/synchronization/context | 19792 ns | 19750 ns | 1.00 |
shared array/construct | 24017.416666666664 ns | 23756.916666666664 ns | 1.01 |
shared array/broadcast | 466625 ns | 469584 ns | 0.99 |
shared array/random/randn/Float32 | 1024625 ns | 1020166 ns | 1.00 |
shared array/random/randn!/Float32 | 632917 ns | 634458 ns | 1.00 |
shared array/random/rand!/Int64 | 579292 ns | 572000 ns | 1.01 |
shared array/random/rand!/Float32 | 598750 ns | 593208.5 ns | 1.01 |
shared array/random/rand/Int64 | 862833 ns | 742792 ns | 1.16 |
shared array/random/rand/Float32 | 883625 ns | 898812.5 ns | 0.98 |
shared array/copyto!/gpu_to_gpu | 97125 ns | 659667 ns | 0.15 |
shared array/copyto!/cpu_to_gpu | 87542 ns | 94458 ns | 0.93 |
shared array/copyto!/gpu_to_cpu | 82041 ns | 84333 ns | 0.97 |
shared array/accumulate/1d | 1434500 ns | 1418250 ns | 1.01 |
shared array/accumulate/2d | 1492917 ns | 1500167 ns | 1.00 |
shared array/iteration/findall/int | 1972125 ns | 1939666 ns | 1.02 |
shared array/iteration/findall/bool | 1780625 ns | 1746333 ns | 1.02 |
shared array/iteration/findfirst/int | 1405208 ns | 1413458 ns | 0.99 |
shared array/iteration/findfirst/bool | 1369834 ns | 1374750 ns | 1.00 |
shared array/iteration/scalar | 187667 ns | 189167 ns | 0.99 |
shared array/iteration/logical | 3193624.5 ns | 3212770.5 ns | 0.99 |
shared array/iteration/findmin/1d | 1460500 ns | 1481709 ns | 0.99 |
shared array/iteration/findmin/2d | 1374084 ns | 1379250 ns | 1.00 |
shared array/reductions/reduce/1d | 673729 ns | 659583 ns | 1.02 |
shared array/reductions/reduce/2d | 698209 ns | 706354 ns | 0.99 |
shared array/reductions/mapreduce/1d | 631187 ns | 620667 ns | 1.02 |
shared array/reductions/mapreduce/2d | 706416.5 ns | 704958.5 ns | 1.00 |
shared array/permutedims/4d | 954291 ns | 963438 ns | 0.99 |
shared array/permutedims/2d | 918604 ns | 939020.5 ns | 0.98 |
shared array/permutedims/3d | 1013459 ns | 1003520.5 ns | 1.01 |
shared array/copy | 239958.5 ns | 880541 ns | 0.27 |
This comment was automatically generated by workflow using github-action-benchmark.
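For reading the table: the Ratio column matches Current divided by Previous on the rows checked here, so a value below 1 means the PR is faster. A minimal Julia sketch of that arithmetic, using two shared-array rows copied from the table (the helper name `ratio` is made up for illustration and is not part of any benchmark tooling):

```julia
# Timings in nanoseconds, copied from two shared-array rows of the table.
rows = [
    ("shared array/copyto!/gpu_to_gpu", 97125.0, 659667.0),
    ("shared array/copy", 239958.5, 880541.0),
]

# Hypothetical helper: ratio as it appears to be reported (current / previous).
ratio(current, previous) = current / previous

for (name, current, previous) in rows
    r = ratio(current, previous)
    # 1 / r is the speedup factor of the current commit over the previous one.
    println(name, ": ratio ", round(r; digits=2), ", speedup ", round(1 / r; digits=2), "x")
end
```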
christiangnrd added the speculative (Not sure if we want this.) and performance (Gotta go fast.) labels on Oct 4, 2024
christiangnrd force-pushed the fastercopy branch from e9ac0d2 to d72ce7d on October 4, 2024 17:46
[only special]
christiangnrd force-pushed the fastercopy branch from d72ce7d to 509664a on October 5, 2024 18:27
I think so; we have similar optimizations in CUDA.jl with unified memory. Copies from and to CPU memory are blocking anyway.
Use CPU copy for shared storage arrays to avoid ObjectiveC.jl overhead.
Is this even a good idea?
Depends on #452
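
A rough sketch of the idea in plain Julia, for readers who have not seen the diff. It is not the code from this PR and does not use Metal.jl: `ToyArray`, `Shared`, `Private`, `gpu_blit!`, and `toy_copyto!` are all hypothetical stand-ins. The assumption is that shared-storage memory is host-visible, so a copy between two shared arrays can be an ordinary CPU `copyto!`, while everything else goes through the (here simulated) GPU blit path.

```julia
# Hypothetical storage-mode tags; stand-ins, not Metal.jl's actual types.
abstract type StorageMode end
struct Shared <: StorageMode end
struct Private <: StorageMode end

# Toy "device array": a host buffer tagged with a storage mode.
struct ToyArray{S<:StorageMode,T}
    data::Vector{T}
end
ToyArray{S}(data::Vector{T}) where {S<:StorageMode,T} = ToyArray{S,T}(data)

# Simulated GPU path: counts how often a blit would have been encoded.
const blit_count = Ref(0)
function gpu_blit!(dst::ToyArray, src::ToyArray, n::Integer)
    blit_count[] += 1
    copyto!(dst.data, 1, src.data, 1, n)
    return dst
end

# Shared -> Shared: both buffers are host-visible, so copy on the CPU.
# Assumption: a real implementation would first wait for pending GPU work.
function toy_copyto!(dst::ToyArray{Shared}, src::ToyArray{Shared}, n::Integer)
    copyto!(dst.data, 1, src.data, 1, n)
    return dst
end

# Any other combination falls back to the GPU blit path.
toy_copyto!(dst::ToyArray, src::ToyArray, n::Integer) = gpu_blit!(dst, src, n)
```

With this dispatch, `toy_copyto!(ToyArray{Shared}(zeros(Int, 3)), ToyArray{Shared}([1, 2, 3]), 3)` never touches `gpu_blit!`, which corresponds to the large drop in the `shared array/copyto!/gpu_to_gpu` and `shared array/copy` rows of the benchmark table.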