
Switch CI to 1.11. #462

Merged 1 commit into main on Oct 19, 2024

Conversation

@maleadt (Member) commented Oct 17, 2024

No description provided.

@christiangnrd (Contributor)

Should we also revert 71b784e?

@maleadt (Member, Author) commented Oct 17, 2024

> Should we also revert 71b784e?

Yeah, sure. Is that the reason for the compilation failure? I thought I had fixed the underlying issues, #370 (comment).

@christiangnrd (Contributor) commented Oct 17, 2024

> > Should we also revert 71b784e?
>
> Yeah, sure. Is that the reason for the compilation failure? I thought I had fixed the underlying issues, #370 (comment).

It is not. I just figured that now that we have more than one macOS 15 runner we might as well benchmark with the latest features (potentially) active.

This failure seems to be unrelated to the #370 failure, as this one is fully reproducible.

Working example from perf/byval.jl:

```julia
using Metal

begin

const threads = 256

@inline get_inputs3(indx_y, a, b, c)                            = (a, b, c)
@inline get_inputs3(indx_y, a1, a2, b1, b2, c1, c2)             = indx_y == 1 ? (a1, b1, c1) : (a2, b2, c2)
@inline get_inputs3(indx_y, a1, a2, a3, b1, b2, b3, c1, c2, c3) = indx_y == 1 ? (a1, b1, c1) : indx_y == 2 ? (a2, b2, c2) : (a3, b3, c3)

# add arrays of matrices kernel
function kernel_add_mat_z_slices(n, vararg...)
    x1, x2, y = get_inputs3(threadgroup_position_in_grid_2d().y, vararg...)
    i = thread_position_in_grid_1d()
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

function add_z_slices!(y, x1, x2)
    m1, n1 = size(x1[1])  # get size of first slice
    groups = (m1 * n1 + threads - 1) ÷ threads
    # launch length(x1) times more groups than needed to process 1 slice
    @metal groups = groups, length(x1) threads = threads kernel_add_mat_z_slices(m1 * n1, x1..., x2..., y...)
end

num_z_slices = 3
m, n = 3072, 1536    # 256 multiplier

x1 = [mtl(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
x2 = [mtl(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
y1 = [similar(x1[1]) for i = 1:num_z_slices]

# adding arrays in an array
Metal.@sync add_z_slices!(y1[1:num_z_slices], x1[1:num_z_slices], x2[1:num_z_slices])

end
```

It doesn't fail when `num_z_slices = 2`.

@maleadt (Member, Author) commented Oct 18, 2024

Something wrong with the debug info:

      "frames": [
        {
          "imageOffset": 56140,
          "symbol": "__abort_with_payload",
          "symbolLocation": 8,
          "imageIndex": 1
        },
        {
          "imageOffset": 212612,
          "symbol": "abort_with_payload_wrapper_internal",
          "symbolLocation": 104,
          "imageIndex": 1
        },
        {
          "imageOffset": 212508,
          "symbol": "abort_with_reason",
          "symbolLocation": 32,
          "imageIndex": 1
        },
        {
          "imageOffset": 598872,
          "symbol": "fatalErrorHandler(void*, char const*, bool)",
          "symbolLocation": 724,
          "imageIndex": 8
        },
        {
          "imageOffset": 24336744,
          "symbol": "llvm::report_fatal_error(llvm::Twine const&, bool)",
          "symbolLocation": 412,
          "imageIndex": 9
        },
        {
          "imageOffset": 30557624,
          "symbol": "llvm::report_fatal_error(char const*, bool)",
          "symbolLocation": 52,
          "imageIndex": 9
        },
        {
          "imageOffset": 12603572,
          "symbol": "llvm::UpgradeDebugInfo(llvm::Module&)",
          "symbolLocation": 452,
          "imageIndex": 9
        },

EDIT: wait, is this IR even valid?

```
❯ ./metallib-dis broken.metallib -o broken.ll
PHI nodes not grouped at top of basic block!
  %.in.in.in = phi {}* [ %60, %guard_pass35 ], [ %58, %guard_pass27 ], [ %59, %conversion ]
label %L26
PHI nodes not grouped at top of basic block!
  %.in128.in.in = phi {}* [ %63, %guard_pass35 ], [ %61, %guard_pass27 ], [ %62, %conversion ]
label %L26
PHI nodes not grouped at top of basic block!
  %.in129.in.in = phi {}* [ %66, %guard_pass35 ], [ %64, %guard_pass27 ], [ %65, %conversion ]
label %L26
```
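(For context: the LLVM verifier requires all `phi` instructions to form one contiguous group at the very top of their basic block, before any other instruction. A minimal hand-written sketch of the rule, unrelated to the broken module above:)

```llvm
; Valid: both phis sit at the top of %merge, before any other instruction.
define i32 @phi_rule(i1 %c, i32 %x, i32 %y) {
entry:
  br i1 %c, label %a, label %b

a:
  br label %merge

b:
  br label %merge

merge:
  %v = phi i32 [ %x, %a ], [ %y, %b ]   ; phis must come first...
  %w = phi i32 [ %y, %a ], [ %x, %b ]   ; ...as one contiguous group
  %sum = add i32 %v, %w                 ; only then ordinary instructions
  ret i32 %sum
}
```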

@maleadt (Member, Author) commented Oct 18, 2024

Looks like the IR downgrader is messing up the IR.

@maleadt (Member, Author) commented Oct 18, 2024

MWE:

```llvm
define void @kernel() {
top:
  store ptr addrspace(1) null, ptr null, align 8
  br label %block

block:
  %in = phi ptr [ null, %top ]
  ret void
}
```

The problem is that the downgrader inserts bitcasts before the phi:

```llvm
define void @kernel() {
top:
  %0 = bitcast {}* null to {} addrspace(1)**
  store {} addrspace(1)* null, {} addrspace(1)** %0, align 8
  br label %block

block:                                            ; preds = %top
  %1 = bitcast {}* null to {}*
  %in = phi {}* [ %1, %top ]
  ret void
}
```
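For reference, a well-formed downgrade would have to materialize such casts in the predecessor block, ahead of its terminator, so that the `phi` remains the first instruction of `%block`. Roughly (a sketch of the expected output, not what any tool currently emits):

```llvm
define void @kernel() {
top:
  %0 = bitcast {}* null to {} addrspace(1)**
  store {} addrspace(1)* null, {} addrspace(1)** %0, align 8
  %1 = bitcast {}* null to {}*   ; cast hoisted into the predecessor,
  br label %block                ; before the terminator

block:                                            ; preds = %top
  %in = phi {}* [ %1, %top ]     ; phi stays first in its block
  ret void
}
```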

@maleadt (Member, Author) commented Oct 18, 2024

@github-actions (bot) left a comment

Metal Benchmarks

| Benchmark suite | Current: 63c587a | Previous: 100f831 | Ratio |
|---|---|---|---|
| private array/construct | 27428.571428571428 ns | 26340.333333333332 ns | 1.04 |
| private array/broadcast | 459333 ns | 465791 ns | 0.99 |
| private array/random/randn/Float32 | 756541 ns | 827937.5 ns | 0.91 |
| private array/random/randn!/Float32 | 643500 ns | 635250 ns | 1.01 |
| private array/random/rand!/Int64 | 546687.5 ns | 562291.5 ns | 0.97 |
| private array/random/rand!/Float32 | 587166 ns | 594500 ns | 0.99 |
| private array/random/rand/Int64 | 770250 ns | 801791 ns | 0.96 |
| private array/random/rand/Float32 | 602291.5 ns | 591208 ns | 1.02 |
| private array/copyto!/gpu_to_gpu | 643021 ns | 645500 ns | 1.00 |
| private array/copyto!/cpu_to_gpu | 725124.5 ns | 618187.5 ns | 1.17 |
| private array/copyto!/gpu_to_cpu | 637187.5 ns | 798500 ns | 0.80 |
| private array/accumulate/1d | 1343999.5 ns | 1333083 ns | 1.01 |
| private array/accumulate/2d | 1404917 ns | 1424500 ns | 0.99 |
| private array/iteration/findall/int | 2096271 ns | 2100167 ns | 1.00 |
| private array/iteration/findall/bool | 1818541 ns | 1847000 ns | 0.98 |
| private array/iteration/findfirst/int | 1695083 ns | 1696166.5 ns | 1.00 |
| private array/iteration/findfirst/bool | 1674750 ns | 1651958.5 ns | 1.01 |
| private array/iteration/scalar | 3856833 ns | 3657771 ns | 1.05 |
| private array/iteration/logical | 3206167 ns | 3264437.5 ns | 0.98 |
| private array/iteration/findmin/1d | 1762167 ns | 1565166 ns | 1.13 |
| private array/iteration/findmin/2d | 1355333 ns | 1351333.5 ns | 1.00 |
| private array/reductions/reduce/1d | 1025979 ns | 1063291 ns | 0.96 |
| private array/reductions/reduce/2d | 664125 ns | 695645.5 ns | 0.95 |
| private array/reductions/mapreduce/1d | 1035916 ns | 1078084 ns | 0.96 |
| private array/reductions/mapreduce/2d | 665167 ns | 705166 ns | 0.94 |
| private array/permutedims/4d | 2713625 ns | 860084 ns | 3.16 |
| private array/permutedims/2d | 1005167 ns | 862229.5 ns | 1.17 |
| private array/permutedims/3d | 1578979 ns | 919520.5 ns | 1.72 |
| private array/copy | 583396 ns | 574854 ns | 1.01 |
| latency/precompile | 5199144875 ns | 4396587542 ns | 1.18 |
| latency/ttfp | 6483345937.5 ns | 6698494124.5 ns | 0.97 |
| latency/import | 1137595437.5 ns | 722852834 ns | 1.57 |
| integration/metaldevrt | 686916.5 ns | 719875 ns | 0.95 |
| integration/byval/slices=1 | 1559417 ns | 1530167 ns | 1.02 |
| integration/byval/slices=3 | 10793187.5 ns | 9115541.5 ns | 1.18 |
| integration/byval/reference | 1529583 ns | 1520271 ns | 1.01 |
| integration/byval/slices=2 | 2556458.5 ns | 2666416 ns | 0.96 |
| kernel/indexing | 451334 ns | 468541 ns | 0.96 |
| kernel/indexing_checked | 452750 ns | 461292 ns | 0.98 |
| kernel/launch | 10437.333333333334 ns | 8834 ns | 1.18 |
| metal/synchronization/stream | 14000 ns | 14583 ns | 0.96 |
| metal/synchronization/context | 15042 ns | 15250 ns | 0.99 |
| shared array/construct | 26680.5 ns | 26069.5 ns | 1.02 |
| shared array/broadcast | 465167 ns | 468333 ns | 0.99 |
| shared array/random/randn/Float32 | 773792 ns | 785583 ns | 0.98 |
| shared array/random/randn!/Float32 | 647854 ns | 626541.5 ns | 1.03 |
| shared array/random/rand!/Int64 | 546791 ns | 564084 ns | 0.97 |
| shared array/random/rand!/Float32 | 586958 ns | 598792 ns | 0.98 |
| shared array/random/rand/Int64 | 772687.5 ns | 788666 ns | 0.98 |
| shared array/random/rand/Float32 | 566584 ns | 629791 ns | 0.90 |
| shared array/copyto!/gpu_to_gpu | 85208 ns | 96916 ns | 0.88 |
| shared array/copyto!/cpu_to_gpu | 88334 ns | 88583 ns | 1.00 |
| shared array/copyto!/gpu_to_cpu | 82458 ns | 83458 ns | 0.99 |
| shared array/accumulate/1d | 1362375 ns | 1356667 ns | 1.00 |
| shared array/accumulate/2d | 1392229.5 ns | 1421333 ns | 0.98 |
| shared array/iteration/findall/int | 1797792 ns | 1792833 ns | 1.00 |
| shared array/iteration/findall/bool | 1587625 ns | 1620166.5 ns | 0.98 |
| shared array/iteration/findfirst/int | 1409958 ns | 1385791 ns | 1.02 |
| shared array/iteration/findfirst/bool | 1362958.5 ns | 1376291 ns | 0.99 |
| shared array/iteration/scalar | 151708.5 ns | 151458 ns | 1.00 |
| shared array/iteration/logical | 2995771 ns | 3042333 ns | 0.98 |
| shared array/iteration/findmin/1d | 1459063 ns | 1274875 ns | 1.14 |
| shared array/iteration/findmin/2d | 1366021 ns | 1346333 ns | 1.01 |
| shared array/reductions/reduce/1d | 714875 ns | 694458 ns | 1.03 |
| shared array/reductions/reduce/2d | 668479 ns | 702292 ns | 0.95 |
| shared array/reductions/mapreduce/1d | 740917 ns | 754229 ns | 0.98 |
| shared array/reductions/mapreduce/2d | 659625 ns | 705395.5 ns | 0.94 |
| shared array/permutedims/4d | 2707708.5 ns | 858875 ns | 3.15 |
| shared array/permutedims/2d | 1016083 ns | 862292 ns | 1.18 |
| shared array/permutedims/3d | 1581375 ns | 923916.5 ns | 1.71 |
| shared array/copy | 242750 ns | 246583 ns | 0.98 |

This comment was automatically generated by a workflow using github-action-benchmark.

@maleadt maleadt merged commit 15ac66d into main Oct 19, 2024
2 checks passed
@maleadt maleadt deleted the tb/1.11 branch October 19, 2024 18:02