LoopLoadElim: calling LoopVersioning with single-iteration loop #96656

artagnon · 2024-06-25T15:35:15Z

The following example:

define void @lver.check.unnecessary(ptr %arg, ptr %arg1, i1 %arg2) {
entry:
  %load = load i32, ptr %arg, align 4
  br i1 %arg2, label %noloop.exit, label %loop.ph

loop.ph:                                          ; preds = %entry
  %sext7 = sext i32 %load to i64
  %gep8 = getelementptr i8, ptr %arg1, i64 8
  br label %loop.body

loop.body:                                        ; preds = %loop.body, %loop.ph
  %phi = phi i64 [ 0, %loop.ph ], [ %add, %loop.body ]
  %mul = mul i64 %phi, %sext7
  %gep10 = getelementptr double, ptr %gep8, i64 %mul
  %load11 = load double, ptr %gep10, align 8
  store double %load11, ptr %arg1, align 8
  %add = add i64 %phi, 1
  %icmp = icmp eq i64 %phi, 0
  br i1 %icmp, label %loop.exit, label %loop.body

noloop.exit:                                      ; preds = %entry
  %sext = sext i32 %load to i64
  %gep = getelementptr double, ptr %arg1, i64 %sext
  %load5 = load double, ptr %gep, align 8
  store double %load5, ptr %arg, align 8
  ret void

loop.exit:                                        ; preds = %loop.body
  ret void
}

has identical output after running loop-versioning before #92119. However, after that patch, the following diff is observed:

diff --git a/lver.before.ll b/lver.main.ll
index 9dce17d..fc81e12 100644
--- a/lver.before.ll
+++ b/lver.main.ll
@@ -4,22 +4,39 @@ source_filename = "hand-reduce.ll"
 define void @lver.check.unnecessary(ptr %arg, ptr %arg1, i1 %arg2) {
 entry:
   %load = load i32, ptr %arg, align 4
-  br i1 %arg2, label %noloop.exit, label %loop.ph
+  br i1 %arg2, label %noloop.exit, label %loop.body.lver.check

-loop.ph:                                          ; preds = %entry
+loop.body.lver.check:                             ; preds = %entry
   %sext7 = sext i32 %load to i64
   %gep8 = getelementptr i8, ptr %arg1, i64 8
+  %ident.check = icmp ne i32 %load, 1
+  br i1 %ident.check, label %loop.body.ph.lver.orig, label %loop.body.ph
+
+loop.body.ph.lver.orig:                           ; preds = %loop.body.lver.check
+  br label %loop.body.lver.orig
+
+loop.body.lver.orig:                              ; preds = %loop.body.lver.orig, %loop.body.ph.lver.orig
+  %phi.lver.orig = phi i64 [ 0, %loop.body.ph.lver.orig ], [ %add.lver.orig, %loop.body.lver.orig ]
+  %mul.lver.orig = mul i64 %phi.lver.orig, %sext7
+  %gep10.lver.orig = getelementptr double, ptr %gep8, i64 %mul.lver.orig
+  %load11.lver.orig = load double, ptr %gep10.lver.orig, align 8
+  store double %load11.lver.orig, ptr %arg1, align 8
+  %add.lver.orig = add i64 %phi.lver.orig, 1
+  %icmp.lver.orig = icmp eq i64 %phi.lver.orig, 0
+  br i1 %icmp.lver.orig, label %loop.exit.loopexit, label %loop.body.lver.orig
+
+loop.body.ph:                                     ; preds = %loop.body.lver.check
   br label %loop.body

-loop.body:                                        ; preds = %loop.body, %loop.ph
-  %phi = phi i64 [ 0, %loop.ph ], [ %add, %loop.body ]
+loop.body:                                        ; preds = %loop.body, %loop.body.ph
+  %phi = phi i64 [ 0, %loop.body.ph ], [ %add, %loop.body ]
   %mul = mul i64 %phi, %sext7
   %gep10 = getelementptr double, ptr %gep8, i64 %mul
   %load11 = load double, ptr %gep10, align 8
   store double %load11, ptr %arg1, align 8
   %add = add i64 %phi, 1
   %icmp = icmp eq i64 %phi, 0
-  br i1 %icmp, label %loop.exit, label %loop.body
+  br i1 %icmp, label %loop.exit.loopexit1, label %loop.body

 noloop.exit:                                      ; preds = %entry
   %sext = sext i32 %load to i64
@@ -28,6 +45,12 @@ noloop.exit:                                      ; preds = %entry
   store double %load5, ptr %arg, align 8
   ret void

-loop.exit:                                        ; preds = %loop.body
+loop.exit.loopexit:                               ; preds = %loop.body.lver.orig
+  br label %loop.exit
+
+loop.exit.loopexit1:                              ; preds = %loop.body
+  br label %loop.exit
+
+loop.exit:                                        ; preds = %loop.exit.loopexit1, %loop.exit.loopexit
   ret void
 }

This is a regression.

The underlying issue is in LoopAccessAnalysis, which produces a false equal predicate. The diff before and after running LAA on the example is:

diff --git a/laa.before b/laa.main
index 17b0a1b..1b3b1b0 100644
--- a/laa.before
+++ b/laa.main
@@ -7,5 +7,9 @@ Printing analysis 'Loop Access Analysis' for function 'lver.check.unnecessary':

     Non vectorizable stores to invariant address were not found in loop.
     SCEV assumptions:
+    Equal predicate: %load == 1

     Expressions re-written:
+    [PSE]  %gep10 = getelementptr double, ptr %gep8, i64 %mul:
+      {(8 + %arg1),+,(8 * (sext i32 %load to i64))<nsw>}<%loop.body>
+      --> {(8 + %arg1),+,8}<%loop.body>

The issue is in LoopAccessAnalysis, but the regression was seen in the user LoopVersioning. Hence, add pre-commit tests for both, in preparation to fix the issue in LoopAccessAnalysis.

Speculating the stride currently inserts a Stride == 1 predicate, which is equivalent to asserting that the that the loop executes atleast once. However, when the backedge-taken-count is known-non-negative, speculating the stride unnecessarily versions the loop. Avoid this. Fixes llvm#96656.

efriedma-quic · 2024-06-27T21:12:23Z

What makes it "unnecessary"? You have a strided load, and if the stride is 1 you can vectorize the load. I mean, in this particular case it's not really profitable because the loop can't be vectorized for other reasons, but the analysis of the load itself seems correct.

artagnon · 2024-06-28T09:03:33Z

What makes it "unnecessary"?

The fact that the loop is executed at least once. I'm not 100% sure, but I think LAA is only supposed to speculate on the stride when the TC is unknown and may be 0, in order to version the loop and insert a Stride == 1 predicate, so that one version of the loop executes with that predicate. See also #96927.

The issue is in LoopAccessAnalysis, but the regression was seen in the user LoopVersioning. Hence, add pre-commit tests for both, in preparation to fix the issue in LoopAccessAnalysis.

Speculating the stride currently inserts a Stride == 1 predicate, which is equivalent to asserting that the that the loop executes atleast once. However, when the backedge-taken-count is known-non-negative, speculating the stride unnecessarily versions the loop. Avoid this. Fixes llvm#96656.

fhahn · 2024-06-28T16:20:16Z

This is form an end-to-end run, right? If so, who is calling LoopVersioning on the loop with a single iteration? I agree that versioning for loops with a single iteration likely isn't profitable.

artagnon · 2024-06-28T17:20:54Z

Yes, it is from an end-to-end run. I'll audit the callers to fix the issue.

After pr96656.ll were added to LAA and LoopVersioning, it was decided that the bug is in a caller of LoopVersioning, not in LAA or LoopVersioning itself. The caller has now been found to be LoopLoadElim. Hence, re-organize the added tests to avoid confusion, and add a new reduced-test for llvm#96656 to LoopLoadElim, in preparation to fix the bug.

It is unnecessary for LoopLoadElim to version single-iteration loops. Don't call LoopVersioning when the BTC is known to be 1. Fixes llvm#96656.

The issue is in LoopAccessAnalysis, but the regression was seen in the user LoopVersioning. Hence, add pre-commit tests for both, in preparation to fix the issue in LoopAccessAnalysis.

artagnon · 2024-08-30T10:29:30Z

After much confusion (due to blind reliance on llvm-reduce), we now have a resolution. Yes, LoopVersioning versions single-iteration loops, but LoopLoadElimination never calls LoopVersioning with single-iteration loops, since it bails out before versioning the loop: the function responsible for this is findStoreToLoadDependences. This regression is actually about LLE becoming more powerful (although only a small regression is seen in the end-to-end compile), as LAA correctly turns certain dependences from Unknown to a known one.

The actual reproducer for this regression is:

define void @unknown_stride_known_dependence(ptr %x, ptr %y, i1 %cond) {
; CHECK-LABEL: define void @unknown_stride_known_dependence(
; CHECK-SAME: ptr [[X:%.*]], ptr [[Y:%.*]], i1 [[COND:%.*]]) {
; CHECK-NEXT:  [[ENTRY:.*:]]
; CHECK-NEXT:    [[LOAD:%.*]] = load i32, ptr [[X]], align 4
; CHECK-NEXT:    br i1 [[COND]], label %[[NOLOOP_EXIT:.*]], label %[[LOOP_LVER_CHECK:.*]]
; CHECK:       [[LOOP_LVER_CHECK]]:
; CHECK-NEXT:    [[SEXT_X:%.*]] = sext i32 [[LOAD]] to i64
; CHECK-NEXT:    [[GEP_8:%.*]] = getelementptr i8, ptr [[Y]], i64 8
; CHECK-NEXT:    [[GEP_16:%.*]] = getelementptr i8, ptr [[Y]], i64 16
; CHECK-NEXT:    [[IDENT_CHECK:%.*]] = icmp ne i32 [[LOAD]], 1
; CHECK-NEXT:    br i1 [[IDENT_CHECK]], label %[[LOOP_PH_LVER_ORIG:.*]], label %[[LOOP_PH:.*]]
; CHECK:       [[LOOP_PH_LVER_ORIG]]:
; CHECK-NEXT:    br label %[[LOOP_LVER_ORIG:.*]]
; CHECK:       [[LOOP_LVER_ORIG]]:
; CHECK-NEXT:    [[IV_LVER_ORIG:%.*]] = phi i64 [ 0, %[[LOOP_PH_LVER_ORIG]] ], [ [[IV_NEXT_LVER_ORIG:%.*]], %[[LOOP_LVER_ORIG]] ]
; CHECK-NEXT:    [[MUL_LVER_ORIG:%.*]] = mul i64 [[IV_LVER_ORIG]], [[SEXT_X]]
; CHECK-NEXT:    [[GEP_8_MUL_LVER_ORIG:%.*]] = getelementptr double, ptr [[GEP_8]], i64 [[MUL_LVER_ORIG]]
; CHECK-NEXT:    [[LOAD_8_LVER_ORIG:%.*]] = load double, ptr [[GEP_8_MUL_LVER_ORIG]], align 8
; CHECK-NEXT:    [[GEP_16_MUL_LVER_ORIG:%.*]] = getelementptr double, ptr [[GEP_16]], i64 [[MUL_LVER_ORIG]]
; CHECK-NEXT:    store double [[LOAD_8_LVER_ORIG]], ptr [[GEP_16_MUL_LVER_ORIG]], align 8
; CHECK-NEXT:    [[IV_NEXT_LVER_ORIG]] = add i64 [[IV_LVER_ORIG]], 1
; CHECK-NEXT:    [[ICMP_LVER_ORIG:%.*]] = icmp eq i64 [[IV_LVER_ORIG]], 1
; CHECK-NEXT:    br i1 [[ICMP_LVER_ORIG]], label %[[EXIT_LOOPEXIT_LOOPEXIT:.*]], label %[[LOOP_LVER_ORIG]]
; CHECK:       [[LOOP_PH]]:
; CHECK-NEXT:    [[LOAD_INITIAL:%.*]] = load double, ptr [[GEP_8]], align 8
; CHECK-NEXT:    br label %[[LOOP:.*]]
; CHECK:       [[LOOP]]:
; CHECK-NEXT:    [[STORE_FORWARDED:%.*]] = phi double [ [[LOAD_INITIAL]], %[[LOOP_PH]] ], [ [[STORE_FORWARDED]], %[[LOOP]] ]
; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[LOOP_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
; CHECK-NEXT:    [[MUL:%.*]] = mul i64 [[IV]], [[SEXT_X]]
; CHECK-NEXT:    [[GEP_8_MUL:%.*]] = getelementptr double, ptr [[GEP_8]], i64 [[MUL]]
; CHECK-NEXT:    [[LOAD_8:%.*]] = load double, ptr [[GEP_8_MUL]], align 8
; CHECK-NEXT:    [[GEP_16_MUL:%.*]] = getelementptr double, ptr [[GEP_16]], i64 [[MUL]]
; CHECK-NEXT:    store double [[STORE_FORWARDED]], ptr [[GEP_16_MUL]], align 8
; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
; CHECK-NEXT:    [[ICMP:%.*]] = icmp eq i64 [[IV]], 1
; CHECK-NEXT:    br i1 [[ICMP]], label %[[EXIT_LOOPEXIT_LOOPEXIT1:.*]], label %[[LOOP]]
; CHECK:       [[NOLOOP_EXIT]]:
; CHECK-NEXT:    [[SEXT:%.*]] = sext i32 [[LOAD]] to i64
; CHECK-NEXT:    [[GEP_Y:%.*]] = getelementptr double, ptr [[Y]], i64 [[SEXT]]
; CHECK-NEXT:    [[LOAD_Y:%.*]] = load double, ptr [[GEP_Y]], align 8
; CHECK-NEXT:    store double [[LOAD_Y]], ptr [[X]], align 8
; CHECK-NEXT:    br label %[[EXIT:.*]]
; CHECK:       [[EXIT_LOOPEXIT_LOOPEXIT]]:
; CHECK-NEXT:    br label %[[EXIT_LOOPEXIT:.*]]
; CHECK:       [[EXIT_LOOPEXIT_LOOPEXIT1]]:
; CHECK-NEXT:    br label %[[EXIT_LOOPEXIT]]
; CHECK:       [[EXIT_LOOPEXIT]]:
; CHECK-NEXT:    br label %[[EXIT]]
; CHECK:       [[EXIT]]:
; CHECK-NEXT:    ret void
;
entry:
  %load = load i32, ptr %x, align 4
  br i1 %cond, label %noloop.exit, label %loop.ph

loop.ph:                                              ; preds = %entry
  %sext.x = sext i32 %load to i64
  %gep.8 = getelementptr i8, ptr %y, i64 8
  %gep.16 = getelementptr i8, ptr %y, i64 16
  br label %loop

loop:                                                 ; preds = %loop, %loop.ph
  %iv = phi i64 [ 0, %loop.ph ], [ %iv.next, %loop ]
  %mul = mul i64 %iv, %sext.x
  %gep.8.mul = getelementptr double, ptr %gep.8, i64 %mul
  %load.8 = load double, ptr %gep.8.mul, align 8
  %gep.16.mul = getelementptr double, ptr %gep.16, i64 %mul
  store double %load.8, ptr %gep.16.mul
  %iv.next = add i64 %iv, 1
  %icmp = icmp eq i64 %iv, 1
  br i1 %icmp, label %exit, label %loop

noloop.exit:                                          ; preds = %loop.ph
  %sext = sext i32 %load to i64
  %gep.y = getelementptr double, ptr %y, i64 %sext
  %load.y = load double, ptr %gep.y
  store double %load.y, ptr %x
  br label %exit

exit:                                                 ; preds = %loop.body
  ret void
}

This is actually the case of BTC = 1, where the loop needs to be versioned for correctness. Previously, LLE bailed out in findStoreToLoadDependences, as the dependences were unknown, before versioning the loop. Now, there is a valid candidate for LLE, and indeed LLE is more powerful.

Previously, LAA returned the following:

Printing analysis 'Loop Access Analysis' for function 'dgtsv_':
  :
    Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Unknown data dependence.
    Dependences:
      Unknown:
          %16 = load double, ptr %15, align 8 ->
          store double %16, ptr %17, align 8

    Run-time memory checks:
    Grouped accesses:

    Non vectorizable stores to invariant address were not found in loop.
    SCEV assumptions:

    Expressions re-written:

Now, LAA returns:

Printing analysis 'Loop Access Analysis' for function 'dgtsv_':
  :
    Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Backward loop carried data dependence.
    Dependences:
      Backward:
          %16 = load double, ptr %15, align 8 ->
          store double %16, ptr %17, align 8

    Run-time memory checks:
    Grouped accesses:

    Non vectorizable stores to invariant address were not found in loop.
    SCEV assumptions:
    Equal predicate: %4 == 1

    Expressions re-written:
    [PSE]  %15 = getelementptr double, ptr %10, i64 %14:
      {(8 + %1),+,(8 * (sext i32 %4 to i64))<nsw>}<%12>
      --> {(8 + %1),+,8}<%12>
    [PSE]  %17 = getelementptr double, ptr %11, i64 %14:
      {(16 + %1),+,(8 * (sext i32 %4 to i64))<nsw>}<%12>
      --> {(16 + %1),+,8}<%12>

Yes, there is an equal-predicate, due to which loop-versioning versions the loop, but this output is correct, and callers of LAA are more powerful now. Hence, I would classify this bug as invalid.

After pr96656.ll were added to LAA and LoopVersioning, it was decided that the bug is in a caller of LoopVersioning, not in LAA or LoopVersioning itself. The new candidate was LoopLoadElim, but #96656 has since been marked invalid. Hence, re-organize the added tests to avoid confusion, and the testcase from the investigation to LoopLoadElim.

After pr96656.ll were added to LAA and LoopVersioning, it was decided that the bug is in a caller of LoopVersioning, not in LAA or LoopVersioning itself. The new candidate was LoopLoadElim, but llvm#96656 has since been marked invalid. Hence, re-organize the added tests to avoid confusion, and the testcase from the investigation to LoopLoadElim.

artagnon added regression loopoptim labels Jun 25, 2024

artagnon mentioned this issue Jun 25, 2024

LAA: strip unnecessary getUniqueCastUse #92119

Merged

artagnon changed the title ~~LoopVersioning: unnecessary lver.check inserted since #92119~~ LAA: unnecessary equal predicate since #92119 Jun 25, 2024

artagnon added the llvm:analysis label Jun 25, 2024

artagnon mentioned this issue Jun 27, 2024

LAA: don't version single-iteration loops #96927

Closed

artagnon mentioned this issue Jul 3, 2024

LoopLoadElim: re-org tests after invalid #96656 #97598

Merged

artagnon added a commit to artagnon/llvm-project that referenced this issue Jul 3, 2024

LoopLoadElim: don't version single-iteration loops

604c223

It is unnecessary for LoopLoadElim to version single-iteration loops. Don't call LoopVersioning when the BTC is known to be 1. Fixes llvm#96656.

artagnon mentioned this issue Jul 3, 2024

LoopLoadElim: don't version single-iteration loops #97599

Closed

artagnon changed the title ~~LAA: unnecessary equal predicate since #92119~~ LoopLoadElim: calling LoopVersioning with single-iteration loop Jul 3, 2024

artagnon closed this as not planned Won't fix, can't repro, duplicate, stale Aug 30, 2024

artagnon added a commit to artagnon/llvm-project that referenced this issue Aug 30, 2024

test: re-org, clean up after invalid llvm#96656

2eced31

EugeneZelenko added the question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead! label Aug 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LoopLoadElim: calling LoopVersioning with single-iteration loop #96656

LoopLoadElim: calling LoopVersioning with single-iteration loop #96656

artagnon commented Jun 25, 2024 •

edited

Loading

efriedma-quic commented Jun 27, 2024

artagnon commented Jun 28, 2024

fhahn commented Jun 28, 2024

artagnon commented Jun 28, 2024

artagnon commented Aug 30, 2024

LoopLoadElim: calling LoopVersioning with single-iteration loop #96656

LoopLoadElim: calling LoopVersioning with single-iteration loop #96656

Comments

artagnon commented Jun 25, 2024 • edited Loading

efriedma-quic commented Jun 27, 2024

artagnon commented Jun 28, 2024

fhahn commented Jun 28, 2024

artagnon commented Jun 28, 2024

artagnon commented Aug 30, 2024

artagnon commented Jun 25, 2024 •

edited

Loading