
[Bug Report] Signed/unsigned arithmetic bugs in reshape ops #16104

Open
mrakitaTT opened this issue Dec 17, 2024 · 0 comments · May be fixed by #16253
mrakitaTT commented Dec 17, 2024

Describe the bug
Shape and LegacyShape use `std::size_t` as the return type of `rank()`. In several places in tt-metal/ttnn/cpp/ttnn/operations/data_movement/reshape_view/reshape.cpp there are statements like `for (int i = 0; i < tensor_shape.rank() - 2; i++)`, where the mixed signed/unsigned arithmetic can underflow for small ranks (here, when the rank is 1, `rank() - 2` wraps around to a huge unsigned value instead of evaluating to -1).

To Reproduce
Steps to reproduce the behavior:

  1. Run the reshape op in TILE layout for this shape: `tensor<128x1xf32> -> tensor<128xf32>`
  2. The underflow causes the for loop to continue past the shape rank, which leads to the error: `Index is out of bounds for the rank, should be between 0 and 0 however is 1`
@mrakitaTT mrakitaTT added the bug Something isn't working label Dec 17, 2024
@mrakitaTT mrakitaTT self-assigned this Dec 17, 2024
mrakitaTT added a commit to tenstorrent/tt-mlir that referenced this issue Dec 20, 2024
This PR adds TTNN workarounds for these Metal issues:
- tenstorrent/tt-metal#13361 - By decomposing
`reduce(keepDim=false)` into `reduce(keepDim=true) + reshape`
- tenstorrent/tt-metal#16118 - By dropping the
dimensions argument when all dims are being reduced

As part of this work I've also:
- Enabled conversion of `stablehlo.reduce` op with multiple reduce
dimensions
- Added reduce ops verifiers in TTIR
- Added a separate function in TTNNWorkarounds to run rewrite patterns
for decomposition and layout workarounds
- Added lots of unit tests for reduce ops to cover conversions and
verifiers
- Added lots of silicon tests for reduce ops

Opened issue #1624 on
myself to revert these workarounds once the Metal issues are fixed.

Closes #805, #848

After implementing these workarounds and running tests, I encountered
[another Metal
issue](tenstorrent/tt-metal#16104), this time
in the `reshape` op. I've debugged it and have a local fix; I will send
a PR to fix it in the Metal repo, as confirmed with the reshape op
owners. I've opened issue
#1640 to enable Reduce ops
silicon tests after this fix is uplifted.

Another issue I encountered while working on this: after the workaround
pass decompositions, if we change the shapes of the ops' tensors, their
layouts need to change too, but the layout pass runs before the
workaround pass. I've managed to solve this by reusing the layout of the
input tensor, but I'm not sure that's a good solution; we may need to
repeat some of the layout logic after the workaround decompositions. FYI
@sdjordjevicTT

Here is the example TTNN IR before the workarounds:
```
%3 = "ttnn.sum"(%2) <{dim_arg = [0: i32, 1 : i32, 2: i32], keep_dim = false}> : (tensor<128x32x4xf32, #ttnn_layout2>) -> tensor<1xf32, #ttnn_layout2>
```

and after the workarounds:
```
%3 = "ttnn.sum"(%2) <{keep_dim = true}> : (tensor<128x32x4xf32, #ttnn_layout2>) -> tensor<1x1x1xf32, #ttnn_layout2>
%4 = "ttnn.reshape"(%3) <{shape = [1 : i32]}> : (tensor<1x1x1xf32, #ttnn_layout2>) -> tensor<1xf32, #ttnn_layout3>
```