
Add support for DRAM Prefetcher op #16244

Open
wants to merge 98 commits into main from avora/pf-mm-rebase
Conversation

@avoraTT commented Dec 20, 2024

Ticket

Problem description

One of the main blockers to achieving 80 t/s/u on TG for the Llama family of models is the set of 5 DRAM-bound matmuls in the model (QKV, DO, FF1/2/3).

What's changed

Add a new ttnn.dram_prefetcher op that runs asynchronously in the background and prefetches the matmul weights from DRAM into L1.

Interface

The DRAM Prefetcher Op takes in the following args:

  • a list of ttnn tensors (1 layer), used to get the shapes of all the tensors in one layer
  • a ttnn tensor containing the DRAM addresses of all tensors in all layers
    • The format is as follows: [t1_l1, t2_l1, ..., t1_l2, t2_l2, ..., t1_l3, t2_l3, ...]
  • the number of layers
  • the global circular buffer object
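
For illustration, here is a minimal sketch of how the op might be invoked from Python. The positional order mirrors the list above, but the exact signature and argument names are assumptions rather than the code added by this PR.

```python
import ttnn

# Sketch only: argument order/names follow the list above and may not match
# the exact ttnn.dram_prefetcher signature introduced in this PR.
prefetched = ttnn.dram_prefetcher(
    layer_tensors,   # list of ttnn tensors for 1 layer (provides shapes/dtypes)
    tensor_addrs,    # ttnn tensor of DRAM addresses for all tensors in all layers
    num_layers,      # number of layers to prefetch
    global_cb,       # global circular buffer shared with the matmul receiver cores
)
```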

Prefetcher

Each DRAM bank has a closest core, which we call the DRAM reader core. The prefetcher runs on these cores. The reader kernel reads a tensor from DRAM and stores it in a local CB. The writer kernel reads from the local CB and uses NOC0 to write to 2 neighboring cores, calling remote_cb_push_back on the Global CB provided by the user. These neighboring cores (aka receiver cores) are the consumers of the prefetched tensors. As such, the matmuls must be performed on a CoreRangeSet made up of these specific receiver cores. With 12 DRAM reader cores, each writing to 2 neighboring cores, there are 24 cores on which to perform the matmul.

Here's an example of what the grid looks like on a TG. The red cores are the DRAM reader cores and the purple cores are the receiver cores, i.e. the matmul cores.
[Figure: TG core grid with DRAM reader cores (red) and receiver/matmul cores (purple)]
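
For illustration, a sketch (with made-up coordinates) of how the receiver cores could be collected into the CoreRangeSet the matmul runs on; the real placement follows the grid in the figure above.

```python
import ttnn

# Illustrative coordinates only: on real hardware the reader cores sit next to
# the DRAM banks, and each reader feeds its 2 neighbouring receiver cores.
reader_cores = [ttnn.CoreCoord(0, y) for y in range(12)]                          # 12 DRAM reader cores
receiver_cores = [ttnn.CoreCoord(x, c.y) for c in reader_cores for x in (1, 2)]   # 2 per reader -> 24 cores

# The matmul must run on exactly these receiver cores.
matmul_core_range_set = ttnn.CoreRangeSet({ttnn.CoreRange(c, c) for c in receiver_cores})
```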

Matmul

The prefetcher is designed to be paired with a Matmul1D in gather_in0 mode (where the activations are ring-gathered instead of being mcasted; see details in #14964). For this matmul, both the activations and the weights must be sharded. When combined with the prefetcher, the Global CB is used as a synchronization mechanism (remote_cb_wait_front). This leads to a seamless overlap between the prefetcher writing weights into the matmul cores and the matmul op consuming them.

However, since both ops involve data movement across cores (prefetcher: writing to receiver cores; matmul: gathering activations), it is important to use separate NOCs to avoid NOC congestion. As such, the matmul ring is ordered in a specific fashion, such that only NOC1 is used (see diagram above).

As seen above, the NOC1 matmul ring contains an extra core at (4,8), which is required to complete the ring while satisfying the constraint of only using NOC1. This core is called a hop_core. This PR also adds support in Matmul1D's gather_in0 mode to take in a list of hop_cores at the end of the ring. These cores are used only for data movement: they complete the ring so that the activations can be gathered, and they are not involved in any computation.
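
As a sketch, the hop core can be expressed as its own CoreRangeSet and handed to the matmul's gather_in0 program config; the hop_cores parameter name comes from this PR's description, while the surrounding config fields are omitted and the exact field layout is an assumption.

```python
import ttnn

# The hop core at (4, 8) only forwards data to close the NOC1 ring; it does no compute.
hop_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(4, 8), ttnn.CoreCoord(4, 8))})

# Hypothetical keyword usage; the other 1D matmul program-config fields
# (grid size, block/subblock shapes, per-core M/N, ...) are omitted here.
gather_in0_extras = dict(gather_in0=True, hop_cores=hop_cores)
```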

Here are the results for the ring gather in an FF1 matmul, measured on a 900 MHz WH machine. Although the NOC0/1 ring is faster in isolation, the NOC1-only ring with hop_cores does not slow down due to interference from the prefetcher.

| Matmul | No interference (us) | With interference (us) | Slowdown due to interference |
| --- | --- | --- | --- |
| NOC0/1 ring | 8.14 | 9.96 | 22.36% |
| NOC1 ring | 8.67 | 8.78 | 1.27% |
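
The slowdown column is just the relative increase of the interfered time over the baseline:

```python
# Slowdown due to interference = with_interference / no_interference - 1
for name, base_us, interfered_us in [("NOC0/1 ring", 8.14, 9.96), ("NOC1 ring", 8.67, 8.78)]:
    print(f"{name}: {(interfered_us / base_us - 1) * 100:.2f}% slowdown")
# NOC0/1 ring: 22.36% slowdown
# NOC1 ring: 1.27% slowdown
```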

To handle the different in1 tensor storage cases, the matmul compute kernel needs to manage the read pointers manually. In the global CB, an in1 tensor can either be allocated contiguously or be split into bottom and top parts, since it may reach the bottom of the global CB and wrap back to the top. Each core can also start at a different block id, so a core may start reading from the top and later read the bottom, or vice versa.
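
A minimal, self-contained sketch of that wraparound bookkeeping (plain Python with illustrative names; the real logic lives in the matmul compute kernel's read-pointer handling):

```python
# Model of reading num_blocks in1 blocks from a circular region of
# cb_size_tiles tiles, starting at an arbitrary block id. When the read
# pointer would run past the bottom of the global CB, it wraps back to the
# top, so a tensor may be consumed as a bottom part followed by a top part.
def block_read_offsets(start_block, num_blocks, block_num_tiles, cb_size_tiles):
    offsets = []
    read_ptr = start_block * block_num_tiles
    for _ in range(num_blocks):
        if read_ptr + block_num_tiles > cb_size_tiles:
            read_ptr = 0  # wrap back to the top of the CB
        offsets.append(read_ptr)
        read_ptr += block_num_tiles
    return offsets

# A core starting at block 4 of a 6-block CB reads two blocks at the bottom,
# then wraps and continues from the top.
print(block_read_offsets(start_block=4, num_blocks=8, block_num_tiles=64, cb_size_tiles=384))
```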

Putting it all together

To combine the prefetcher and the matmul, each must run in its own SubDevice. The DRAM reader cores are placed in a SubDevice that is separate from the matmul cores. Once launched, both ops run in parallel, with the matmul stalling until it receives the weights from the prefetcher.
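
A hedged sketch of the sub-device split follows; the SubDevice/manager method names follow the tt-metal sub-device interface and are assumptions here, and the core coordinates are illustrative.

```python
import ttnn

# Illustrative core assignments: 12 DRAM reader cores vs. 24 matmul receiver cores.
dram_reader_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(0, 11))})
matmul_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(1, 0), ttnn.CoreCoord(2, 11))})

# Assumed API: one SubDevice per op, registered through a sub-device manager.
# `device` and `local_l1_size` come from the surrounding model/test setup.
prefetcher_sub_device = ttnn.SubDevice([dram_reader_cores])
matmul_sub_device = ttnn.SubDevice([matmul_cores])

manager_id = device.create_sub_device_manager([prefetcher_sub_device, matmul_sub_device], local_l1_size)
device.load_sub_device_manager(manager_id)

# Both ops are then enqueued on their own sub-device; the matmul stalls on the
# global CB (remote_cb_wait_front) until the prefetcher has written its weights.
```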

Checklist

@avoraTT force-pushed the avora/pf-mm-rebase branch 2 times, most recently from e293fb0 to d669398 on January 3, 2025 04:31
avoraTT and others added 28 commits January 8, 2025 22:50
@yugaoTT force-pushed the avora/pf-mm-rebase branch from 7b67467 to 067b916 on January 8, 2025 23:04
@SeanNijjar left a comment
Didn't do too much deep dive before. Just nits and comments.

constexpr uint32_t num_receivers = get_compile_time_arg_val(3);
constexpr uint32_t max_block_num_tiles = get_compile_time_arg_val(4);

constexpr uint32_t local_cb_id = tt::CBIndex::c_0;

Can these just be specified as CT args?

constexpr uint32_t max_block_size = get_compile_time_arg_val(5);

constexpr uint32_t cb_id = tt::CBIndex::c_0; // Reader cb
constexpr uint32_t addrs_cb_id = tt::CBIndex::c_1; // Tensor addrs cb

I'm trying to understand who populates this but I can't seem to find the producer. How do the addresses actually get generated and fed in?

uint32_t max_block_tiles = *std::max_element(tensor_block_num_tiles.begin(), tensor_block_num_tiles.end());
auto max_tile_size_iterator = std::max_element(tensor_tile_sizes.begin(), tensor_tile_sizes.end());
uint32_t max_tile_size = *max_tile_size_iterator;
uint32_t max_tile_size_tensor_idx = max_tile_size_iterator - tensor_tile_sizes.begin();

consider using std::distance


/* Tiles */
tt::tt_metal::Tile tensor_addrs_tile = tensor_addrs.get_tensor_spec().tile();
std::vector<tt::tt_metal::Tile> tensor_tiles(tensors.size());

Suggested change
std::vector<tt::tt_metal::Tile> tensor_tiles(tensors.size());
std::vector<tt::tt_metal::Tile> tensor_tiles;
tensor_tiles.reserve(tensors.size());
std::transform(tensors.begin(), tensors.end(), std::back_inserter(tensor_tiles), [](auto const& t) { return t.get_tensor_spec().tile(); });

nit: readability
(unsure if we are safe to use ranges, which could reduce it to
std::ranges::transform(tensors, std::back_inserter(tensor_tiles), [](auto const& t) { return t.get_tensor_spec().tile(); });)

tt::DataFormat tensor_addrs_data_format = tt::tt_metal::datatype_to_dataformat_converter(tensor_addrs.get_dtype());
std::vector<tt::DataFormat> tensor_data_formats(tensors.size());
for (size_t t = 0; t < tensors.size(); ++t) {
tensor_data_formats[t] = tt::tt_metal::datatype_to_dataformat_converter(tensors[t].get_dtype());

nit: consider std::transform

tensor_addrs_buffer->shard_spec().shape()[1]; // TODO: check this
uint32_t tensor_addrs_cb_size =
num_layers * num_tensors *
tensor_addrs_single_tile_size; // tensor_addrs_cb_num_tiles * tensor_addrs_single_tile_size;

commented code?

using namespace tt::constants;
using namespace tt::tt_metal;

void get_max_page_size_and_num_pages(

consider making this just return pair or tuple
