Add support for DRAM Prefetcher op #16244
Conversation
Didn't do too much of a deep dive before. Just nits and comments.
```cpp
constexpr uint32_t num_receivers = get_compile_time_arg_val(3);
constexpr uint32_t max_block_num_tiles = get_compile_time_arg_val(4);

constexpr uint32_t local_cb_id = tt::CBIndex::c_0;
```
Can these just be specified as CT args?
```cpp
constexpr uint32_t max_block_size = get_compile_time_arg_val(5);

constexpr uint32_t cb_id = tt::CBIndex::c_0;        // Reader cb
constexpr uint32_t addrs_cb_id = tt::CBIndex::c_1;  // Tensor addrs cb
```
I'm trying to understand who populates this, but I can't seem to find the producer. How do the addresses actually get generated and fed in?
```cpp
uint32_t max_block_tiles = *std::max_element(tensor_block_num_tiles.begin(), tensor_block_num_tiles.end());
auto max_tile_size_iterator = std::max_element(tensor_tile_sizes.begin(), tensor_tile_sizes.end());
uint32_t max_tile_size = *max_tile_size_iterator;
uint32_t max_tile_size_tensor_idx = max_tile_size_iterator - tensor_tile_sizes.begin();
```
consider using std::distance
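For reference, a minimal sketch of that change (the example data is made up; the variable names match the excerpt above):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

std::vector<uint32_t> tensor_tile_sizes = {2048, 4096, 1024};  // example data
auto max_tile_size_iterator = std::max_element(tensor_tile_sizes.begin(), tensor_tile_sizes.end());
uint32_t max_tile_size = *max_tile_size_iterator;
// std::distance states the intent more clearly than raw iterator subtraction
// and also works with non-random-access iterators.
auto max_tile_size_tensor_idx = std::distance(tensor_tile_sizes.begin(), max_tile_size_iterator);
```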
```cpp
/* Tiles */
tt::tt_metal::Tile tensor_addrs_tile = tensor_addrs.get_tensor_spec().tile();
std::vector<tt::tt_metal::Tile> tensor_tiles(tensors.size());
```
```cpp
std::vector<tt::tt_metal::Tile> tensor_tiles;
tensor_tiles.reserve(tensors.size());
std::transform(tensors.begin(), tensors.end(), std::back_inserter(tensor_tiles),
               [](auto const& t) { return t.get_tensor_spec().tile(); });
```
nit: readability
(unsure if we are safe to use ranges, which could reduce it to:
```cpp
std::ranges::transform(tensors, std::back_inserter(tensor_tiles), [](auto const& t) { return t.get_tensor_spec().tile(); });
```
)
```cpp
tt::DataFormat tensor_addrs_data_format = tt::tt_metal::datatype_to_dataformat_converter(tensor_addrs.get_dtype());
std::vector<tt::DataFormat> tensor_data_formats(tensors.size());
for (size_t t = 0; t < tensors.size(); ++t) {
    tensor_data_formats[t] = tt::tt_metal::datatype_to_dataformat_converter(tensors[t].get_dtype());
}
```
nit: consider std::transform
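For example (same shape as the earlier suggestion; a sketch, not the exact PR code; needs `<algorithm>` and `<iterator>`):

```cpp
std::vector<tt::DataFormat> tensor_data_formats;
tensor_data_formats.reserve(tensors.size());
std::transform(tensors.begin(), tensors.end(), std::back_inserter(tensor_data_formats),
               [](auto const& t) { return tt::tt_metal::datatype_to_dataformat_converter(t.get_dtype()); });
```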
```cpp
tensor_addrs_buffer->shard_spec().shape()[1];  // TODO: check this
uint32_t tensor_addrs_cb_size =
    num_layers * num_tensors *
    tensor_addrs_single_tile_size;  // tensor_addrs_cb_num_tiles * tensor_addrs_single_tile_size;
```
commented code?
```cpp
using namespace tt::constants;
using namespace tt::tt_metal;

void get_max_page_size_and_num_pages(
```
consider making this just return a pair or tuple
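A minimal sketch of that refactor; the parameter list and internal logic here are assumptions, since the function body isn't shown in the excerpt:

```cpp
#include <cstdint>
#include <utility>

// Hypothetical signature and logic, for illustration only.
std::pair<uint32_t, uint32_t> get_max_page_size_and_num_pages(uint32_t num_tiles, uint32_t tile_size) {
    uint32_t page_size = tile_size;  // assumed: one tile per NOC page
    uint32_t num_pages = num_tiles;
    return {page_size, num_pages};
}

// Call sites then read naturally with structured bindings:
// auto [page_size, num_pages] = get_max_page_size_and_num_pages(num_tiles, tile_size);
```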
Ticket
Problem description
One of the main blockers to achieving 80 t/s/u on TG for the Llama family of models is the set of 5 DRAM-bound matmuls in the model (QKV, DO, FF1/2/3).
What's changed
Add a new ttnn.dram_prefetcher op that runs asynchronously in the background and prefetches the matmul weights from DRAM into L1.
Interface
The DRAM Prefetcher op takes in the following args (the tensors for all layers, ordered layer by layer):
[t1_l1, t2_l1, ..., t1_l2, t2_l2, ..., t1_l3, t2_l3, ...]
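For illustration, a hedged sketch of how a caller might assemble that ordering (weights, num_layers, and num_tensors_per_layer are hypothetical names, not the op's actual parameters):

```cpp
// Flatten per-layer weight tensors into the layer-by-layer order the op expects:
// [t1_l1, t2_l1, ..., t1_l2, t2_l2, ...]
std::vector<ttnn::Tensor> prefetch_tensors;
prefetch_tensors.reserve(num_layers * num_tensors_per_layer);
for (uint32_t layer = 0; layer < num_layers; ++layer) {
    for (uint32_t t = 0; t < num_tensors_per_layer; ++t) {
        prefetch_tensors.push_back(weights[layer][t]);
    }
}
```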
Prefetcher
Each DRAM bank has a closest core, which we call the DRAM reader core. The prefetcher runs on these cores. The reader kernel reads a tensor from DRAM and stores it in a local CB. The writer kernel reads from the local CB and uses NOC0 to write to 2 neighboring cores, calling remote_cb_push_back on the Global CB provided by the user. These neighboring cores (aka receiver cores) are the consumers of the DRAM-prefetched tensors. As such, the matmuls must be performed on a CoreRangeSet made up of these specific receiver cores. With 12 DRAM reader cores, each writing to 2 neighboring cores, we have 24 cores to perform the matmul on.
Here's an example of what the grid looks like on a TG. The red cores are the DRAM reader cores and the purple cores are the receiver cores, i.e. the matmul cores.
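In kernel pseudocode, the writer side looks roughly like the sketch below. cb_wait_front/cb_pop_front are standard tt-metal dataflow calls, but the exact remote_cb_push_back signature, the loop bounds, and the variable names are assumptions:

```cpp
// Hedged sketch of the prefetcher writer kernel's main loop.
for (uint32_t layer = 0; layer < num_layers; ++layer) {
    for (uint32_t t = 0; t < num_tensors; ++t) {
        for (uint32_t block = 0; block < num_blocks; ++block) {
            cb_wait_front(local_cb_id, block_num_tiles);  // wait for the reader to land a block in L1
            // ... NOC0 writes of the block to the 2 receiver cores happen here ...
            remote_cb_push_back(remote_cb_id, block_num_pages);  // assumed signature: advance the Global CB
            cb_pop_front(local_cb_id, block_num_tiles);   // free the local CB slot
        }
    }
}
```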
Matmul
The prefetcher is designed to be paired with Matmul1D in the gather_in0 mode (where the activations are ring-gathered instead of mcasted; see details in #14964). For this matmul, both the activations and the weights must be sharded. When combined with the prefetcher, the Global CB is used as a synchronization mechanism (remote_cb_wait_front). This leads to a seamless overlap between the prefetcher writing weights into the matmul cores and the matmul op consuming them.
However, since both ops involve data movement across cores (prefetcher: writing to receiver cores; matmul: gathering activations), it is important to use separate NOCs to eliminate NOC congestion. As such, the matmul ring is ordered in a specific fashion such that only NOC1 is used (see diagram above).
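On the consumer side, the handshake is the mirror image. remote_cb_wait_front comes from the description above, while the pop call and the exact signatures are assumptions:

```cpp
// Hedged sketch of the matmul-side synchronization on a receiver core.
remote_cb_wait_front(remote_cb_id, block_num_pages);  // stall until the prefetcher has written the weight block
// ... consume the in1 block that now resides in the Global CB ...
remote_cb_pop_front(remote_cb_id, block_num_pages);   // assumed call: release the space back to the prefetcher
```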
As seen above, the NOC1 matmul ring contains an extra core at 4,8, which is required to complete the ring while satisfying the constraint of only using NOC1. This core is called a hop_core. This PR also adds support in Matmul1D's gather_in0 mode for taking in a list of hop_cores placed at the end of the ring. These cores are used purely for data movement: they complete the ring so that the activations can be gathered, and are not involved in any computation.
Here are the results for the ring gather in an FF1 matmul, measured on a 900 MHz WH machine. Although the NOC0/1 ring is faster by itself, the NOC1-only ring with hop_cores does not slow down due to interference from the prefetcher.
To handle the different in1 tensor storage cases, the matmul compute kernel needs to manually handle the read pointers.
In the Global CB, the in1 tensor can either be allocated contiguously or be split into a bottom and a top part, since it can reach the bottom of the Global CB and wrap back around to the top. Each core can also start at a different block id, so one core may start reading from the top and later read the bottom, while another does the reverse.
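A hedged sketch of the wraparound arithmetic the compute kernel has to do (all names here are hypothetical, not the PR's actual helpers):

```cpp
// Advance a read pointer inside the Global CB, wrapping from the bottom back to the top.
// cb_base/cb_size describe the Global CB region; block_size is the size of one in1 block.
uint32_t next_block_read_addr(uint32_t read_addr, uint32_t cb_base, uint32_t cb_size, uint32_t block_size) {
    read_addr += block_size;
    if (read_addr >= cb_base + cb_size) {
        // The tensor wrapped: continue reading from the top of the CB.
        read_addr = cb_base + (read_addr - (cb_base + cb_size));
    }
    return read_addr;
}
```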
Putting it all together
To combine the prefetcher and the matmul, each must run in its own SubDevice. The DRAM reader cores are placed in a SubDevice separate from the matmul cores. Once launched, both ops run in parallel, with the matmul stalling until it receives the weights from the prefetcher.
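As a heavily hedged sketch of that setup (the SubDevice API shown is approximate, and the core-range variables are hypothetical):

```cpp
// Place the DRAM reader cores and the matmul cores in separate sub-devices,
// so the prefetcher and the matmul can be dispatched and run independently.
SubDevice prefetcher_sub_device(std::array{dram_reader_cores});  // CoreRangeSet of reader cores
SubDevice matmul_sub_device(std::array{matmul_cores});           // CoreRangeSet of receiver + hop cores
auto manager_id = device->create_sub_device_manager(
    {prefetcher_sub_device, matmul_sub_device}, l1_unreserved_size);
device->load_sub_device_manager(manager_id);
```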
Checklist