
Add support for DRAM Prefetcher op #16244

Open
wants to merge 98 commits into main from avora/pf-mm-rebase
Conversation

@avoraTT commented Dec 20, 2024

Ticket

Problem description

One of the main blockers to achieving 80 t/s/u on TG for the Llama family of models is the set of 5 DRAM-bound matmuls in the model (QKV, DO, FF1/2/3).

What's changed

Add a new ttnn.dram_prefetcher op that runs asynchronously in the background and prefetches the matmul weights from DRAM into L1.

Interface

The DRAM Prefetcher Op takes in the following args:

  • a list of ttnn tensors (1 layer), used to get the shapes of all the tensors in one layer
  • a ttnn tensor containing the DRAM addresses of all tensors in all layers
    • The format is as follows: [t1_l1, t2_l1, ..., t1_l2, t2_l2, ..., t1_l3, t2_l3, ...]
  • the number of layers
  • the global circular buffer object
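
For illustration, here is a minimal sketch of how the op might be invoked from Python. The positional order mirrors the list above, but the exact signature and argument names are assumptions rather than the code added by this PR.

```python
import ttnn

# Sketch only: argument order/names follow the list above and may not match
# the exact ttnn.dram_prefetcher signature introduced in this PR.
prefetched = ttnn.dram_prefetcher(
    layer_tensors,   # list of ttnn tensors for 1 layer (provides shapes/dtypes)
    tensor_addrs,    # ttnn tensor of DRAM addresses for all tensors in all layers
    num_layers,      # number of layers to prefetch
    global_cb,       # global circular buffer shared with the matmul receiver cores
)
```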

Prefetcher

Each DRAM bank has a closest core, which we call the DRAM reader core. The prefetcher runs on these cores. The reader kernel reads a tensor from DRAM and stores it in a local CB. The writer kernel reads from the local CB and uses NOC0 to write to 2 neighboring cores, calling remote_cb_push_back on the Global CB provided by the user. These neighboring cores (aka receiver cores) are the consumers of the prefetched tensors. As such, the matmuls must be performed on a CoreRangeSet made up of these specific receiver cores. With 12 DRAM reader cores, each writing to 2 neighboring cores, there are 24 cores on which to perform the matmul.

Here's an example of what the grid looks like on a TG. The red cores are the DRAM reader cores and the purple cores are the receiver cores, i.e. the matmul cores.
[Figure: TG core grid with DRAM reader cores (red) and receiver/matmul cores (purple)]
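
For illustration, a sketch (with made-up coordinates) of how the receiver cores could be collected into the CoreRangeSet the matmul runs on; the real placement follows the grid in the figure above.

```python
import ttnn

# Illustrative coordinates only: on real hardware the reader cores sit next to
# the DRAM banks, and each reader feeds its 2 neighbouring receiver cores.
reader_cores = [ttnn.CoreCoord(0, y) for y in range(12)]                          # 12 DRAM reader cores
receiver_cores = [ttnn.CoreCoord(x, c.y) for c in reader_cores for x in (1, 2)]   # 2 per reader -> 24 cores

# The matmul must run on exactly these receiver cores.
matmul_core_range_set = ttnn.CoreRangeSet({ttnn.CoreRange(c, c) for c in receiver_cores})
```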

Matmul

The prefetcher is designed to be paired with a Matmul1D in gather_in0 mode (where the activations are ring-gathered instead of being mcasted; see details in #14964). For this matmul, both the activations and the weights must be sharded. When combined with the prefetcher, the Global CB is used as a synchronization mechanism (remote_cb_wait_front). This leads to a seamless overlap between the prefetcher writing weights into the matmul cores and the matmul op consuming them.

However, since both ops involve data movement across cores (prefetcher: writing to receiver cores; matmul: gathering activations), it is important to use separate NOCs to avoid NOC congestion. As such, the matmul ring is ordered in a specific fashion, such that only NOC1 is used (see diagram above).

As seen above, the NOC1 matmul ring contains an extra core at (4,8), which is required to complete the ring while satisfying the constraint of only using NOC1. This core is called a hop_core. This PR also adds support in Matmul1D's gather_in0 mode to take in a list of hop_cores at the end of the ring. These cores are used only for data movement: they complete the ring so that the activations can be gathered, and they are not involved in any computation.
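
As a sketch, the hop core can be expressed as its own CoreRangeSet and handed to the matmul's gather_in0 program config; the hop_cores parameter name comes from this PR's description, while the surrounding config fields are omitted and the exact field layout is an assumption.

```python
import ttnn

# The hop core at (4, 8) only forwards data to close the NOC1 ring; it does no compute.
hop_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(4, 8), ttnn.CoreCoord(4, 8))})

# Hypothetical keyword usage; the other 1D matmul program-config fields
# (grid size, block/subblock shapes, per-core M/N, ...) are omitted here.
gather_in0_extras = dict(gather_in0=True, hop_cores=hop_cores)
```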

Here are the results for the ring gather in an FF1 matmul, measured on a 900 MHz WH machine. Although the NOC0/1 ring is faster in isolation, the NOC1-only ring with hop_cores does not slow down due to interference from the prefetcher.

| Matmul | No interference (us) | With interference (us) | Slowdown due to interference |
| --- | --- | --- | --- |
| NOC0/1 ring | 8.14 | 9.96 | 22.36% |
| NOC1 ring | 8.67 | 8.78 | 1.27% |
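
The slowdown column is just the relative increase of the interfered time over the baseline:

```python
# Slowdown due to interference = with_interference / no_interference - 1
for name, base_us, interfered_us in [("NOC0/1 ring", 8.14, 9.96), ("NOC1 ring", 8.67, 8.78)]:
    print(f"{name}: {(interfered_us / base_us - 1) * 100:.2f}% slowdown")
# NOC0/1 ring: 22.36% slowdown
# NOC1 ring: 1.27% slowdown
```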

To handle the different in1 tensor storage cases, the matmul compute kernel needs to manage the read pointers manually. In the global CB, an in1 tensor can either be allocated contiguously or be split into bottom and top parts, since it may reach the bottom of the global CB and wrap back to the top. Each core can also start at a different block id, so a core may start reading from the top and later read the bottom, or vice versa.
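
A minimal, self-contained sketch of that wraparound bookkeeping (plain Python with illustrative names; the real logic lives in the matmul compute kernel's read-pointer handling):

```python
# Model of reading num_blocks in1 blocks from a circular region of
# cb_size_tiles tiles, starting at an arbitrary block id. When the read
# pointer would run past the bottom of the global CB, it wraps back to the
# top, so a tensor may be consumed as a bottom part followed by a top part.
def block_read_offsets(start_block, num_blocks, block_num_tiles, cb_size_tiles):
    offsets = []
    read_ptr = start_block * block_num_tiles
    for _ in range(num_blocks):
        if read_ptr + block_num_tiles > cb_size_tiles:
            read_ptr = 0  # wrap back to the top of the CB
        offsets.append(read_ptr)
        read_ptr += block_num_tiles
    return offsets

# A core starting at block 4 of a 6-block CB reads two blocks at the bottom,
# then wraps and continues from the top.
print(block_read_offsets(start_block=4, num_blocks=8, block_num_tiles=64, cb_size_tiles=384))
```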

Putting it all together

To combine the prefetcher and the matmul, each must run in its own SubDevice. The DRAM reader cores are placed in a SubDevice that is separate from the matmul cores. Once launched, both ops run in parallel, with the matmul stalling until it receives the weights from the prefetcher.
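
A hedged sketch of the sub-device split follows; the SubDevice/manager method names follow the tt-metal sub-device interface and are assumptions here, and the core coordinates are illustrative.

```python
import ttnn

# Illustrative core assignments: 12 DRAM reader cores vs. 24 matmul receiver cores.
dram_reader_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(0, 11))})
matmul_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(1, 0), ttnn.CoreCoord(2, 11))})

# Assumed API: one SubDevice per op, registered through a sub-device manager.
# `device` and `local_l1_size` come from the surrounding model/test setup.
prefetcher_sub_device = ttnn.SubDevice([dram_reader_cores])
matmul_sub_device = ttnn.SubDevice([matmul_cores])

manager_id = device.create_sub_device_manager([prefetcher_sub_device, matmul_sub_device], local_l1_size)
device.load_sub_device_manager(manager_id)

# Both ops are then enqueued on their own sub-device; the matmul stalls on the
# global CB (remote_cb_wait_front) until the prefetcher has written its weights.
```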

Checklist

@avoraTT force-pushed the avora/pf-mm-rebase branch 2 times, most recently from e293fb0 to d669398 on January 3, 2025 04:31
avoraTT and others added 28 commits January 8, 2025 22:50
@yugaoTT force-pushed the avora/pf-mm-rebase branch from 7b67467 to 067b916 on January 8, 2025 23:04
@SeanNijjar left a comment
Didn't do too much deep dive before. Just nits and comments.

constexpr uint32_t num_receivers = get_compile_time_arg_val(3);
constexpr uint32_t max_block_num_tiles = get_compile_time_arg_val(4);

constexpr uint32_t local_cb_id = tt::CBIndex::c_0;

Can these just be specified as CT args?

constexpr uint32_t max_block_size = get_compile_time_arg_val(5);

constexpr uint32_t cb_id = tt::CBIndex::c_0; // Reader cb
constexpr uint32_t addrs_cb_id = tt::CBIndex::c_1; // Tensor addrs cb

I'm trying to understand who populates this but I can't seem to find the producer. How do the addresses actually get generated and fed in?

uint32_t max_block_tiles = *std::max_element(tensor_block_num_tiles.begin(), tensor_block_num_tiles.end());
auto max_tile_size_iterator = std::max_element(tensor_tile_sizes.begin(), tensor_tile_sizes.end());
uint32_t max_tile_size = *max_tile_size_iterator;
uint32_t max_tile_size_tensor_idx = max_tile_size_iterator - tensor_tile_sizes.begin();

consider using std::distance


/* Tiles */
tt::tt_metal::Tile tensor_addrs_tile = tensor_addrs.get_tensor_spec().tile();
std::vector<tt::tt_metal::Tile> tensor_tiles(tensors.size());

Suggested change
std::vector<tt::tt_metal::Tile> tensor_tiles(tensors.size());
std::vector<tt::tt_metal::Tile> tensor_tiles;
tensor_tiles.reserve(tensors.size());
std::transform(tensors.begin(), tensors.end(), std::back_inserter(tensor_tiles), [](auto const& t) { return t.get_tensor_spec().tile(); });

nit: readability
(unsure if we are safe to use ranges, which could reduce it to
std::ranges::transform(tensors, std::back_inserter(tensor_tiles), [](auto const& t) { return t.get_tensor_spec().tile(); });)

tt::DataFormat tensor_addrs_data_format = tt::tt_metal::datatype_to_dataformat_converter(tensor_addrs.get_dtype());
std::vector<tt::DataFormat> tensor_data_formats(tensors.size());
for (size_t t = 0; t < tensors.size(); ++t) {
tensor_data_formats[t] = tt::tt_metal::datatype_to_dataformat_converter(tensors[t].get_dtype());

nit: consider std::transform

tensor_addrs_buffer->shard_spec().shape()[1]; // TODO: check this
uint32_t tensor_addrs_cb_size =
num_layers * num_tensors *
tensor_addrs_single_tile_size; // tensor_addrs_cb_num_tiles * tensor_addrs_single_tile_size;

commented code?

using namespace tt::constants;
using namespace tt::tt_metal;

void get_max_page_size_and_num_pages(

consider making this just return pair or tuple
