Add subdevice support to multicore untilize #16193
base: main
Conversation
Clang-Tidy found issue(s) with the introduced code (1/1)
```cpp
uint32_t block_size_nbytes = input_single_tile_size;

auto cores = corerange_to_cores(sub_core_grids, ncores, true);
auto all_cores = num_cores_to_corerangeset_in_subcoregrids(cores[0], ncores, sub_core_grids, true);
```
Clang-Tidy: use of undeclared identifier `num_cores_to_corerangeset_in_subcoregrids`
```cpp
auto src_buffer = input_tensors.at(0).buffer();
auto dst_buffer = output_tensors.at(0).buffer();
{
    auto& runtime_args_by_core = GetRuntimeArgs(program, reader_kernel_id);
```
Clang-Tidy: use of undeclared identifier `reader_kernel_id`
```cpp
}

{
    auto& runtime_args_by_core = GetRuntimeArgs(program, writer_kernel_id);
```
Clang-Tidy: use of undeclared identifier `writer_kernel_id`
```diff
@@ -40,7 +40,8 @@ ttnn::Tensor ExecuteUntilize::invoke(
     const ttnn::Tensor& input_tensor,
     const std::optional<MemoryConfig>& memory_config,
     bool use_multicore,
-    bool use_pack_untilize) {
+    bool use_pack_untilize,
+    const std::optional<CoreRangeSet> sub_core_grids) {
```
Clang-Tidy: the const-qualified parameter `sub_core_grids` is copied for each invocation; consider making it a reference.

Suggested change:
```diff
-    const std::optional<CoreRangeSet> sub_core_grids) {
+    const std::optional<CoreRangeSet>& sub_core_grids) {
```
```diff
@@ -16,13 +16,15 @@ struct ExecuteUntilize {
     const ttnn::Tensor& input_tensor,
     const std::optional<MemoryConfig>& memory_config = std::nullopt,
     bool use_multicore = true,
-    bool use_pack_untilize = true);
+    bool use_pack_untilize = true,
+    const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
```
Clang-Tidy: the const-qualified parameter `sub_core_grids` is copied for each invocation; consider making it a reference.

Suggested change:
```diff
-    const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
+    const std::optional<CoreRangeSet>& sub_core_grids = std::nullopt);
```
```diff
-    bool use_pack_untilize) {
-    return invoke(DefaultQueueId, input_tensor, memory_config, use_multicore, use_pack_untilize);
+    bool use_pack_untilize,
+    const std::optional<CoreRangeSet> sub_core_grids) {
```
Clang-Tidy: the const-qualified parameter `sub_core_grids` is copied for each invocation; consider making it a reference.

Suggested change:
```diff
-    const std::optional<CoreRangeSet> sub_core_grids) {
+    const std::optional<CoreRangeSet>& sub_core_grids) {
```
```diff
     static ttnn::Tensor invoke(
         const ttnn::Tensor& input_tensor,
         const std::optional<MemoryConfig>& memory_config = std::nullopt,
         bool use_multicore = true,
-        bool use_pack_untilize = true);
+        bool use_pack_untilize = true,
+        const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
```
Clang-Tidy: the const-qualified parameter `sub_core_grids` is copied for each invocation; consider making it a reference.

Suggested change:
```diff
-    const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
+    const std::optional<CoreRangeSet>& sub_core_grids = std::nullopt);
```
```diff
@@ -29,6 +30,190 @@ uint32_t get_largest_divisor(uint32_t dividend, uint32_t starting_divisor, uint3
     return 1;
 }

+operation::ProgramWithCallbacks untilize_multi_core_parallelize_column_subgrid(
+    const Tensor& a,
```
Reviewer: Is the support here 1:1 with the version without subgrids? Asking because I'm curious whether the validate needs updating.
Author: The regular version splits the cores into `core_range` and `core_range_cliff`, but when I ran it for my shapes, `core_range_cliff` was empty, so I think it is 1:1. I think validate is fine.
```cpp
uint32_t block_size_nbytes = input_single_tile_size;

auto cores = corerange_to_cores(sub_core_grids, ncores, true);
auto all_cores = num_cores_to_corerangeset_in_subcoregrids(cores[0], ncores, sub_core_grids, true);
```
Reviewer: Is the only difference here that we're using a different work split and `num_cores_to_corerangeset_in_subcoregrids` to get the cores?
Reviewer: If so, could you just have one function that if-elses when it's `nullopt`?
Author: The multicore version has some computations for `num_x_cores` and `num_y_cores` and splits the cores into `core_range` and `core_range_cliff`:
```cpp
auto [ncores, all_cores, core_range, core_range_cliff, nblocks_per_core, nblocks_per_core_cliff] =
    ttnn::split_blocks_for_tilize(CoreCoord(ncores_x, ncores_y), nblocks);
```
```python
"nb, nc, nh, nw",
(
    # llama shapes
    (1, 1, 32, 128 * 1024),
```
Reviewer: No other shapes are needed?
Author: For our model use case, this is the only one.
Reviewer: Would you add a message describing the scenario in which to use the `sub_core_grids` version?

Author: The scenario is very specific to a model use case: we are working on DRAM prefetching on specific cores, so the ops need to be moved to a specific core grid. I don't expect anyone other than us to use this right now.
What's changed
Added `sub_core_grids` as an argument to select specific cores in the multicore version of the op.
Checklist