Add subdevice support to multicore untilize #16193

Open
wants to merge 11 commits into base: main

Conversation

@sraizada-tt sraizada-tt (Contributor) commented Dec 19, 2024

What's changed

Added sub_core_grids as an argument so the multicore version of the op can be run on a specific set of cores.
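
As a quick illustration (not part of this PR's diff), a call through the updated entry point might look like the sketch below. The CoreRange coordinates, the input_tensor variable, and the surrounding setup are made-up placeholders; only the final sub_core_grids argument is new in this change:

// Hypothetical usage sketch: run the multicore untilize on a restricted core grid.
// The 8-core grid below is illustrative; use whichever cores your workload reserves.
CoreRangeSet sub_core_grids(std::set<CoreRange>{CoreRange(CoreCoord(1, 0), CoreCoord(2, 3))});
ttnn::Tensor output = ttnn::untilize(
    input_tensor,
    /*memory_config=*/std::nullopt,
    /*use_multicore=*/true,
    /*use_pack_untilize=*/true,
    /*sub_core_grids=*/sub_core_grids);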

@github-actions github-actions bot (Contributor) left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

uint32_t block_size_nbytes = input_single_tile_size;

auto cores = corerange_to_cores(sub_core_grids, ncores, true);
auto all_cores = num_cores_to_corerangeset_in_subcoregrids(cores[0], ncores, sub_core_grids, true);
⚠️ clang-diagnostic-error ⚠️
use of undeclared identifier num_cores_to_corerangeset_in_subcoregrids

auto src_buffer = input_tensors.at(0).buffer();
auto dst_buffer = output_tensors.at(0).buffer();
{
auto& runtime_args_by_core = GetRuntimeArgs(program, reader_kernel_id);
⚠️ clang-diagnostic-error ⚠️
use of undeclared identifier reader_kernel_id

}

{
auto& runtime_args_by_core = GetRuntimeArgs(program, writer_kernel_id);
⚠️ clang-diagnostic-error ⚠️
use of undeclared identifier writer_kernel_id

@@ -40,7 +40,8 @@ ttnn::Tensor ExecuteUntilize::invoke(
    const ttnn::Tensor& input_tensor,
    const std::optional<MemoryConfig>& memory_config,
    bool use_multicore,
-   bool use_pack_untilize) {
+   bool use_pack_untilize,
+   const std::optional<CoreRangeSet> sub_core_grids) {
⚠️ performance-unnecessary-value-param ⚠️
the const qualified parameter sub_core_grids is copied for each invocation; consider making it a reference

Suggested change
const std::optional<CoreRangeSet> sub_core_grids) {
const std::optional<CoreRangeSet>& sub_core_grids) {

@@ -16,13 +16,15 @@ struct ExecuteUntilize {
    const ttnn::Tensor& input_tensor,
    const std::optional<MemoryConfig>& memory_config = std::nullopt,
    bool use_multicore = true,
-   bool use_pack_untilize = true);
+   bool use_pack_untilize = true,
+   const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
⚠️ performance-unnecessary-value-param ⚠️
the const qualified parameter sub_core_grids is copied for each invocation; consider making it a reference

Suggested change
const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
const std::optional<CoreRangeSet>& sub_core_grids = std::nullopt);

-   bool use_pack_untilize) {
-   return invoke(DefaultQueueId, input_tensor, memory_config, use_multicore, use_pack_untilize);
+   bool use_pack_untilize,
+   const std::optional<CoreRangeSet> sub_core_grids) {
⚠️ performance-unnecessary-value-param ⚠️
the const qualified parameter sub_core_grids is copied for each invocation; consider making it a reference

Suggested change
const std::optional<CoreRangeSet> sub_core_grids) {
const std::optional<CoreRangeSet>& sub_core_grids) {


static ttnn::Tensor invoke(
    const ttnn::Tensor& input_tensor,
    const std::optional<MemoryConfig>& memory_config = std::nullopt,
    bool use_multicore = true,
-   bool use_pack_untilize = true);
+   bool use_pack_untilize = true,
+   const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
⚠️ performance-unnecessary-value-param ⚠️
the const qualified parameter sub_core_grids is copied for each invocation; consider making it a reference

Suggested change
const std::optional<CoreRangeSet> sub_core_grids = std::nullopt);
const std::optional<CoreRangeSet>& sub_core_grids = std::nullopt);

@@ -29,6 +30,190 @@ uint32_t get_largest_divisor(uint32_t dividend, uint32_t starting_divisor, uint3
return 1;
}

operation::ProgramWithCallbacks untilize_multi_core_parallelize_column_subgrid(
const Tensor& a,
Contributor
is the support here 1:1 with the version without subgrids?

Asking because I'm curious if the validate needs updating.

Contributor Author

The regular version splits the cores into core_range and core_range_cliff, but when I ran it for my shapes, core_range_cliff was empty, so I think it is 1:1.
I think validate is fine.

uint32_t block_size_nbytes = input_single_tile_size;

auto cores = corerange_to_cores(sub_core_grids, ncores, true);
auto all_cores = num_cores_to_corerangeset_in_subcoregrids(cores[0], ncores, sub_core_grids, true);
Contributor

is the only difference here that we're using a different work_split and num_cores_to_corerangeset_in_subcoregrids to get the cores?

Contributor

If so, could you just have one function that if-elses when it's nullopt?

Contributor Author

The multicore version does some extra computation for the number of x and y cores and splits the cores into core_range and core_range_cliff:

auto [ncores, all_cores, core_range, core_range_cliff, nblocks_per_core, nblocks_per_core_cliff] =
    ttnn::split_blocks_for_tilize(CoreCoord(ncores_x, ncores_y), nblocks);
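
For reference, a rough sketch of the single-function approach suggested above, branching on whether sub_core_grids was provided. This is not code from the PR; it reuses the helpers quoted in the snippets, and it assumes ncores, nblocks, ncores_x, and ncores_y are computed earlier in the factory as they are today:

// Hypothetical merge of the two paths: pick the core set based on sub_core_grids.
if (sub_core_grids.has_value()) {
    // Sub-grid path: enumerate the requested cores and build the matching core range set.
    auto cores = corerange_to_cores(sub_core_grids.value(), ncores, true);
    auto all_cores = num_cores_to_corerangeset_in_subcoregrids(cores[0], ncores, sub_core_grids.value(), true);
    // ... create circular buffers and kernels on all_cores; no cliff range in this path
} else {
    // Existing path: split blocks over the full grid, which may leave a cliff range.
    auto [ncores_split, all_cores, core_range, core_range_cliff, nblocks_per_core, nblocks_per_core_cliff] =
        ttnn::split_blocks_for_tilize(CoreCoord(ncores_x, ncores_y), nblocks);
    // ... create circular buffers and kernels on core_range, handling core_range_cliff separately
}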

"nb, nc, nh, nw",
(
# llama shapes
(1, 1, 32, 128 * 1024),
Contributor

No other shapes are needed?

Contributor Author

For our model use case, this is the only shape we need.

@llongTT llongTT (Contributor) left a comment

Could you add a note describing the scenario in which the sub_core_grid version should be used?

@sraizada-tt (Contributor Author)

> Could you add a note describing the scenario in which the sub_core_grid version should be used?

The scenario is very specific to a model use case: we are working on DRAM prefetching on specific cores, so the ops need to be moved to a specific core grid. I don't expect anyone other than us to use this right now.
