Feature/transformer sequence sharding #67

japols · 2025-01-08T08:45:02Z

This PR adds a new sharding strategy shard_sequence for the transformer processor.

The current implementation (shard_heads) alternates between sharding across the sequence and sharding across heads for the sliding window attention mechanism. This requires two all-to-all communication steps per layer.

The shard_sequence strategy simplifies this process by keeping a sequence shard on each GPU and computing the sliding window attention locally. This requires a halo communication to exchange overlapping window segments (halos) between neighboring sequence shards.

Instead of 2 all-to-all communication steps per layer, the halo exchange only requires a single point-to-point communication between neighbouring GPUs, reducing communication time and improving scalability of model sharding across multiple GPUs.
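To illustrate why a halo exchange is enough for a local sliding window computation, here is a minimal single-process sketch. It is not the code from this PR: ranks are simulated with a loop, plain Python lists stand in for tensors, and a windowed sum stands in for sliding window attention. All function names (split_sequence, simulate_halo_exchange, window_sum, sharded_window_sum) are hypothetical.

```python
def split_sequence(seq, num_shards):
    """Split seq into contiguous, near-equal shards (one per simulated GPU)."""
    base, extra = divmod(len(seq), num_shards)
    shards, start = [], 0
    for rank in range(num_shards):
        size = base + (1 if rank < extra else 0)
        shards.append(seq[start:start + size])
        start += size
    return shards


def simulate_halo_exchange(shards, halo_size):
    """Give every shard halo_size elements from each existing neighbour.

    Returns one (extended_shard, left_halo_len, right_halo_len) tuple per rank.
    Assumption: every shard holds at least halo_size elements.
    """
    assert all(len(s) >= halo_size for s in shards), "shard shorter than halo"
    extended = []
    for rank, shard in enumerate(shards):
        left, right = [], []
        if rank > 0:  # first rank has no left neighbour
            prev = shards[rank - 1]
            left = prev[len(prev) - halo_size:]
        if rank < len(shards) - 1:  # last rank has no right neighbour
            right = shards[rank + 1][:halo_size]
        extended.append((left + shard + right, len(left), len(right)))
    return extended


def window_sum(seq, window):
    """Stand-in for sliding window attention: sum over [i - window, i + window]."""
    return [sum(seq[max(0, i - window): i + window + 1]) for i in range(len(seq))]


def sharded_window_sum(seq, num_shards, window):
    """Compute window_sum shard by shard, using halos of width `window`."""
    out = []
    shards = split_sequence(seq, num_shards)
    for ext, left, right in simulate_halo_exchange(shards, window):
        local = window_sum(ext, window)
        out.extend(local[left: len(local) - right])  # drop the halo positions
    return out


seq = list(range(22))
# The sharded result matches the unsharded one.
assert sharded_window_sum(seq, num_shards=4, window=2) == window_sum(seq, 2)
```

Each simulated rank only needs data from its direct neighbours, which is the property the shard_sequence strategy relies on.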

The following benchmark results show that a 2-neighbor all-to-all (orange) is the best communication strategy for implementing the halo exchange, and that it consistently outperforms the old head-sharding strategy (blue):

This is an isolated forward+backward pass of 16 transformer layers with o96 input shapes and 1024 channels.

For a full training run on n320 with o96 hidden, we get the following increases in throughput (in line with the benchmark results):

GPUs/Model	sharding strategy	avg time/batch (s)
2	shard_heads	1.38495
2	shard_sequence	1.29771
4	shard_heads	0.72034
4	shard_sequence	0.69254
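For reference, the relative reduction in average time per batch can be worked out directly from the table above. The snippet below is only arithmetic on the reported numbers; the helper name relative_reduction is illustrative.

```python
# Average time per batch (s), copied from the table above.
timings = {
    2: {"shard_heads": 1.38495, "shard_sequence": 1.29771},
    4: {"shard_heads": 0.72034, "shard_sequence": 0.69254},
}


def relative_reduction(baseline, candidate):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline - candidate) / baseline


for gpus, t in timings.items():
    saved = relative_reduction(t["shard_heads"], t["shard_sequence"])
    print(f"{gpus} GPUs/model: {100 * saved:.1f}% less time per batch")
```

This works out to roughly 6.3% less time per batch with 2 GPUs per model and roughly 3.9% with 4 GPUs per model.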

[mlflow](https://mlflow.ecmwf.int/#/metric?runs=%5B%22ff99c1c794be4c69849ca6ad7e98e21e%22,%222fb2e79ac56c4fcea0d33d05569098c8%22,%2248e3ec3a3e854702adfbd29622fac8e9%22,%22d1b8c835c9cc4fc9b40e014bc10f7333%22%5D&metric=%22train_wmse_step%22&experiments=%5B%2245%22%5D&plot_metric_keys=%5B%22train_wmse_step%22%5D&plot_layout=%7B%22autosize%22:true,%22xaxis%22:%7B%7D,%22yaxis%22:%7B%7D%7D&x_axis=relative&y_axis_scale=linear&line_smoothness=1&show_point=false&deselected_curves=%5B%5D&last_linear_y_axis_range=%5B%

@mishooax @ssmmnn11

ssmmnn11

Very nice contribution :-)

ssmmnn11 · 2025-01-08T09:33:34Z

models/src/anemoi/models/distributed/transformer.py

@@ -130,6 +199,36 @@ def shard_sequence(input_: Tensor, shapes: list, mgroup: ProcessGroup) -> Tensor
    return _SplitSequenceParallelSection.apply(input_, shapes, mgroup)


+def halo_exchange(x: Tensor, halo_size: int, mgroup: ProcessGroup) -> Tensor:


I was wondering: we now have
halo_exchange
_halo_exchange
_HaloExchange

would it make sense to come up with more unique / more descriptive names for these? I think this might be a bit confusing. I admit that the names for the other routines (shard_heads etc.) are not great either.

ssmmnn11 · 2025-01-08T09:36:05Z

models/src/anemoi/models/layers/processor.py

@@ -97,6 +97,7 @@ def __init__(
        num_heads: int = 16,
        mlp_hidden_ratio: int = 4,
        dropout_p: float = 0.1,
+        shard_strategy: str = "shard_heads",


Add to doc string below?

is this value configurable? (how can one override the default?)
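One of the commits in this PR states that shard_strategy is configurable via config.model.processor. The exact wiring is not shown in this thread, so the following is only a hypothetical sketch of how a processor config entry could override the default and be validated; resolve_shard_strategy and VALID_SHARD_STRATEGIES are made-up names, and a plain dict stands in for the real config object.

```python
VALID_SHARD_STRATEGIES = ("shard_heads", "shard_sequence")


def resolve_shard_strategy(processor_config: dict, default: str = "shard_heads") -> str:
    """Return the configured shard strategy, falling back to the default.

    Parameters
    ----------
    processor_config : dict
        Hypothetical stand-in for the config.model.processor section.
    default : str
        Strategy used when the config does not set shard_strategy.
    """
    strategy = processor_config.get("shard_strategy", default)
    if strategy not in VALID_SHARD_STRATEGIES:
        raise ValueError(
            f"Unknown shard_strategy {strategy!r}, expected one of {VALID_SHARD_STRATEGIES}"
        )
    return strategy


# No entry in the config: the default is used.
assert resolve_shard_strategy({}) == "shard_heads"
# Overriding the default from the config.
assert resolve_shard_strategy({"shard_strategy": "shard_sequence"}) == "shard_sequence"
```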

ssmmnn11 · 2025-01-08T09:45:27Z

models/src/anemoi/models/layers/attention.py

-            einops.rearrange(
-                t,
-                "(batch grid) (heads vars) -> batch heads grid vars",
+        if self.shard_strategy == "shard_sequence":


this is now very long. Can we introduce something like

`if self.shard_strategy == "shard_sequence":
    x = self.shard_sequence(x)

query, key, value = self.lin_qkv(x).chunk(3, -1)

query, key, value = (
    einops.rearrange(
        t,
        "(batch grid) (heads vars) -> batch heads grid vars",
        batch=batch_size,
        heads=self.num_heads,
    )
    for t in (query, key, value)
)

if self.shard_strategy == "shard_heads":
    query = shard_heads(query, shapes=shapes, mgroup=model_comm_group)
    key = shard_heads(key, shapes=shapes, mgroup=model_comm_group)
    value = shard_heads(value, shapes=shapes, mgroup=model_comm_group)
.
.
.
.
`

agreed, the if & else blocks should be refactored as separate (member) functions

ssmmnn11 · 2025-01-08T09:49:55Z

models/src/anemoi/models/layers/attention.py

@@ -104,7 +144,11 @@ def forward(
                dropout_p=dropout_p,
            )  # expects (batch heads grid variable) format

-        out = shard_sequence(out, shapes=shapes, mgroup=model_comm_group)
+        if self.shard_strategy == "shard_sequence":
+            out = out[:, :, halo_size_left : out.shape[-2] - halo_size_right, :]  # remove halos


I would prefer this to happen in a function that lives in the same place as halo_exchange, e.g. call halo_expand first and then halo_contract (not the best names).

maybe just remove_halos
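A possible shape for such a helper, sketched on a plain Python list along the sequence (grid) axis rather than on a tensor. The name remove_halos comes from the suggestion above; halo_sizes is a hypothetical companion, and the rule that the first and last rank have no halo on their outer side is an assumption, not something confirmed in this thread.

```python
def halo_sizes(rank: int, world_size: int, halo_size: int) -> tuple:
    """Assumed rule: edge ranks have no neighbour, hence no halo, on the outer side."""
    left = halo_size if rank > 0 else 0
    right = halo_size if rank < world_size - 1 else 0
    return left, right


def remove_halos(rows: list, halo_size_left: int, halo_size_right: int) -> list:
    """Drop the halo entries from both ends of the sequence axis.

    The end index is computed as len(rows) - halo_size_right (as in the diff
    above) instead of -halo_size_right, because rows[a:-0] would be empty.
    """
    return rows[halo_size_left: len(rows) - halo_size_right]


rows = list(range(10))
left, right = halo_sizes(rank=0, world_size=4, halo_size=3)
assert remove_halos(rows, left, right) == [0, 1, 2, 3, 4, 5, 6]
```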

mishooax · 2025-01-08T10:00:44Z

models/src/anemoi/models/layers/attention.py

+        if self.shard_strategy == "shard_sequence":
+            assert (
+                shapes[-1][0] // 2 >= self.window_size[0]
+            ), "Sharded sequence length must be at least twice the window size"


we could have the assert print the sharded sequence length and window size so the user sees the values that raised the error?
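A minimal sketch of what that could look like, written as a standalone function so it can be run in isolation. check_shard_length is a hypothetical name; in the diff above the two values come from shapes[-1][0] and self.window_size[0].

```python
def check_shard_length(sharded_seq_len: int, window_size: int) -> None:
    """Same condition as the assert in the diff, with the offending values in the message."""
    assert sharded_seq_len // 2 >= window_size, (
        "Sharded sequence length must be at least twice the window size, "
        f"got sharded sequence length {sharded_seq_len} and window size {window_size}"
    )


check_shard_length(sharded_seq_len=1024, window_size=512)  # passes silently
```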

mishooax · 2025-01-08T10:05:00Z

excellent work! 👏

japols added 8 commits October 29, 2024 10:49

feat: Initial transformer sequence sharding version

e3c5283

feat: shard_strategy configurable via config.model.processor

e727e3c

Merge branch 'develop' into feature/transformer_sequence_sharding

a847f1a

feat: configurable halo comm strategies (for benchmarking)

ec0ac2d

cleanup, use all_to_all for halo exchange

0aaf0f9

docs: changelog

943a0d4

Merge branch 'develop' into feature/transformer_sequence_sharding

063844b

Merge commit '063844bde41582d243849d57a7ada39b7d6c3a65' into feature/…

d1483b0

…transformer_sequence_sharding

japols self-assigned this Jan 8, 2025

ssmmnn11 requested changes Jan 8, 2025

mishooax reviewed Jan 8, 2025
