Merge (#1)
* import ParallelMode (EleutherAI#166)

## fix typo on tensor parallel tutorial

- `from oslo import ParallelContext, ParallelMode`

* [Fix] zero param check (EleutherAI#164)

## Title

- [Fix] zero param check

## Description

- ZeRO checks parameters for redundancy when calculating the norm. There
is a minor bug in the TP check that needs to be fixed.
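
To make the redundancy issue concrete, here is a minimal single-process sketch of a global gradient norm under tensor parallelism. It is not oslo's implementation: the `Param` class and its `is_replicated` flag are hypothetical, and the sum over ranks stands in for an all-reduce. Sharded parameters contribute from every TP rank, while replicated ones are counted once, on rank 0 only.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Param:
    grad: List[float]    # local gradient values held by this rank
    is_replicated: bool  # hypothetical flag: every TP rank holds an identical copy


def local_sq_sum(params: List[Param], tp_rank: int) -> float:
    """Sum of squared gradients this rank contributes to the global norm."""
    total = 0.0
    for p in params:
        # Replicated parameters are identical on all TP ranks,
        # so only rank 0 counts them to avoid double counting.
        if p.is_replicated and tp_rank != 0:
            continue
        total += sum(g * g for g in p.grad)
    return total


def global_grad_norm(per_rank_params: List[List[Param]]) -> float:
    """Stand-in for all-reduce(SUM) over TP ranks followed by a square root."""
    return math.sqrt(
        sum(local_sq_sum(ps, rank) for rank, ps in enumerate(per_rank_params))
    )


# Two TP ranks: a weight sharded as [3.0] / [4.0] and a replicated bias with gradient [12.0].
ranks = [
    [Param([3.0], False), Param([12.0], True)],
    [Param([4.0], False), Param([12.0], True)],
]
norm = global_grad_norm(ranks)
```

Without the rank check, the replicated bias would be counted twice and the norm would be overestimated.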

## Linked Issues

- N/A

* [Fix] zero optimizer w/ tensor parallel test (EleutherAI#167)

## Title

- [Fix] zero optimizer w/ tensor parallel test

## Description

- ZeRO was not running in tensor parallel mode, so I fixed this by
switching to a model from `transformers`.

## Linked Issues

- N/A

* Add restarting model from saved model and fix bug (EleutherAI#171)

## Description

- load a model
- start training again from a saved point
- fix a bug where `training_arg` failed to save with an NCCL error. The
cause was `parallel_context`, so it is now removed before saving
`training_arg` and re-attached afterwards
- test load and restart with oslo TP
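
The detach-then-re-attach pattern described in the list above can be sketched as follows. This is only an illustration under stated assumptions: `TrainingArgs` is a hypothetical stand-in, a plain `object()` plays the role of the unserializable `parallel_context`, and JSON replaces whatever serializer oslo actually uses.

```python
import json
from contextlib import contextmanager


class TrainingArgs:
    """Hypothetical stand-in for a training-argument object."""

    def __init__(self, lr, parallel_context):
        self.lr = lr
        self.parallel_context = parallel_context  # not serializable


@contextmanager
def detached(obj, attr):
    """Temporarily set obj.<attr> to None and restore it afterwards."""
    saved = getattr(obj, attr)
    setattr(obj, attr, None)
    try:
        yield obj
    finally:
        # Re-attach even if serialization raises.
        setattr(obj, attr, saved)


def save_args(args) -> str:
    """Serialize the arguments without the attached context."""
    with detached(args, "parallel_context"):
        return json.dumps(vars(args))


ctx = object()  # stands in for a context that holds process-group handles
args = TrainingArgs(lr=1e-4, parallel_context=ctx)
saved = save_args(args)
```

After `save_args` returns, `args.parallel_context` points at the original object again, so training can continue with the same instance.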

* Make decoder-only models to be able to generate with `inputs_embeds` (EleutherAI#172)

## Title
Make decoder-only models to be able to generate with `inputs_embeds`

## Description
Synchronize the GPT2 code with Hugging Face transformers so that GPT2
can generate with `inputs_embeds`.

>Accepting `.generate()` calls with `inputs_embeds` on decoder-only
models is a long-standing request
(huggingface/transformers#6535) -- see
huggingface/transformers#6535 (comment) in
particular and its reacts.
>
>It has to be added on a per-model basis, and this PR adds the necessary
changes for GPT2. Other models will throw an informative exception if
the user passes `inputs_embeds`, asking them to check this PR and
implement the same pattern on the model they want to use it with 🤗
>
>Please note that it is still expected that the user passes `input_ids`,
i.e.

```python
outputs = model.generate(input_ids, inputs_embeds=inputs_embeds)
```

>This is because decoder-only models expect the prompt to be present in
the output, and this is the only way to preserve it! input_ids can also
be omitted and, in that case, the output won't contain the prompt.

For more details, please check out [this
PR](huggingface/transformers#21405).

* Wrong import in zero (EleutherAI#169)

## Title

Prevent from using torch 2.0

## Description

- Some features changed in torch 2.0, and oslo depends on
`torch._six`, which torch 2.0 no longer supports.

oslo dependency
- https://github.com/EleutherAI/oslo/blob/910c789e7f46d2876b964c221d31984b7924974f/oslo/torch/nn/parallel/data_parallel/zero/sharded_optim/_utils.py#L19

other issues
- microsoft/DeepSpeed#2845
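
This PR avoids the problem by preventing the use of torch 2.0. For comparison, a common alternative is a guarded import, sketched below. It assumes the name taken from `torch._six` is `inf`; the linked line is not reproduced here, so treat that as an assumption. `is_inf_norm` is a hypothetical helper added only to show the imported name in use.

```python
import math

try:
    # Available in torch 1.x only; the private torch._six module is gone in torch 2.0.
    from torch._six import inf
except ImportError:
    # Reached on torch >= 2.0, and also when torch is not installed at all.
    inf = math.inf


def is_inf_norm(norm_type: float) -> bool:
    """Return True when the requested norm is the infinity norm."""
    return norm_type == inf
```

Either branch binds `inf` to the floating-point infinity, so code that compares against it behaves the same on both torch versions.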

## Linked Issues

- resolved #00

* [Fix] Support gradient accumulation for DDP (EleutherAI#173)

## Description

To support gradient accumulation, I removed the `free_storage`
function, which can cause `CUDA error: an illegal memory access was
encountered` in many cases. Note that this change may increase memory
consumption.
What do you guys think about this PR? @nijkah @jinwonkim93
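
For readers unfamiliar with gradient accumulation, the following pure-Python sketch shows the arithmetic on a one-parameter model. It does not use oslo or torch, and the function names are illustrative. Gradients from several micro-batches are summed into one buffer before the optimizer step, so the buffer has to stay alive across backward passes.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]


def grad_mse(w: float, batch: Batch) -> float:
    """Derivative with respect to w of mean((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: Batch, num_micro_batches: int) -> float:
    """Accumulate micro-batch gradients; assumes the batch splits evenly."""
    size = len(batch) // num_micro_batches
    grad = 0.0  # accumulation buffer, kept across micro-batches
    for i in range(num_micro_batches):
        micro = batch[i * size:(i + 1) * size]
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        grad += grad_mse(w, micro) / num_micro_batches
    return grad


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, num_micro_batches=2)
```

Here `accum` matches `full`, which is the property gradient accumulation relies on: several small backward passes reproduce the gradient of one large batch.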

* [Fix] minor bug for single output in _DistributedDataParallel (EleutherAI#177)

## Title

- Fix minor bug for single output in _DistributedDataParallel

## Description

- This PR addresses a minor bug in the `_DistributedDataParallel` class
when handling single output tensors. The changes include:

1. Update the `forward` method in `_DistributedDataParallel` to
correctly handle single output tensors.
2. Add new test cases in
`tests_deprecated/torch/nn/parallel/data_parallel/data_parallel.py` to
ensure the correct behavior for models with various output types (single
tensor, multiple tensors, and dictionary of tensors).

These updates will ensure that the `_DistributedDataParallel` class
works correctly with various output types, providing a more robust
solution for users.
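
As an illustration of the output types covered by the new tests, here is a hypothetical helper that normalizes a model output into a flat list. It is not the code in `_DistributedDataParallel`; it only sketches why a single output needs its own branch instead of being iterated over like a tuple.

```python
from typing import Any, List


def flatten_outputs(outputs: Any) -> List[Any]:
    """Collect leaf values from a single value, a tuple/list, or a dict."""
    if isinstance(outputs, dict):
        leaves = []
        for value in outputs.values():
            leaves.extend(flatten_outputs(value))
        return leaves
    if isinstance(outputs, (tuple, list)):
        leaves = []
        for item in outputs:
            leaves.extend(flatten_outputs(item))
        return leaves
    # A single output (e.g. one tensor) is wrapped rather than iterated over.
    return [outputs]


single = flatten_outputs(1.5)
multiple = flatten_outputs((1, 2))
mapping = flatten_outputs({"logits": 1, "hidden": (2, 3)})
```

Plain numbers stand in for tensors here so the sketch runs without torch.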

## Linked Issues

- N/A

* [Enhance] Support ViT for TensorParallel (EleutherAI#155)

## Description

I added support for ViT in TensorParallel by appending its config to
`_TensorParallelMapping`.
Unlike the `Embedding` layer, the `PatchEmbed` layer in ViT does not
have a `weight` parameter, so I replaced the `weight` parameter with a
dummy value to prevent an `AttributeError`.

Any feedback is welcome.

### Memory usage
| mode | world_size=1 | world_size=2 | world_size=4 | world_size=8 |
|-|-|-|-|-|
| 1D | 1760MiB | 1126MiB | 789MiB | |
| 2D | | | 589MiB | |
| 2.5D (d=1) | | | 589MiB | |
| 2.5D (d=2) | | | | 586MiB |
| 3D | | | | |

### TODO
- [ ] Benchmark with `world_size=8`
- [ ] Refactor slicing patch embedding
- [ ] Fix slicing logic to return the same value as `TensorParallel1D`

<details><summary>code for testing</summary>
<p>

```python
import os
import torch.multiprocessing as mp

import torch
from torch import nn
from torch import optim
import torch.distributed as dist
from transformers import ViTModel, ViTForImageClassification, ViTConfig

import oslo
from oslo.torch.distributed.parallel_context import ParallelContext
from oslo.torch.distributed.parallel_mode import ParallelMode
from oslo.torch.nn.parallel import TensorParallel


def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12340"
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_WORLD_SIZE"] = str(world_size)


def cleanup():
    dist.destroy_process_group()


def train(rank, world_size):
    print(f"Running oslo TP example on rank {rank}.")
    setup(rank, world_size)
    parallel_context = ParallelContext.from_torch(
        tensor_parallel_size=world_size,
        tensor_parallel_mode=ParallelMode.TENSOR_1D,
    )  # or ParallelMode.TENSOR_2D / ParallelMode.TENSOR_2P5D

    model = ViTForImageClassification(ViTConfig(num_labels=1000)).to(rank)
    model = TensorParallel(model, parallel_context)
    optimizer = optim.SGD(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    oslo.ready(model, parallel_context)

    for _ in range(100):
        model.zero_grad()
        logits = model(pixel_values=torch.ones(8, 3, 224, 224).to(rank)).logits
        labels = torch.ones(8, 1000).to(rank) * 100
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        print(logits)
        print(torch.cuda.max_memory_allocated() / 1024**2)  # MiB

    cleanup()


def main(world_size):
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main(4)
```

</p>
</details> 

## Linked Issues

Related to EleutherAI#152

---------

Co-authored-by: Minho Ryu
Co-authored-by: Hansol Park
Co-authored-by: Ingyu Seong
Co-authored-by: whooray
Co-authored-by: Junhwa Song
6 people authored Apr 18, 2023
1 parent ea3d9f7 commit 1987953
Showing 37 changed files with 13,701 additions and 284 deletions.
2 changes: 1 addition & 1 deletion docs/CONCEPTS/tensor_model_parallelism.html
Expand Up @@ -296,7 +296,7 @@ <h2> Contents </h2>
<section class="tex2jax_ignore mathjax_ignore" id="concept-of-tensor-model-parallelism">
<h1>Concept of Tensor Model Parallelism<a class="headerlink" href="#concept-of-tensor-model-parallelism" title="Permalink to this heading">#</a></h1>
<ul class="simple">
<li><p>Authors: Kichang Yang, Kevin Ko</p></li>
<li><p>Authors: Kichang Yang, Kevin Ko, Minho Ryu</p></li>
</ul>
<p><strong>Tensor Model Parallelism</strong> makes it possible to train larger models by partitioning the parameter tensors into multiple dimensions.
We support 1D, 2D, 2.5D, and 3D tensor partitioning algorithms which make tensor parallel training more efficient.</p>
2 changes: 1 addition & 1 deletion docs/CONCEPTS/tp/1d_parallel_algorithm.html
Expand Up @@ -301,7 +301,7 @@ <h2>Usage<a class="headerlink" href="#usage" title="Permalink to this heading">#
<p>Use <code class="docutils literal notranslate"><span class="pre">ParallelMode.TENSOR_1D</span></code> as a parameter of <code class="docutils literal notranslate"><span class="pre">tensor_parallel_mode</span></code>. Model weight should be divisible by <code class="docutils literal notranslate"><span class="pre">tp_size</span></code>.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># model = defined in section 2.2</span>

<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span>
<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span><span class="p">,</span> <span class="n">ParallelMode</span>
<span class="kn">from</span> <span class="nn">oslo.torch.nn.parallel</span> <span class="kn">import</span> <span class="n">TensorParallel</span>

<span class="n">tp_size</span> <span class="o">=</span> <span class="mi">4</span>
2 changes: 1 addition & 1 deletion docs/CONCEPTS/tp/2d_parallel_algorithm.html
Expand Up @@ -302,7 +302,7 @@ <h1>2D parallel (SUMMA) algorithm<a class="headerlink" href="#d-parallel-summa-a
<section id="usage">
<h2>Usage<a class="headerlink" href="#usage" title="Permalink to this heading">#</a></h2>
<p>Use <code class="docutils literal notranslate"><span class="pre">ParallelMode.TENSOR_2D</span></code> as a parameter of <code class="docutils literal notranslate"><span class="pre">tensor_parallel_mode</span></code>. Since the algorithm splits model along both rows and columns, <code class="docutils literal notranslate"><span class="pre">tp_size</span></code> should be a <strong>square of positive integer</strong>.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span><span class="p">,</span> <span class="n">ParallelMode</span>
<span class="kn">from</span> <span class="nn">oslo.torch.nn.parallel</span> <span class="kn">import</span> <span class="n">TensorParallel</span>

<span class="n">tp_size</span> <span class="o">=</span> <span class="mi">4</span>
2 changes: 1 addition & 1 deletion docs/CONCEPTS/tp/2p5d_parallel_algorithm.html
Expand Up @@ -302,7 +302,7 @@ <h2>Usage<a class="headerlink" href="#usage" title="Permalink to this heading">#
It is recommended to set <code class="docutils literal notranslate"><span class="pre">tp_depth</span></code> to more than 1, as the algorithm becomes identical to the 2D algorithm if <code class="docutils literal notranslate"><span class="pre">tp_depth</span></code> is 1.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># model = defined in section 2.2</span>

<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span>
<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span><span class="p">,</span> <span class="n">ParallelMode</span>
<span class="kn">from</span> <span class="nn">oslo.torch.nn.parallel</span> <span class="kn">import</span> <span class="n">TensorParallel</span>

<span class="n">tp_size</span> <span class="o">=</span> <span class="mi">8</span>
2 changes: 1 addition & 1 deletion docs/CONCEPTS/tp/3d_parallel_algorithm.html
Expand Up @@ -301,7 +301,7 @@ <h2>Usage<a class="headerlink" href="#usage" title="Permalink to this heading">#
<p>Use <code class="docutils literal notranslate"><span class="pre">ParallelMode.TENSOR_3D</span></code> as a parameter of <code class="docutils literal notranslate"><span class="pre">tensor_parallel_mode</span></code>. <code class="docutils literal notranslate"><span class="pre">tp_size</span></code> should be a <strong>cubic of positive integer</strong>.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># model = defined in section 2.2</span>

<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span>
<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span><span class="p">,</span> <span class="n">ParallelMode</span>
<span class="kn">from</span> <span class="nn">oslo.torch.nn.parallel</span> <span class="kn">import</span> <span class="n">TensorParallel</span>

<span class="n">tp_size</span> <span class="o">=</span> <span class="mi">8</span>
6 changes: 3 additions & 3 deletions docs/TUTORIALS/tensor_model_parallelism.html
Expand Up @@ -313,7 +313,7 @@ <h2> Contents </h2>
<section class="tex2jax_ignore mathjax_ignore" id="tensor-model-parallelism-tutorial">
<h1>Tensor Model Parallelism Tutorial<a class="headerlink" href="#tensor-model-parallelism-tutorial" title="Permalink to this heading">#</a></h1>
<ul class="simple">
<li><p>Authors: Kichang Yang, Kevin Ko</p></li>
<li><p>Authors: Kichang Yang, Kevin Ko, Minho Ryu</p></li>
</ul>
<p><img alt="260461C3-EA3B-405C-9B34-05BA3C781161.png" src="../_images/260461C3-EA3B-405C-9B34-05BA3C781161.png" /></p>
<p><strong>Tensor Model Parallelism</strong>
Expand Down Expand Up @@ -409,7 +409,7 @@ <h3>1.2. Parallelize the model<a class="headerlink" href="#parallelize-the-model
<li><p><code class="docutils literal notranslate"><span class="pre">pipeline_parallel_size</span></code> must be 1 if you want to use <code class="docutils literal notranslate"><span class="pre">tensor_parallel</span></code> algorithm ( mixing PP and PP will be supported in later version.)</p></li>
</ul>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">oslo</span>
<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span>
<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span><span class="p">,</span> <span class="n">ParallelMode</span>
<span class="kn">from</span> <span class="nn">oslo.torch.nn.parallel</span> <span class="kn">import</span> <span class="n">TensorParallel</span>

<span class="n">tp_size</span> <span class="o">=</span> <span class="mi">4</span>
Expand Down Expand Up @@ -480,7 +480,7 @@ <h3>2.2. Create model, optimizer and tokenizer<a class="headerlink" href="#creat
<h3>2.3. Parallelize the model<a class="headerlink" href="#id1" title="Permalink to this heading">#</a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># model = defined in section 2.2</span>

<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span>
<span class="kn">from</span> <span class="nn">oslo</span> <span class="kn">import</span> <span class="n">ParallelContext</span><span class="p">,</span> <span class="n">ParallelMode</span>
<span class="kn">from</span> <span class="nn">oslo.torch.nn.parallel</span> <span class="kn">import</span> <span class="n">TensorParallel</span>

<span class="n">tp_size</span> <span class="o">=</span> <span class="mi">4</span>
2 changes: 1 addition & 1 deletion docs/_sources/CONCEPTS/tensor_model_parallelism.md
@@ -1,5 +1,5 @@
# Concept of Tensor Model Parallelism
- Authors: Kichang Yang, Kevin Ko
- Authors: Kichang Yang, Kevin Ko, Minho Ryu

**Tensor Model Parallelism** makes it possible to train larger models by partitioning the parameter tensors into multiple dimensions.
We support 1D, 2D, 2.5D, and 3D tensor partitioning algorithms which make tensor parallel training more efficient.
2 changes: 1 addition & 1 deletion docs/_sources/CONCEPTS/tp/1d_parallel_algorithm.md
Expand Up @@ -13,7 +13,7 @@ Use `ParallelMode.TENSOR_1D` as a parameter of `tensor_parallel_mode`. Model wei
```python
# model = defined in section 2.2

from oslo import ParallelContext
from oslo import ParallelContext, ParallelMode
from oslo.torch.nn.parallel import TensorParallel

tp_size = 4
2 changes: 1 addition & 1 deletion docs/_sources/CONCEPTS/tp/2d_parallel_algorithm.md
Expand Up @@ -14,7 +14,7 @@ The result is a matrix $Y$ that is the product of $X$ and $A$.
Use `ParallelMode.TENSOR_2D` as a parameter of `tensor_parallel_mode`. Since the algorithm splits model along both rows and columns, `tp_size` should be a **square of positive integer**.

```python
from oslo import ParallelContext
from oslo import ParallelContext, ParallelMode
from oslo.torch.nn.parallel import TensorParallel

tp_size = 4
2 changes: 1 addition & 1 deletion docs/_sources/CONCEPTS/tp/2p5d_parallel_algorithm.md
Expand Up @@ -14,7 +14,7 @@ It is recommended to set `tp_depth` to more than 1, as the algorithm becomes ide
```python
# model = defined in section 2.2

from oslo import ParallelContext
from oslo import ParallelContext, ParallelMode
from oslo.torch.nn.parallel import TensorParallel

tp_size = 8
2 changes: 1 addition & 1 deletion docs/_sources/CONCEPTS/tp/3d_parallel_algorithm.md
Expand Up @@ -13,7 +13,7 @@ Use `ParallelMode.TENSOR_3D` as a parameter of `tensor_parallel_mode`. `tp_size`
```python
# model = defined in section 2.2

from oslo import ParallelContext
from oslo import ParallelContext, ParallelMode
from oslo.torch.nn.parallel import TensorParallel

tp_size = 8
6 changes: 3 additions & 3 deletions docs/_sources/TUTORIALS/tensor_model_parallelism.md
@@ -1,5 +1,5 @@
# Tensor Model Parallelism Tutorial
- Authors: Kichang Yang, Kevin Ko
- Authors: Kichang Yang, Kevin Ko, Minho Ryu

![260461C3-EA3B-405C-9B34-05BA3C781161.png](image/260461C3-EA3B-405C-9B34-05BA3C781161.png)

Expand Down Expand Up @@ -86,7 +86,7 @@ Here is some explain about arguments to parallel_context.

```python
import oslo
from oslo import ParallelContext
from oslo import ParallelContext, ParallelMode
from oslo.torch.nn.parallel import TensorParallel

tp_size = 4
Expand Down Expand Up @@ -157,7 +157,7 @@ tokenizer.pad_token = tokenizer.eos_token
```python
# model = defined in section 2.2

from oslo import ParallelContext
from oslo import ParallelContext, ParallelMode
from oslo.torch.nn.parallel import TensorParallel

tp_size = 4