
Are InfiniBand and Torch Elastic system requirements? #7

Open
deckar01 opened this issue Sep 7, 2023 · 4 comments

Comments


deckar01 commented Sep 7, 2023

Megatron seems to be trying to connect to InfiniBand even when NCCL_NET=Socket, failing with Error: network IB not found. docker_launch.sh passes --device=/dev/infiniband, but the readme doesn't mention any related hardware requirements.
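If NCCL is only probing the IB plugin, setting NCCL_IB_DISABLE might force the socket transport outright, though I haven't confirmed it fixes this particular error:

export NCCL_IB_DISABLE=1
export NCCL_NET=Socket
# NCCL_SOCKET_IFNAME pins the network interface; the name below is just an example
export NCCL_SOCKET_IFNAME=eth0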

Running run_text_generation_server.py hits ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set, which again seems to suggest the code expects a specific server configuration.
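For reference, the env:// rendezvous expects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE to be set when the script is launched with plain python; a single-node sketch (the port number is arbitrary):

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export RANK=0
export WORLD_SIZE=1

Launching through torchrun instead sets these automatically.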

Here are some miscellaneous other errors I ran into when building:

The readme's docker build command errors on Windows:

docker build -f docker/Dockerfile -t 'adeptdocker' . -> docker build -f docker/Dockerfile -t adeptdocker .

flash-attn==2.0.0.post1 fails to install and retries for 20 minutes, locking docker into an operation that can't be aborted without rebooting:

pip install flash-attn==2.0.0.post1 -> pip install flash-attn==2.2.1


abacaj commented Sep 8, 2023

Got this working on my local 3090, am adding modifications here: https://github.com/abacaj/adept-inference-local-3090
[screenshot attached]


abacaj commented Sep 8, 2023

This change is working for me: abacaj@4e6a503

If you don't want to run this in docker (I didn't), you'll need to follow these steps at minimum:

git clone https://github.com/HazyResearch/flash-attention \
    && cd flash-attention && git checkout b8020d73c9e068665586989883083a4a5429a443 \
    && cd csrc/ft_attention && pip install .

Next:

cd megatron/fused_kernels \
    && python setup.py install sdist

Next, find the directory where the fused kernels were installed; mine was as below (venv):

cd /home/anton/personal/transformer-experiments/env/lib/python3.10/site-packages/megatron_fused_kernels-0.0.0-py3.10-linux-x86_64.egg \
    && mv *.so megatron_fused_kernels/
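A quick sanity check that the package resolves after the move (assuming the top-level package imports cleanly once the .so files sit next to its Python files):

python -c "import megatron_fused_kernels; print(megatron_fused_kernels.__file__)"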

Try running the app with this command from the root of the project directory:

torchrun --nproc-per-node 1 --nnodes 1 run_text_generation_server.py \
    --no-load-rng \
    --no-load-optim \
    --no-initialization \
    --top_p 0.9 \
    --port 6001 \
    --micro-batch-size 1 \
    --load 8b_base_model_release \
    --use-flash-attn \
    --sp-model-file 8b_base_model_release/adept_vocab.model \
    --bf16 \
    --inference-max-seqlen 4096

Send a JSON PUT request to:

http://127.0.0.1:6001/api

Example body:

{
    "prompts": [
        "Hello"
    ]
}
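For example, with curl (the exact response format isn't documented in the repo, so treat this as a sketch):

curl -X PUT http://127.0.0.1:6001/api \
    -H "Content-Type: application/json" \
    -d '{"prompts": ["Hello"]}'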

ekelsen (Contributor) commented Sep 8, 2023

Thanks for figuring this out for everyone @abacaj


aharmax commented Oct 18, 2023

Hi all, is there a solution to the missing /dev/infiniband device? I don't have the rights to install the InfiniBand driver on my system.
