
Are InfiniBand and Torch Elastic system requirements? #7

Open
deckar01 opened this issue Sep 7, 2023 · 4 comments

Comments


deckar01 commented Sep 7, 2023

Megatron seems to be trying to connect to InfiniBand even when NCCL_NET=Socket, failing with Error: network IB not found. docker_launch.sh passes --device=/dev/infiniband, but the readme doesn't mention any related hardware requirements.
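If NCCL is only probing the IB plugin, setting NCCL_IB_DISABLE might force the socket transport outright, though I haven't confirmed it fixes this particular error:

export NCCL_IB_DISABLE=1
export NCCL_NET=Socket
# NCCL_SOCKET_IFNAME pins the network interface; the name below is just an example
export NCCL_SOCKET_IFNAME=eth0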

Running run_text_generation_server.py hits ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set, which again seems to suggest the code expects a specific server configuration.
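For reference, the env:// rendezvous expects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE to be set when the script is launched with plain python; a single-node sketch (the port number is arbitrary):

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export RANK=0
export WORLD_SIZE=1

Launching through torchrun instead sets these automatically.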

Here are some miscellaneous other errors I ran into when building:

The readme's docker build command errors on Windows:

docker build -f docker/Dockerfile -t 'adeptdocker' . -> docker build -f docker/Dockerfile -t adeptdocker .

flash-attn==2.0.0.post1 fails to install and retries for 20 minutes, locking docker into an operation that can't be aborted without rebooting:

pip install flash-attn==2.0.0.post1 -> pip install flash-attn==2.2.1


abacaj commented Sep 8, 2023

Got this working on my local 3090, am adding modifications here: https://github.com/abacaj/adept-inference-local-3090
[screenshot attached]


abacaj commented Sep 8, 2023

This change is working for me: abacaj@4e6a503

If you don't want to run this in docker (I didn't), you'll need to follow these steps at minimum:

git clone https://github.com/HazyResearch/flash-attention \
    && cd flash-attention && git checkout b8020d73c9e068665586989883083a4a5429a443 \
    && cd csrc/ft_attention && pip install .

Next:

cd megatron/fused_kernels \
    && python setup.py install sdist

Next, find the directory where the fused kernels were installed; mine was as below (venv):

cd /home/anton/personal/transformer-experiments/env/lib/python3.10/site-packages/megatron_fused_kernels-0.0.0-py3.10-linux-x86_64.egg \
    && mv *.so megatron_fused_kernels/
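A quick sanity check that the package resolves after the move (assuming the top-level package imports cleanly once the .so files sit next to its Python files):

python -c "import megatron_fused_kernels; print(megatron_fused_kernels.__file__)"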

Try running the app with this command from the root of the project directory:

torchrun --nproc-per-node 1 --nnodes 1 run_text_generation_server.py \
    --no-load-rng \
    --no-load-optim \
    --no-initialization \
    --top_p 0.9 \
    --port 6001 \
    --micro-batch-size 1 \
    --load 8b_base_model_release \
    --use-flash-attn \
    --sp-model-file 8b_base_model_release/adept_vocab.model \
    --bf16 \
    --inference-max-seqlen 4096

Send a JSON PUT request to:

http://127.0.0.1:6001/api

Example body:

{
    "prompts": [
        "Hello"
    ]
}
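For example, with curl (the exact response format isn't documented in the repo, so treat this as a sketch):

curl -X PUT http://127.0.0.1:6001/api \
    -H "Content-Type: application/json" \
    -d '{"prompts": ["Hello"]}'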

ekelsen (Contributor) commented Sep 8, 2023

Thanks for figuring this out for everyone @abacaj


aharmax commented Oct 18, 2023

Hi all, is there a solution to the missing /dev/infiniband device? I don't have the rights to install the InfiniBand driver on my system.
