Problems met when trying the code #2

Closed
chenzhengdeeplearning opened this issue Jan 4, 2021 · 33 comments
Labels
documentation Improvements or additions to documentation

Comments

@chenzhengdeeplearning

chenzhengdeeplearning commented Jan 4, 2021

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown. docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown. 
@VisualJoyce
Collaborator

This is because the NVIDIA driver on your machine is too old for the CUDA 11.0 image.

Please try this Docker image tag instead: visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel.

@chenzhengdeeplearning
Author

Sorry, how do I use visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel?

@VisualJoyce
Collaborator

diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@ docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
   --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
   --mount src="$TXT_DB",dst=/txt,type=bind \
   -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-  -w /src visualjoyce/chengyubert:latest \
+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
   bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
     python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"

@chenzhengdeeplearning
Author

docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
   --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
   --mount src="$TXT_DB",dst=/txt,type=bind \
   -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-  -w /src visualjoyce/chengyubert:latest \
+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
   bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
     python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"

This part is in docker_train.sh.

But what does this mean?

diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@

@chenzhengdeeplearning
Author

chenzhengdeeplearning commented Jan 4, 2021

Before, it was this:

  --mount src="${WORK_DIR}",dst=/src,type=bind \
  --mount src="$OUTPUT",dst=/storage,type=bind \
  --mount src="$PRETRAIN_DIR",dst=/pretrain,type=bind,readonly \
  --mount src=$ANNOTATION_DIR,dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \

Now it is this:

  --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \

Is my understanding right that three lines were deleted?

@VisualJoyce
Collaborator

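A diff shows the edits that were made: a line starting with - was removed, and a line starting with + was added. Lines without either marker are unchanged context that the diff prints around the edit, so no --mount lines were deleted.

My last post means that you need to change visualjoyce/chengyubert:latest to visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel in docker_train.sh.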
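
For readers unfamiliar with the format, here is a minimal Python sketch of how the body of a unified-diff hunk can be read. The helper name `split_hunk` is hypothetical and is not part of this repository. It only handles the lines that follow the `@@` header, not the `---`/`+++` file header lines.

```python
def split_hunk(hunk_lines):
    """Sort the body lines of one unified-diff hunk into removed, added and context."""
    removed, added, context = [], [], []
    for line in hunk_lines:
        if line.startswith("-"):
            removed.append(line[1:].strip())   # present only in the old file
        elif line.startswith("+"):
            added.append(line[1:].strip())     # present only in the new file
        else:
            context.append(line.strip())       # unchanged, shown for orientation
    return removed, added, context


# Four body lines taken from the hunk posted above.
hunk = [
    '   --mount src="$TXT_DB",dst=/txt,type=bind \\',
    '   -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \\',
    '-  -w /src visualjoyce/chengyubert:latest \\',
    '+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \\',
]
removed, added, context = split_hunk(hunk)
print("removed:", removed)
print("added:  ", added)
print("unchanged lines:", len(context))
```

Only the image tag line appears in both the removed and added lists; the `--mount` and `-e` lines land in the unchanged context.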

@chenzhengdeeplearning
Author

chenzhengdeeplearning commented Jan 4, 2021

Thanks for your patience.

parser.add_argument("--model", default='paired',
                        choices=['snlive'],
                        help="choose from 2 model architecture")

What do paired and snlive mean?

Here is another error I got:

 File "train_official.py", line 304, in main
    raise ValueError(f"No such model [{opts.model}] supported!")
ValueError: No such model [paired] supported!

@VisualJoyce
Collaborator

This is a leftover that was copied over from earlier code. For now you can ignore that parameter, because it is overridden by MODEL=chengyubert-dual on the command line.

I will fix it in the next version.
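
As a side note on why the parser itself did not complain: Python's `argparse` checks `choices` only for values that are actually passed on the command line, not for the `default`. The sketch below reuses the argument definition quoted above in isolation. How `MODEL=chengyubert-dual` reaches the script is not shown in this thread, so that part is left out.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="paired",
                    choices=["snlive"],
                    help="choose from 2 model architecture")

# No --model given: the default is used as-is and is not checked against choices.
opts = parser.parse_args([])
print(opts.model)  # paired

# An explicit value is validated against choices.
opts = parser.parse_args(["--model", "snlive"])
print(opts.model)  # snlive
```

So a default that is missing from `choices` passes parsing silently and only fails later, at the point where the code looks up the model by name.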

@chenzhengdeeplearning
Author

chenzhengdeeplearning commented Jan 4, 2021

Sorry that I have run into so many errors...
Here is another one. I don't even know why it occurs, because there is only one process running on my computer.

Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated.

@VisualJoyce
Collaborator

That's OK. Thank you for pointing these out!

The code supports multiple GPUs. If you have only one GPU, use CUDA_VISIBLE_DEVICES=0.

@chenzhengdeeplearning
Author

ok!
And what is 'len_idiom_vocab'?

@VisualJoyce
Collaborator

len_idiom_vocab is 3848 for the ChID dataset.

Our next paper, which is currently under review, supports more than 30k idioms, so this parameter is there for future compatibility.

@chenzhengdeeplearning
Author

The mode I chose is 'train'.

Your code in train_official.py is this:

    if opts.mode == 'train':
        # data loaders
        splits, dataloaders = create_dataloaders(LOGGER, DatasetCls, EvalDatasetCls,
                                                 collate_fn, eval_collate_fn, opts, splits=['train', 'val'])
        best_ckpt = train(model, dataloaders, opts)
    else:
        splits = []
        for k in dir(opts):
            if k.endswith('_txt_db'):

The error is:

AttributeError: 'Namespace' object has no attribute 'train_txt_db'

@VisualJoyce
Collaborator

I can run the code on my machine with

CUDA_VISIBLE_DEVICES=0 CONFIG_FILE="train-official-bert-base-1gpu.json" \
bash docker_train.sh official \
"MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"

Can you post the command line you used?

@chenzhengdeeplearning
Author

This is mine:
CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE="train-official-bert-base-1gpu.json" bash docker_train.sh official "MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"

I also ran your command, and it gives the same error...
Is something missing from my data directory?
I don't have train_txt_db.

(image)

@VisualJoyce
Collaborator

Have you done the preprocessing without error?

@VisualJoyce reopened this on Jan 4, 2021
@VisualJoyce changed the title from "docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown." to "docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown." on Jan 4, 2021
@chenzhengdeeplearning
Author

Oh, I am so sorry, I got that wrong before.
Thanks for the reminder!

@VisualJoyce
Collaborator

Also, your data/pretrained directory structure does not match the documentation.

@chenzhengdeeplearning
Author

Copy that

@chenzhengdeeplearning
Author

After preprocessing, I still get the error...

@chenzhengdeeplearning
Author

My directories are shown as locked. Is that expected?
(image)

@VisualJoyce
Collaborator

They are locked because they were created from inside Docker as root. It should still work.

@chenzhengdeeplearning
Author

chenzhengdeeplearning commented Jan 4, 2021

ok, And..

[1,1]<stderr>:    raise EnvironmentError(msg)
[1,1]<stderr>:OSError: Can't load config for './pretrain/wwm_ext'. Make sure that:
[1,1]<stderr>:
[1,1]<stderr>:- './pretrain/wwm_ext' is a correct model identifier listed on 'https://huggingface.co/models'
[1,1]<stderr>:
[1,1]<stderr>:- or './pretrain/wwm_ext' is the correct path to a directory containing a config.json file

@VisualJoyce
Collaborator

Do you have data/pretrained/wwm_ext? If not, you need to download BERT-wwm-ext from Chinese-BERT-wwm.

@VisualJoyce
Collaborator

Or change the value of pretrained_model_name_or_path in the config file to hfl/chinese-bert-wwm-ext.
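
The two options above can be combined into a small check before launching training. This is only a sketch: the helper `resolve_pretrained` is hypothetical and not part of this repository. It relies on the hint in the error message that a local model directory must contain a config.json file, and it falls back to the hub identifier mentioned above.

```python
import os


def resolve_pretrained(local_dir, hub_id="hfl/chinese-bert-wwm-ext"):
    """Return local_dir if it holds a config.json, otherwise the hub identifier."""
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    # No usable local copy: let the model be fetched by its hub name instead.
    return hub_id


# The path from the error message; the result depends on what is on disk.
print(resolve_pretrained("./pretrain/wwm_ext"))
```

The returned string is what would go into pretrained_model_name_or_path in the config file.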

@chenzhengdeeplearning
Author

It worked!
My GPU doesn't have enough CUDA memory, so I changed the train batch size and val batch size to 2048.
And then:
subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.

What does that mean?

@VisualJoyce
Collaborator

Please paste the full log; otherwise it is hard to tell where the problem is. When you post it, put it in a code block so that it is easier to read.

@VisualJoyce changed the title from "docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown." to "Problems met when trying the code" on Jan 4, 2021
@chenzhengdeeplearning
Author

chenzhengdeeplearning commented Jan 4, 2021

[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Waiting on git info....
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Git branch: 
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Git SHA: 
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train_official.py", line 468, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "train_official.py", line 317, in main
[1,0]<stderr>:    best_ckpt = train(model, dataloaders, opts)
[1,0]<stderr>:  File "train_official.py", line 49, in train
[1,0]<stderr>:    save_training_meta(opts)
[1,0]<stderr>:  File "/src/chengyubert/utils/save.py", line 45, in save_training_meta
[1,0]<stderr>:    cwd=git_dir, universal_newlines=True).strip()
[1,0]<stderr>:  File "/opt/conda/lib/python3.7/subprocess.py", line 411, in check_output
[1,0]<stderr>:    **kwargs).stdout
[1,0]<stderr>:  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
[1,0]<stderr>:    output=stdout, stderr=stderr)
[1,0]<stderr>:subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.

Thank you very much!

@VisualJoyce
Collaborator

Are you using a cloned repo, or did you download the zip from master?

@chenzhengdeeplearning
Author

Yes, I downloaded the zip.

@VisualJoyce
Collaborator

The code tries to query the git info and fails, because a directory unpacked from the zip is not a git repository. Either clone the repo, or comment out the line that queries the git status.
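
A third option, shown here only as a sketch, is to make the git query tolerant of failure. The helper `git_output` is hypothetical and is not the code in chengyubert/utils/save.py. It wraps the same `subprocess.check_output` call seen in the traceback and returns a placeholder instead of raising.

```python
import subprocess


def git_output(args, cwd, fallback="unknown"):
    """Run a git command in cwd; return fallback instead of raising if it fails."""
    try:
        return subprocess.check_output(
            ["git", *args], cwd=cwd,
            stderr=subprocess.DEVNULL, universal_newlines=True,
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        # CalledProcessError: git exited non-zero, e.g. status 128 outside a repository.
        # OSError: git is not installed, or cwd does not exist.
        return fallback


# Same query as in the traceback, run against the current directory.
print(git_output(["status", "--short"], cwd="."))
```

With a wrapper like this, training from a zip download would record "unknown" for the git fields rather than abort.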

@VisualJoyce pinned this issue on Jan 4, 2021
@VisualJoyce added the documentation label on Jan 4, 2021
@chenzhengdeeplearning
Author

It finally works!
Thank you sooooo much!!!

@VisualJoyce
Collaborator

Glad it works! I will add more details in the next release.
