Problems met when trying the code #2

chenzhengdeeplearning · 2021-01-04T06:45:07Z

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown. docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown.

VisualJoyce · 2021-01-04T06:54:15Z

This is because the NVIDIA driver on your machine is too old for the default image, which requires cuda>=11.0.

Please try this tag of the Docker image instead: visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel.
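
As a rough illustration of the rule behind that error: the message says the default image needs a driver that supports cuda>=11.0, while the tag above is built for CUDA 10.1. The helper below is hypothetical (choose_image_tag is not part of the repository). It assumes you already know the highest CUDA version your driver supports, for example from the header of nvidia-smi, and that the CUDA 10.1 image only needs a driver supporting CUDA 10.1 or newer.

```python
# Hypothetical helper, not part of the ChengyuBERT repository.
# At the time of this thread, the "latest" tag required cuda>=11.0 (see the error above).
LATEST_TAG = "visualjoyce/chengyubert:latest"
CUDA101_TAG = "visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel"


def parse_version(text):
    """Turn a version string such as '10.1' or '11' into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def choose_image_tag(driver_cuda_version):
    """Pick an image tag that the driver is expected to satisfy.

    Assumption: the CUDA 10.1 image runs on any driver supporting CUDA >= 10.1.
    """
    version = parse_version(driver_cuda_version)
    if version >= (11, 0):
        return LATEST_TAG
    if version >= (10, 1):
        return CUDA101_TAG
    raise ValueError(
        f"driver supports CUDA {driver_cuda_version}; neither image is expected to start"
    )


if __name__ == "__main__":
    print(choose_image_tag("10.2"))
```

Whichever tag is chosen still has to be written into docker_train.sh by hand.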

chenzhengdeeplearning · 2021-01-04T07:07:00Z

Sorry, how do I use visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel?

VisualJoyce · 2021-01-04T07:58:57Z

diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@ docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
   --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
   --mount src="$TXT_DB",dst=/txt,type=bind \
   -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-  -w /src visualjoyce/chengyubert:latest \
+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
   bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
     python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"

chenzhengdeeplearning · 2021-01-04T12:31:25Z

docker run --gpus '"'device=$CUDA_VISIBLE_DEVICES'"' --ipc=host --rm -it \
  --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \
  -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-  -w /src visualjoyce/chengyubert:latest \
+  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \
  bash -c " PYTHONPATH=/src ${MODEL_PARA} ${HOROVOD_PARA} \\
    python train_${SUB_PROJECT}.py --config=$CONFIG_DIR/$CONFIG_FILE"

This part is in docker_train.sh.

But what does this mean?
diff --git a/docker_train.sh b/docker_train.sh
index 3788e15..b21df38 100644
--- a/docker_train.sh
+++ b/docker_train.sh
@@ -45,6 +45,6 @@

chenzhengdeeplearning · 2021-01-04T12:34:05Z

Before, it was:

  --mount src="${WORK_DIR}",dst=/src,type=bind \
  --mount src="$OUTPUT",dst=/storage,type=bind \
  --mount src="$PRETRAIN_DIR",dst=/pretrain,type=bind,readonly \
  --mount src=$ANNOTATION_DIR,dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \

Now it is:

  --mount src="$ANNOTATION_DIR",dst=/annotations,type=bind,readonly \
  --mount src="$TXT_DB",dst=/txt,type=bind \

Is my understanding right that 3 rows were deleted?

VisualJoyce · 2021-01-04T12:44:56Z

A diff shows the edits that were made: a line starting with - is removed, and a line starting with + is added.

My last post means that you need to change visualjoyce/chengyubert:latest to visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel in docker_train.sh.
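
To make the diff notation concrete, here is a minimal sketch that rebuilds a similar diff with Python's standard difflib module. The two-line "files" are shortened stand-ins for docker_train.sh, not its real contents. Lines that start with a space are unchanged context shown only for orientation, so nothing else in the script is deleted. The @@ header gives the start line and line count of the hunk in the old and new file.

```python
import difflib

# Shortened stand-ins for the old and new docker_train.sh (illustration only).
old_lines = [
    "  -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \\\n",
    "  -w /src visualjoyce/chengyubert:latest \\\n",
]
new_lines = [
    "  -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \\\n",
    "  -w /src visualjoyce/chengyubert:1.6.0-cuda10.1-cudnn7-devel \\\n",
]

diff_lines = list(
    difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="a/docker_train.sh",
        tofile="b/docker_train.sh",
    )
)

# Separate real changes from the "---" / "+++" file headers.
removed = [l for l in diff_lines if l.startswith("-") and not l.startswith("---")]
added = [l for l in diff_lines if l.startswith("+") and not l.startswith("+++")]
context = [l for l in diff_lines if l.startswith(" ")]

if __name__ == "__main__":
    print("".join(diff_lines))
```

Here removed and added each hold exactly one line (the image tag), while the -e line appears only as context.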

chenzhengdeeplearning · 2021-01-04T12:49:23Z

Thanks for your patience.

parser.add_argument("--model", default='paired',
                        choices=['snlive'],
                        help="choose from 2 model architecture")

What do paired and snlive mean?

Here is another error I got.

 File "train_official.py", line 304, in main
    raise ValueError(f"No such model [{opts.model}] supported!")
ValueError: No such model [paired] supported!

VisualJoyce · 2021-01-04T12:56:26Z

This is a copy-paste leftover from earlier code. For now you can ignore the parameter, as it is overridden by MODEL=chengyubert-dual on the command line.

I will fix that in my next version.
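
For readers wondering how 'paired' got past the parser at all: argparse validates values given on the command line against choices, but it does not check the default against them. The sketch below reproduces only the parser snippet quoted above, not the rest of train_official.py.

```python
import argparse


def build_parser():
    """Rebuild just the --model option quoted in this thread."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="paired",
                        choices=["snlive"],
                        help="choose from 2 model architecture")
    return parser


# No --model on the command line: the default is used without a choices check.
opts_default = build_parser().parse_args([])

# An explicit value is validated against choices.
opts_explicit = build_parser().parse_args(["--model", "snlive"])

if __name__ == "__main__":
    print(opts_default.model)
    print(opts_explicit.model)
```

So a default outside choices passes through silently and only fails later, when the training script looks up the model by name.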

chenzhengdeeplearning · 2021-01-04T13:12:18Z

Sorry for running into so many errors...
Here is another one. I don't even know why it occurs, because there is only one process on my computer.

Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated.

VisualJoyce · 2021-01-04T13:27:38Z

That's OK. Thank you for pointing these out!

The code supports multiple GPUs. If you only have one GPU, use CUDA_VISIBLE_DEVICES=0.

chenzhengdeeplearning · 2021-01-04T13:29:54Z

ok!
And what is 'len_idiom_vocab'?

VisualJoyce · 2021-01-04T13:33:13Z

len_idiom_vocab is 3848 for the ChID dataset.

Our next paper, which supports more than 30k idioms, is under review.

So it's a parameter for future compatibility.

chenzhengdeeplearning · 2021-01-04T13:41:43Z

The mode I chose is 'train'.

This is your code in train_official.py:
if opts.mode == 'train':
    # data loaders
    splits, dataloaders = create_dataloaders(LOGGER, DatasetCls, EvalDatasetCls,
                                             collate_fn, eval_collate_fn, opts, splits=['train', 'val'])
    best_ckpt = train(model, dataloaders, opts)
else:
    splits = []
    for k in dir(opts):
        if k.endswith('_txt_db'):

The error is:

AttributeError: 'Namespace' object has no attribute 'train_txt_db'
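
One way to narrow down an error like this is to list which *_txt_db options the Namespace actually carries before training starts. This is a minimal sketch, not code from the repository: the helper names and the example values are made up, and it says nothing about where the real options are supposed to come from.

```python
from argparse import Namespace


def txt_db_options(opts):
    """Return the names of all attributes on opts that end with '_txt_db'."""
    return sorted(k for k in vars(opts) if k.endswith("_txt_db"))


def require_option(opts, name):
    """Return opts.<name>, or fail with a message listing what is present."""
    if not hasattr(opts, name):
        present = ", ".join(txt_db_options(opts)) or "none"
        raise AttributeError(
            f"option '{name}' is missing; *_txt_db options present: {present}"
        )
    return getattr(opts, name)


# Made-up example: a Namespace that has val_txt_db but no train_txt_db.
example_opts = Namespace(mode="train", val_txt_db="/txt/val")

if __name__ == "__main__":
    print(txt_db_options(example_opts))
    try:
        require_option(example_opts, "train_txt_db")
    except AttributeError as err:
        print(err)
```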

VisualJoyce · 2021-01-04T13:55:37Z

I can run the code on my machine with

CUDA_VISIBLE_DEVICES=0 CONFIG_FILE="train-official-bert-base-1gpu.json" \
bash docker_train.sh official \
"MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"

Can you post your command line?

chenzhengdeeplearning · 2021-01-04T14:03:00Z

This is mine.
CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE="train-official-bert-base-1gpu.json" bash docker_train.sh official "MODEL=chengyubert-dual ENLARGED_CANDIDATES=1 LEARNING_RATE=0.0001 NUM_TRAIN_STEPS=15003 GRADIENT_ACCUMULATION_STEPS=1 VALID_STEPS=100 GRAD_NORM=1"

When I run your command, I get the same error...
Is my data directory missing something?
I don't have train_txt_db...

VisualJoyce · 2021-01-04T14:05:11Z

Have you done the preprocessing without error?

chenzhengdeeplearning · 2021-01-04T14:17:13Z

Oh, I'm sorry, I got that wrong before.
Thanks for the reminder!

VisualJoyce · 2021-01-04T14:19:07Z

Also, your data/pretrained structure does not match the documentation.

chenzhengdeeplearning · 2021-01-04T14:26:39Z

Copy that

chenzhengdeeplearning · 2021-01-04T14:48:58Z

After preprocessing, I still get the error...

chenzhengdeeplearning · 2021-01-04T14:51:05Z

My directories are locked. Is that expected?

VisualJoyce · 2021-01-04T15:28:33Z

They are locked because they were generated inside Docker as root. It should still work.

chenzhengdeeplearning · 2021-01-04T15:34:15Z

OK, and...

[1,1]<stderr>:    raise EnvironmentError(msg)
[1,1]<stderr>:OSError: Can't load config for './pretrain/wwm_ext'. Make sure that:
[1,1]<stderr>:
[1,1]<stderr>:- './pretrain/wwm_ext' is a correct model identifier listed on 'https://huggingface.co/models'
[1,1]<stderr>:
[1,1]<stderr>:- or './pretrain/wwm_ext' is the correct path to a directory containing a config.json file

VisualJoyce · 2021-01-04T15:38:03Z

Do you have data/pretrained/wwm_ext? If not, you need to download BERT-wwm-ext from Chinese-BERT-wwm.

VisualJoyce · 2021-01-04T15:39:55Z

Or change the value of pretrained_model_name_or_path in the config file to hfl/chinese-bert-wwm-ext.
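
The two options above can be combined into a simple check: use the local directory if it contains a config.json, otherwise fall back to the hub identifier. This is a minimal sketch using only the standard library. resolve_pretrained is a hypothetical helper, not part of the repository, and the fallback makes transformers download the model, which needs network access.

```python
import os
import tempfile

# Hub identifier suggested above.
HUB_FALLBACK = "hfl/chinese-bert-wwm-ext"


def resolve_pretrained(path, fallback=HUB_FALLBACK):
    """Return path if it is a directory containing config.json, else the fallback id."""
    if os.path.isfile(os.path.join(path, "config.json")):
        return path
    return fallback


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Empty directory: the hub identifier is chosen.
        print(resolve_pretrained(tmp))
        # After adding a config.json, the local path is chosen.
        open(os.path.join(tmp, "config.json"), "w").close()
        print(resolve_pretrained(tmp) == tmp)
```

The returned value is what would go into pretrained_model_name_or_path in the config file.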

chenzhengdeeplearning · 2021-01-04T15:54:48Z

It did work!
My GPU doesn't have enough CUDA memory, so I changed the train batch size and val batch size to 2048.
And...
subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.

What does that mean?

VisualJoyce · 2021-01-04T16:04:15Z

You need to paste the full log; otherwise it's hard to tell where the problem might be. When you post the log, please use a code block so that it is easier to read.

chenzhengdeeplearning · 2021-01-04T16:12:35Z

[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Waiting on git info....
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Git branch: 
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:01/04/2021 16:11:34 - INFO - __main__ -   Git SHA: 
[1,0]<stderr>:fatal: not a git repository (or any parent up to mount point /)
[1,0]<stderr>:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train_official.py", line 468, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "train_official.py", line 317, in main
[1,0]<stderr>:    best_ckpt = train(model, dataloaders, opts)
[1,0]<stderr>:  File "train_official.py", line 49, in train
[1,0]<stderr>:    save_training_meta(opts)
[1,0]<stderr>:  File "/src/chengyubert/utils/save.py", line 45, in save_training_meta
[1,0]<stderr>:    cwd=git_dir, universal_newlines=True).strip()
[1,0]<stderr>:  File "/opt/conda/lib/python3.7/subprocess.py", line 411, in check_output
[1,0]<stderr>:    **kwargs).stdout
[1,0]<stderr>:  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
[1,0]<stderr>:    output=stdout, stderr=stderr)
[1,0]<stderr>:subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.

Thank you very much!

VisualJoyce · 2021-01-04T16:13:59Z

Are you using a cloned repo, or did you download the zip from master?

chenzhengdeeplearning · 2021-01-04T16:16:29Z

Yes, I downloaded the zip.

VisualJoyce · 2021-01-04T16:18:21Z

The code tries to query the git info and fails because a zip download is not a git repository. Either clone the repo or comment out the line that queries the git status.
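
A third option, shown here only as a sketch, is to make the git query tolerant of a missing repository. This is not the actual code in chengyubert/utils/save.py. git_output is a hypothetical wrapper around the same subprocess.check_output call that appears in the traceback, and it returns an empty string instead of raising.

```python
import subprocess


def git_output(args, cwd="."):
    """Run a git command and return its stripped output, or '' if it cannot run."""
    try:
        return subprocess.check_output(
            ["git", *args],
            cwd=cwd,
            universal_newlines=True,
            stderr=subprocess.DEVNULL,  # hide "fatal: not a git repository"
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Non-zero exit (e.g. status 128 outside a repo) or git not installed.
        return ""


if __name__ == "__main__":
    status = git_output(["status", "--short"])
    print("git status:", status or "<unavailable or clean>")
```

With a wrapper like this, training metadata would simply record empty git fields when the code was downloaded as a zip.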

chenzhengdeeplearning · 2021-01-04T16:31:25Z

It finally works!
Thank you sooooo much!!!

VisualJoyce · 2021-01-04T17:25:04Z

Glad that works! I will add more details in the next release.

chenzhengdeeplearning closed this as completed Jan 4, 2021

VisualJoyce reopened this Jan 4, 2021

VisualJoyce changed the title ~~docker: Error response from daemon: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container: unknown.~~ Problems met when trying the code Jan 4, 2021

VisualJoyce pinned this issue Jan 4, 2021

VisualJoyce added the documentation Improvements or additions to documentation label Jan 4, 2021

VisualJoyce closed this as completed Jan 4, 2021
