Running problem on Linux server #342

Open
looperalt opened this issue Dec 23, 2024 · 9 comments

@looperalt

When I run iBVPnet on a Linux server, the following problems occur:
TypeError: Binding inputs to tf.function wrapped_fn failed due to Can not cast TensorSpec(shape=(1, 1024, 1365, 4), dtype=tf.float32, name=None) to TensorSpec(shape=(None, None, None, 3), dtype=tf.float32, name=None). Received args: (array([[[[149. , 155. , 163. , 0. ]
and
ValueError: ('train', 'No files in file list')
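
For reference, the rejected tensor in the TypeError has 4 channels (shape (1, 1024, 1365, 4)) while the tf.function signature expects 3. A minimal sketch, using dummy data rather than the toolbox's actual code, of slicing a 4-channel frame down to the 3 channels that signature expects:

```python
import numpy as np

# Stand-in for one 4-channel frame (e.g., RGB plus a thermal/alpha channel)
# of the size shown in the error message.
frame = np.zeros((1024, 1365, 4), dtype=np.float32)

# Keeping only the first three channels yields a 3-channel input.
rgb_frame = frame[..., :3]
print(rgb_frame.shape)  # -> (1024, 1365, 3)
```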

@yahskapar
Collaborator

Hi @looperalt,

My guess is that something went wrong when preprocessing the dataset - did you check your terminal before that last error message to see whether preprocessing actually completed (e.g., the preprocessing progress bar showed up and ran to the end)? Remember, you have to preprocess the dataset the first time you use it, and again any time you change key parameters in your config file that affect preprocessing (refer to the example config files for more details). Also note the project README and how datasets are expected to be organized (in most cases, the way they are downloaded).

If it seems like preprocessing somehow starts and then abruptly stops, you could try seeing if the issue is the default multi_process_quota configuration here. Try setting that to a lower value (perhaps 1 to begin with).
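
To illustrate what that setting controls, here is a rough, self-contained sketch of a worker quota; this is illustrative only and not the toolbox's actual multi_process_quota code:

```python
import multiprocessing as mp
import time

def preprocess_subject(subject_dir):
    # Placeholder for the real per-subject preprocessing work.
    print(f"preprocessing {subject_dir}")

def run_with_quota(subject_dirs, quota=1):
    """Run at most `quota` workers at once; quota=1 serializes the work,
    which keeps peak CPU/memory use low on constrained servers."""
    running = []
    for d in subject_dirs:
        while len([p for p in running if p.is_alive()]) >= quota:
            time.sleep(0.1)  # wait for a slot to free up
        running = [p for p in running if p.is_alive()]
        p = mp.Process(target=preprocess_subject, args=(d,))
        p.start()
        running.append(p)
    for p in running:
        p.join()  # wait for the remaining workers

if __name__ == "__main__":
    run_with_quota(["subject_01", "subject_02", "subject_03"], quota=1)
```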

@yahskapar yahskapar self-assigned this Dec 23, 2024
@looperalt
Author

Yes, a progress bar appears during preprocessing, but when it reaches the end, this error occurs. Strangely, when I run the code locally on Windows there is no problem; the issue only arises when I run it on a Linux remote server. I have followed the process outlined in the README file without any operational errors, and I am puzzled as to why the same code gives different results on different systems. Have you encountered this issue, and how should the parameters be adjusted when running this code on a Linux remote server?

@yahskapar
Collaborator

yahskapar commented Dec 23, 2024

@looperalt,

If you can run it locally on a Windows machine but not on a Linux server, are you sure nothing is going wrong with how your dataset (whether the raw dataset or the preprocessed folder) is being pointed to in the Linux case? If the Linux remote server is under heavy CPU load, that could also lead to stuck or dead processes that prevent successful preprocessing, but if you use top and similar commands (a quick load check is also sketched below), you may well find that is unlikely and that you are fine in that regard.

Feel free to share the config you're trying to run here, perhaps there is something up with the file paths that I can identify.
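
A quick, dependency-free way to take the kind of load snapshot that top would show (Linux/macOS only; purely illustrative):

```python
import os

# Compare the 1/5/15-minute load averages against the CPU count.
load1, load5, load15 = os.getloadavg()
ncpu = os.cpu_count() or 1
print(f"load averages: {load1:.2f} / {load5:.2f} / {load15:.2f} across {ncpu} CPUs")
if load1 > ncpu:
    print("CPU looks oversubscribed; preprocessing workers may be starved or killed.")
```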

@looperalt
Author

When running on the Linux server, the progress bar advances normally, but once it fills, the program prints ValueError: ('train', 'No files in file list') and exits, which suggests the dataset was only read, not preprocessed and saved. That makes me think the dataset path is not the issue and that it might instead be a problem caused by processes being killed for memory. Can this be resolved? I don't know if you have experience dealing with such issues. I will try to share my configuration file with you tomorrow. Thank you for your answer; I am very grateful.

@yahskapar
Collaborator

If it was read but not preprocessed or saved, again, try two things: 1) lower multi_process_quota (perhaps to 1 to begin with) and see if that helps, and 2) check your preprocessed data save path itself and make sure you actually have permission to write to it (a quick permission check is sketched at the end of this comment).

If 1) does not make a difference at all, let me know, and we can dig into other things that may be specific to your situation and causing issues. I should note, the majority of toolbox users (i.e., hundreds of people, myself included) use Linux and the default multi-process setting without any issue, so troubleshooting with respect to your particular remote server is the way to go.
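
A minimal sketch of the write-permission check mentioned in point 2); the path below is a placeholder, so substitute your actual CACHED_PATH:

```python
import os
import tempfile

cached_path = "/path/to/your/CACHED_PATH"  # placeholder, not a real toolbox default

try:
    os.makedirs(cached_path, exist_ok=True)
    # Creating and writing a throwaway file shows the directory is writable.
    with tempfile.NamedTemporaryFile(dir=cached_path, suffix=".txt") as f:
        f.write(b"write test")
    print(f"OK: can create and write files in {cached_path}")
except OSError as e:
    print(f"Cannot write to {cached_path}: {e}")
```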

@looperalt
Author

I have tried the methods you suggested, repeatedly adjusting the multi-process parameters, but unfortunately the error still persists.
```
Preprocessing dataset...
100%|██████████| 4/4 [04:46<00:00, 71.68s/it]
Traceback (most recent call last):
  File "main.py", line 177, in <module>
    train_data_loader = train_loader(
  File "/03/Datasets/rppgt/dataset/data_loader/iBVPLoader.py", line 50, in __init__
    super().__init__(name, data_path, config_data)
  File "/03/Datasets/rppgt/dataset/data_loader/BaseLoader.py", line 68, in __init__
    self.preprocess_dataset(self.raw_data_dirs, config_data.PREPROCESS, config_data.BEGIN, config_data.END)
  File "/03/Datasets/rppgt/dataset/data_loader/BaseLoader.py", line 209, in preprocess_dataset
    self.build_file_list(file_list_dict)  # build file list
  File "/03/Datasets/rppgt/dataset/data_loader/BaseLoader.py", line 531, in build_file_list
    raise ValueError(self.dataset_name, 'No files in file list')
ValueError: ('train', 'No files in file list')
```

@yahskapar
Collaborator

OK, so we've ruled out too many multiprocessing workers as the issue.

Here are a few more things to try:

  1. The preprocessed dataset path, CACHED_PATH - can you actually write to it on the remote server? Are you able to make files in it (e.g., touch example.txt) and modify those files? It may help to share your config file at this point, just to check for anything subtle (e.g., anything weird in the cached file path itself).

  2. Assuming you're trying to use the iBVP dataset based on the error message you pasted, add a simple print statement on the line here, with the same indentation level as the else statement above that line. Print some debug details about the output of read_video() - for example, you can use np.shape as a quick check that at least the video read is being done successfully (a sketch of such a print appears after this list). Are sane results returned (e.g., the expected number of frames with the expected dimensions)?

  3. Can you also try re-running the repo setup instructions? This should delete your existing conda environment and make a new one. I would carefully inspect the setup outputs and make sure no errors appeared - depending on the error, there could be a subtle effect on preprocessing with respect to things like face detection.
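
A sketch of the debug print suggested in item 2; the exact call site inside iBVPLoader.py is assumed here, so adapt the placement to your copy of the code:

```python
import numpy as np

def debug_read_video_output(frames):
    """Sanity-check whatever read_video() returned."""
    arr = np.asarray(frames)
    print(f"[debug] read_video output: shape={arr.shape}, dtype={arr.dtype}")
    return frames

# Dummy data standing in for real frames (e.g., 300 small 4-channel frames):
debug_read_video_output(np.zeros((300, 64, 64, 4), dtype=np.float32))
```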

All the best,

Akshay

@looperalt
Author

I have confirmed that my preprocessing folder has write permissions. In fact, there is something very strange: when I first tried to run the code, .npy files appeared in the preprocessing folder, but when I closed the program and ran it again, the .npy files never appeared again. This is a very confusing situation for me.
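
One way to see exactly what is (or isn't) landing in the preprocessed folder between runs; the path is a placeholder, and the file-list naming used by the toolbox may differ:

```python
import glob
import os

cached_path = "/path/to/your/CACHED_PATH"  # placeholder

npy_files = glob.glob(os.path.join(cached_path, "**", "*.npy"), recursive=True)
csv_files = glob.glob(os.path.join(cached_path, "**", "*.csv"), recursive=True)
print(f"{len(npy_files)} .npy files and {len(csv_files)} .csv file lists under {cached_path}")
```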

@yahskapar
Collaborator

I would troubleshoot that a bit more; it does sound strange, and it's hard for me to tell what the issue might be since it sounds quite specific to your remote server / environment. Is it safe to assume, using df -h or similar, that you aren't somehow running out of storage space on your remote server?
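
A Python equivalent of that df -h check, pointed at wherever the preprocessed data is written (placeholder path):

```python
import shutil

usage = shutil.disk_usage("/path/to/your/CACHED_PATH")  # placeholder path
gib = 1024 ** 3
print(f"total={usage.total / gib:.1f} GiB, "
      f"used={usage.used / gib:.1f} GiB, "
      f"free={usage.free / gib:.1f} GiB")
```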
