Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first pass of gpu smoke test #281

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

RobJY
Copy link
Contributor

@RobJY RobJY commented Nov 5, 2021

Script collects version of driver, toolkit and tensorflow in a way that doesn't require communication with the card

Script collects version of driver, toolkit and tensorflow in a way that doesn't require communication with the card
@RobJY
Copy link
Contributor Author

RobJY commented Nov 5, 2021

Starting this PR to start the discussion about another attempt at building a script for a build level smoke test for gpu functionality. The committed script test-config-cuda.sh collects the driver, toolkit and tensorflow versions in a way that shouldn't rely on communication with the gpu card, so should be available during the build.

We could then have a JSON file of tested version combinations that this script would compare against and raise a warning if the current combination isn't on the list. This would again be run by config_R_cuda.sh like our last attempt did.

It adds the maintenance burden of keeping up the JSON file, but may help to catch issues during the build.

I'll also work on another PR that will have an Ansible playbook to spin up an EC2 gpu instance, run the stand alone gpu test script in tests/gpu/misc/test-gpu.sh, and then shut down the EC2 instance. This could be used to test new conbinations and populate the above mentioned JSON file. Ideally the playbook would be triggered by a manual GitHub action.

How does this sound?

Copy link
Member

@eitsupi eitsupi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created the scripts/tests directory before, do you think test-config-cuda.sh would be better placed under the directory?

And, I was wondering about a few details about scripts, so I commented on them. If they are inappropriate, please ignore them.

scripts/test-config-cuda.sh Outdated Show resolved Hide resolved
echo $VERSION_TOOLKIT

# tensorflow
if ! VERSION_TF_OUTPUT=`python -c 'import tensorflow as tf; print(tf.__version__)' 2>&1`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$(...) may be preferable to `...`. https://github.com/koalaman/shellcheck/wiki/SC2006

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eitsupi! I made the change as you suggested in my latest commit. I didn't realize backticks were deprecated. I'll try to use parentheses going forward.

@@ -0,0 +1,36 @@

## Tensorflow:
install.packages('keras', repos='http://cran.us.r-project.org')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to use a US CRAN mirror?

CRAN=${CRAN:-https://cran.r-project.org}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobJY is this to trigger source installation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we need this line at all since tensorflow is installed in the containers by scripts/install_tensorflow.sh?
When I attempt to run exampls_tf.R without it though I get the following error:

Error in library(keras) : there is no package called ‘keras’
Execution halted

Running library(tensorflow) in R gives a similar error message.

Running scripts/test-config-cuda.sh reports the correct tensorflow version, but it's checking the version from Python with python -c 'import tensorflow as tf; print(tf.__version__)'.

Do I need to add the path where tensorflow is installed by scripts/install_tensorflow.sh somewhere so R sees it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I wanted to point out is that the repos argument may simply be unnecessary.

Copy link
Contributor Author

@RobJY RobJY Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @eitsupi, you're right. It worked fine when I removed repos. Thanks! I've committed that change.

It still seems like it shouldn't need to install that though. I tried adding the path /opt/venv/reticulate with the following code, but I got the same error:

old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old_path, "/opt/venv/reticulate", sep = ":"))

Is it fine to install keras here or does the fact that I need to install it indicate that there's an issue with the tensorflow install?

@RobJY
Copy link
Contributor Author

RobJY commented Mar 10, 2022

I have a question that may have gotten lost in a comment above.

Is it ok to install keras here (scripts/tests/examples_tf.R) or does the fact that I need to install it indicate that there's an issue with the tensorflow install? It seems like keras should already be installed, but I get an error if I just run library(keras) without installing first.

@cboettig
Copy link
Member

I think it makes sense to install keras here, as the python package are not installed in the base image (see install_tensorflow.sh)

I've tried to move away from the idea of a 'global' venv with python modules pre-installed, since in my experience different users / different projects often want their own versions of all the modules.

I think we need to remove the multi-user venv setup created by setting WORKON_HOME, https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_python.sh#L7, since this too easily leads to permission conflicts with mutliple users installing/upgrading packages in a shared environment. I think it would be okay for us to create a default environment in the default user's home dir (i.e. ~/rstudio) instead, but using /opt/venv was a bad idea. R is happy to draw packages across multiple local libraries; e.g. a user can access packages already installed in R_HOME with read-only privileges while adding new/upgraded packages to their home dir, ~/R, but as this doesn't work with python venv, there seems less point in having the shared library pre-installed...

@RobJY
Copy link
Contributor Author

RobJY commented Mar 11, 2022

Oh, I see. That makes sense to me now. Thank you for explaining @cboettig!

I think this PR is ready to go. Please let me know if you'd like any other changes.

@cboettig
Copy link
Member

@noamross can you give this another look over too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants