Revise testing infrastructure to decrease spurious failures #759

Open
wants to merge 18 commits into
base: master
from

Conversation

cboettig
Member

@cboettig cboettig commented Jan 31, 2024

The testing infrastructure fails to successfully test any cuda images due to disk space limitations. This separates out the testing of cuda images from other scripts.

Tests involving rstudio-daily are also failing continuously due to server issues with the downloads.

Both of these create conditions that guarantee test failure, making it impossible for PRs to satisfy all checks. This either prevents PR contributions or means that PR contributions are merged with some failing tests, neither of which is a good solution.

@cboettig
Member Author

@eitsupi I think all the tests in https://github.com/rocker-org/rocker-versioned2/actions/runs/7721419467/workflow?pr=759 run on the same runner? Public repo runners for Linux have 150 GB. It looks like almost any small change triggers the full matrix of tests, and at the moment that full matrix just can't run on the runner because there isn't enough space, e.g., in this initial test PR, which shows 6 of the 38 tests failing. They fail for various reasons, though none are related to the change in the PR; they are either network issues or lack of space.

I think your plans in #755 to better leverage cache in the redesign build infrastructure will greatly improve this situation. But meanwhile I think it's difficult to make a meaningful PR against the repo that won't hit failed tests for unrelated issues, especially the disk space error.

I will try out some options here for a potentially slimmer test matrix as a workaround...

@eitsupi
Member

eitsupi commented Jan 31, 2024

As I commented elsewhere, the capacity issue should be resolved by removing unnecessary software.
For example:

- name: Clean up
  run: |
    docker image prune --all --force

I think this is caused by a change made a while ago that increased the size of the rocker/cuda image by several gigabytes.

I think reducing tests and merging incorrect changes is just as bad an idea as ignoring and merging tests that randomly fail.

@cboettig
Member Author

@eitsupi Thanks for your help here. To be clear, we are entirely on the same page about not reducing testing and not merging incorrect changes. Having unrelated tests fail for unrelated reasons does not reduce incorrect changes. The current testing does not cover most cases anyway, and I'd like to actually add more tests to get better coverage, not less. As I said at the top of this, I'm not seeking to remove tests overall, I'm testing the removal of tests here to try and get a handle on disk use so I can add tests. We are on the same objective here.

I cannot currently fix things that are broken and have been broken for a long time and have never been covered by our tests while PRs are throwing errors that are entirely unrelated to those changes and are also not reflective of problems either in the existing stack or the proposed changes.

I don't understand the solution you are proposing -- that we shrink the size of the images themselves or that we free space from other software on the host runner? You suggested removing arrow libraries I think? As I noted in #756, adding support for tensorflow and torch libraries adds 4 - 5 GB each. If you are aware of unnecessary software that could free enough extra space to test ML libraries then a PR would be awesome. The test images should have 150 GB of disk. We should be able to run tests on the 13 GB base cuda image. While the matrix setup is nice, perhaps it would avoid these issues to have tests handled on different runners?

The current test design is not compatible with testing the large images involved in the machine learning stack. I suggest we move that testing to a separate runner. We may want it on a separate runner anyway so we have the option of self-hosted runner setup to run these images on GPU.
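
For illustration, a job aimed at a self-hosted GPU machine might look roughly like the sketch below. The job name `cuda-gpu` and the `gpu` runner label are hypothetical, and the check assumes the host has an NVIDIA driver and container toolkit installed so that `--gpus all` works; none of this is taken from the PR itself.

```yaml
jobs:
  cuda-gpu:
    # "gpu" is a hypothetical custom label assigned when the self-hosted runner is registered
    runs-on: [self-hosted, linux, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Check that the container can see the GPU
        # Assumes the host exposes GPUs to Docker; rocker/cuda is the image named in this thread
        run: docker run --rm --gpus all rocker/cuda nvidia-smi
```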

@eitsupi
Member

eitsupi commented Jan 31, 2024

Sorry I didn't explain better, but my point was that we can free up space by deleting unnecessary stuff on the runner, which is what Apache Arrow's main repository does thoroughly, and I think the script below does just that.
https://github.com/apache/arrow/blob/787afa1594586d2d556d21471647f9cd2c55b18f/ci/scripts/util_free_space.sh
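
As a rough sketch of that idea, a workflow step could delete preinstalled toolchains before the build starts. The directories listed here are assumptions based on what hosted Ubuntu runners commonly ship with; they are not copied from the Arrow script, which should be consulted for the real list.

```yaml
# Hypothetical step; the paths are assumptions, not taken from util_free_space.sh
- name: Free disk space on the runner
  run: |
    # Report free space before cleanup
    df -h /
    # Preinstalled toolchains that a Docker build test is unlikely to need
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
    # Drop cached images as well, as in the "Clean up" step above
    docker image prune --all --force
    # Report free space after cleanup
    df -h /
```

Printing `df -h /` before and after makes it easy to see in the job log how much space the cleanup actually recovered.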

If I remember correctly, all testing is done on a separate runner, so the reason why the test for the ML image fails is simply because that image is too huge.

daily is already being built daily, and it has been failing for a week anyway. It should not be tested on PRs that only touch unconnected parts of the stack.
@cboettig
Member Author

cboettig commented Feb 1, 2024

In the above edits, I have moved cuda images out of the tests/rocker_scripts/matrix.json, since that runner simply does not have enough space to test anything involving the now 13 GB cuda base image, let alone the potentially larger derivative images.

I've moved cuda into a separate workflow. I initially structured this off the same design as the rocker_scripts test, but even with a single test there it doesn't have enough space. I don't really understand why -- maybe buildkit is using additional space? I rewrote the action as a concise, simple docker build test, which runs just fine. Anyway, I think this is the correct direction to go in -- after all, as I mentioned, I'd like to consider actually testing cuda images on GPU machines with self-hosted runners.
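
A minimal sketch of what such a simple docker build test could look like is below. It is not the workflow from this PR: the workflow name is hypothetical, and using the repository root as the build context is an assumption; only the path `tests/ml-test.Dockerfile` comes from the review thread.

```yaml
name: cuda-test   # hypothetical name
on:
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the ML test image
        # Build context "." (repository root) is an assumption
        run: docker build -f tests/ml-test.Dockerfile .
```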

@eitsupi It would be great if you'd like to add those image-size-reducing changes from arrow. I'm reluctant to paste them in here, as it adds complexity to the build system and I don't really understand what it is doing. (For instance, I don't see how it can really free 20 GB from every ubuntu 22.04 runner when, according to the GitHub docs, runners on private repos don't even have 20 GB SSDs to begin with...)

Anyway, I'd like to move forward on this so we can get testing unstuck for #756 and so we can start providing the main ML frameworks on our ML-tagged images.

@cboettig cboettig marked this pull request as ready for review February 1, 2024 01:42
Member

@eitsupi eitsupi left a comment

Thanks for looking into this.

Please update the PR title and description.

Member

Is there any reason why we should keep this in a separate yml file?
Is splitting another job not enough?

Member Author

Sorry, not entirely sure I follow. I think having a separate script is necessary to run on another runner? I find this architectural design easier to understand and maintain -- it's fewer pieces and distinct parts are isolated in different files. By splitting another job you mean doing this through the matrix.json file? I did look at the design of a matrix.json for this (in a separate yaml file, though perhaps could be merged): https://github.com/rocker-org/rocker-versioned2/pull/759/files#diff-454bb856f3821991ff2015ec5ba81c69df2ae3b5a476e30104d697c420b35093, and it hits the same issue with space.

Member

My suggestion is simply that it is possible to move the job to scripts-test.yml.

I don't know if they're sharing capacity on a per-job basis or on a per-workflow basis, but I think it's unlikely that they're sharing it across workflows.
So you probably don't need to split the workflow for a single job here.
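
To illustrate, on GitHub-hosted runners each job starts on a fresh virtual machine, so two jobs in one workflow file do not compete for the same disk. In the sketch below the job names and the placeholder `echo` step are hypothetical; only `tests/ml-test.Dockerfile` comes from this PR.

```yaml
jobs:
  scripts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder for the existing matrix-driven script tests
      - run: echo "existing script tests"
  cuda:
    # Runs on its own runner instance, separate from the "scripts" job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -f tests/ml-test.Dockerfile .
```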

Member Author

oh sorry to be dense, I think I understand. I've always tended to write actions with basically one job per yaml, unless there were two dependent jobs (e.g. like generate_matrix and build in scripts-test.yml). I think I do the same thing with, e.g., writing R functions in many different files in R/. I see the cuda tests as a bit of a work in progress, while for the moment scripts-test.yml has been pretty stable, so conceptually, to me anyway, this provides a bit more separation between the component I'm still intending to fiddle with and the component that works. Why have all the jobs in the same yaml file?

.github/workflows/cuda-test.yml (outdated)
Comment on lines 2 to 7
on:
  workflow_dispatch: null
  push:
    paths:
      - tests/ml-test.Dockerfile
      - .github/workflows/cuda-test.yml
Member

Please reconsider.

Member Author

walk me through your thinking? if we change the workflow or the test, we should run the workflow, right? or is the former implicit anyway?

Member

Like this:

on:
  pull_request:
    branches:
      - master
    paths:
      - tests/rocker_scripts/Dockerfile
      - tests/rocker_scripts/matrix.json
      - tests/rocker_scripts/test.sh
      - scripts/*.sh
      - "!scripts/install_R_*.sh"
      - "!scripts/setup_R.sh"
  workflow_dispatch:

  • I don't think push should be used without specifying a tag or branch.
  • The actual target we want to test is the PRs, not pushes.
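
Applied to the cuda workflow, those two points suggest a trigger block along the following lines. This is a sketch that reuses the two paths from the reviewed snippet; it is not necessarily the exact block that was committed.

```yaml
on:
  pull_request:
    branches:
      - master
    paths:
      - tests/ml-test.Dockerfile
      - .github/workflows/cuda-test.yml
  # Still allow manual runs from the Actions tab
  workflow_dispatch:
```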

Member Author

oh thanks for explaining, got it! done now.

jobs:
  build:
    runs-on: ubuntu-latest
    permissions: write-all
Member

write-all is needed here?

Member Author

good question, let's try without that.

Member

@eitsupi eitsupi Feb 1, 2024

Sorry. What I was trying to say was that read permission would be sufficient.

Member Author

But I don't think this test is ever using the GITHUB_TOKEN here anyway? I think I had left write-all in by mistake from a previous action that was also pushing to github (and even then it should plausibly have a more scoped write permission). I think the test is happy without this.

Member

@eitsupi eitsupi Feb 1, 2024

Since this repository is old, I assume write-all privileges are granted by default, but if you follow best practices, you should grant only minimal privileges.
Since we are only reading files here, I expect read permission to be sufficient.
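
One way to express a minimal grant is to scope the token to reading repository contents only, which is what `actions/checkout` needs. The job layout below mirrors the reviewed snippet; treating `contents: read` as sufficient for this particular test is an assumption.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    # Narrower than read-all: only repository contents can be read
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v4
```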

Member Author

oh good to know! I didn't realize they did legacy privileges but of course that makes sense or actions would break. I've gone with read-all now.

tests/ml-test.Dockerfile (outdated)
@@ -33,7 +32,7 @@
 "base_image": "rocker/r-ver",
 "tag": "devel",
 "script_name": "install_rstudio.sh",
-"script_arg": "daily"
+"script_arg": "latest"
Member

Seems unrelated change.

Member Author

As you know, daily has been failing for a week due to the RStudio server throwing a 500 error on that download. Because we try to build that image daily in the cron tasks anyway, I think it's less important to also test it here, especially when it means that the current situation ends up blocking the ability to fix anything. Maybe you can provide your preferred solution?

Member

Oh, I'm sorry. I didn't know that the daily build was failing. Indeed, as you say, daily is built every day, so we don't need to test it here.

I'm sure that "latest" was tested elsewhere (because latest is the default), so I think it's okay to delete the test itself.

Member Author

ok thanks. yeah sorry I get the emails every day when the daily fails (https://github.com/rocker-org/rocker-versioned2/actions/workflows/devel.yml) which happens all the time. It looked like this test was building rstudio on r-ver:devel while other tests were building it on the normal r-ver so I didn't delete it, but if you're fine dropping this test then so am I!

tests/rocker_scripts/matrix.json
@cboettig cboettig changed the title from "testing the tests" to "Revise testing infrastructure to decrease spurious failures" Feb 1, 2024