chore: remove e2e_slurm_gpu series tests #10021
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10021      +/-   ##
==========================================
- Coverage   54.59%   54.58%   -0.01%
==========================================
  Files        1259     1259
  Lines      157245   157243       -2
  Branches     3620     3620
==========================================
- Hits        85843    85831      -12
- Misses      71269    71279      +10
  Partials      133      133

Flags with carried forward coverage won't be shown.
✅ Deploy Preview for determined-ui canceled.
Force-pushed from acd6929 to ecf0aac (Compare)
@@ -47,7 +47,6 @@ def wait_for_gc_to_finish(sess: api.Session, experiment_ids: List[int]) -> None:

@pytest.mark.e2e_gpu
@pytest.mark.e2e_slurm_gpu
def test_set_gc_policy() -> None:
wut. Why was this ever a gpu test. Seriously.
Instead of demoting this test to e2e_slurm, I moved the e2e_slurm to test_delete_checkpoints, which actually tests that checkpoint_gc works.
(I almost deleted test_set_gc_policy() last week just because it doesn't check the results of setting the gc policy; it only makes sure the CLI doesn't crash(!))
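To make the move concrete, here is a minimal sketch of the resulting marks (test bodies elided; any other marks these tests already carry are omitted, and the trailing comments are my summary rather than code from the diff):

import pytest


@pytest.mark.e2e_gpu
def test_set_gc_policy() -> None:
    ...  # only checks that the CLI call completes without crashing


@pytest.mark.e2e_slurm
def test_delete_checkpoints() -> None:
    ...  # actually verifies that checkpoint_gc removes the checkpoints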
@@ -121,6 +120,7 @@ def test_gc_checkpoints_lfs() -> None:

@pytest.mark.e2e_cpu
@pytest.mark.e2e_slurm
I'm pretty open to not even testing checkpoint gc on slurm.
afaict the only way it actually fails is if bind mounts break, and it seems like we probably test that sufficiently elsewhere.
But I added the test because we don't have a good way to expose the gc failure in general, so having one explicit test seems like the right choice.
@@ -12,7 +12,6 @@

@pytest.mark.e2e_gpu
@pytest.mark.e2e_slurm_gpu
This is testing a rest API, which does not care about the resource manager.
The only thing in python code that could be tested is "can you talk to a GPU", but I think there is basically zero chance that this test fails if gpu training succeeds.
Force-pushed from ecf0aac to 2b6932f (Compare)
@pytest.mark.parallel
@pytest.mark.e2e_slurm_gpu
def test_pytorch_gradient_aggregation() -> None:
This test was written before we had parallel gpu tests. We didn't have a way to validate our gradient aggregation logic (which is ours, not just a library we call), other than an e2e test.
It never needed a slurm test.
This test has been obsoleted by multi-gpu unit testing.
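A unit test along those lines could look roughly like the following. This is a hypothetical single-process sketch of the aggregation math (the mean of per-shard gradients should equal the full-batch gradient for equal-sized shards), not Determined's actual multi-gpu unit test:

import torch


def test_mean_of_shard_grads_matches_full_batch() -> None:
    torch.manual_seed(0)
    model = torch.nn.Linear(4, 1)
    data, target = torch.randn(8, 4), torch.randn(8, 1)

    def weight_grad(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Gradient of the mean-squared-error loss w.r.t. the layer weight.
        model.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        return model.weight.grad.detach().clone()

    full = weight_grad(data, target)
    shard_grads = [weight_grad(x, y) for x, y in zip(data.chunk(2), target.chunk(2))]
    aggregated = torch.stack(shard_grads).mean(dim=0)  # what an all-reduce mean would produce

    torch.testing.assert_close(aggregated, full)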
@@ -7,14 +7,13 @@

@pytest.mark.distributed
@pytest.mark.gpu_required
@pytest.mark.e2e_slurm_gpu
I can get behind testing our pytorch2 images (so I did not delete the whole test), but I see absolutely no reason why this needs to run on slurm.
def test_pytorch2_hf_language_modeling_distributed() -> None:
    sess = api_utils.user_session()
    test_dir = "hf_language_modeling"

    config = conf.load_config(conf.hf_trainer_examples_path(f"{test_dir}/distributed.yaml"))
    config = conf.set_pt2_image(config)
    config = conf.set_slots_per_trial(config, 4)
Bring this gpu test in line with the slots per trial common to our other distributed gpu tests.
@@ -47,7 +47,6 @@ def wait_for_gc_to_finish(sess: api.Session, experiment_ids: List[int]) -> None:

@pytest.mark.e2e_gpu
does this need to be gpu even?
same for test_gc_checkpoints here
same for test_s3_no_creds (though it's being skipped, it seems)
that might be a question for @stoksc, I think maybe that's to run those tests on fewer clusters?
wouldn't e2e_cpu cause it to run on fewer clusters?
From an infra perspective, this is fine. Developers are needed for the other parts.
Not necessarily related to this PR, but I'm looking at kind of a weird mark: why are we running gpu tests if they're not required? It seems like at least some of them are what the comment says, using the mark to actually run k8s CPU tests. I guess the question is how many of these tests actually need a GPU or need to be tested on specific architectures.
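For illustration, the distinction being discussed looks roughly like this in mark terms (the first test name is a placeholder; gpu_required, distributed, and the second test name come from this diff):

import pytest


@pytest.mark.e2e_gpu  # selects which cluster's CI job collects the test...
def test_some_rest_api_behavior() -> None:
    ...               # ...even though the test itself never touches a GPU


@pytest.mark.distributed
@pytest.mark.gpu_required  # genuinely needs GPU hardware to run
def test_pytorch2_hf_language_modeling_distributed() -> None:
    ...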
Force-pushed from 2b6932f to be035b4 (Compare)
Note that there are nightly tests decorated with:
- @e2e_slurm
- skipif(not torch.cuda.is_available())

So we still have some GPU-specific slurm tests at this point. But those tests were not actually running as part of the e2e_slurm_gpu tests anyway.

This is part of a larger effort to get rid of our znode tests, which are notoriously unreliable.
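For concreteness, that decoration looks roughly like this (the test name is a placeholder; the marks and the skip condition are the ones listed above):

import pytest
import torch


@pytest.mark.e2e_slurm
@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a GPU")
def test_some_nightly_gpu_path() -> None:
    ...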
Force-pushed from be035b4 to ef348df (Compare)
🔥