
chore: remove e2e_slurm_gpu series tests #10021

Merged · 1 commit into main · Oct 10, 2024

Conversation

rb-determined-ai (Contributor):

Note that there are nightly tests decorated with:

  • @e2e_slurm
  • skipif(not torch.cuda.is_available())

So we still have some GPU-specific slurm tests at this point. But those tests were not actually running as part of the e2e_slurm_gpu tests anyway.

This is part of a larger effort to get rid of our znode tests, which are notoriously unreliable.
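
For reference, a minimal sketch of that decoration pattern, with a made-up test name and body (the marks are the ones named above; nothing else here is taken from the repo):

import pytest
import torch


# Hypothetical nightly test: collected for slurm e2e runs, but skipped
# outright on machines without a CUDA device.
@pytest.mark.e2e_slurm
@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a GPU")
def test_nightly_gpu_training_on_slurm() -> None:
    ...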


codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 54.58%. Comparing base (a0cc818) to head (ef348df).
Report is 15 commits behind head on main.

Files with missing lines | Patch % | Lines
...ess/tests/experiment/pytorch/test_pytorch_trial.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10021      +/-   ##
==========================================
- Coverage   54.59%   54.58%   -0.01%     
==========================================
  Files        1259     1259              
  Lines      157245   157243       -2     
  Branches     3620     3620              
==========================================
- Hits        85843    85831      -12     
- Misses      71269    71279      +10     
  Partials      133      133              
Flag | Coverage Δ
backend | 45.33% <ø> (-0.03%) ⬇️
harness | 72.74% <0.00%> (+<0.01%) ⬆️
web | 54.34% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...ess/tests/experiment/pytorch/test_pytorch_trial.py | 89.21% <0.00%> (+0.28%) ⬆️

... and 4 files with indirect coverage changes


netlify bot commented Oct 4, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: ef348df
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/6706adc11bd4ff0008080919

@@ -47,7 +47,6 @@ def wait_for_gc_to_finish(sess: api.Session, experiment_ids: List[int]) -> None:


@pytest.mark.e2e_gpu
@pytest.mark.e2e_slurm_gpu
def test_set_gc_policy() -> None:
rb-determined-ai (Contributor, Author):

wut. Why was this ever a gpu test. Seriously.

Instead of demoting this test to e2e_slurm, I moved the e2e_slurm mark to test_delete_checkpoints, which actually tests that checkpoint gc works.

(I almost deleted test_set_gc_policy() last week just because it doesn't check the results of setting a gc policy; it only makes sure the CLI doesn't crash(!))
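
A rough sketch of the resulting mark placement (test bodies elided; the exact set of other marks on each test is an assumption, not copied from the diff):

import pytest


# Keeps e2e_gpu, loses e2e_slurm_gpu: this only exercises the CLI.
@pytest.mark.e2e_gpu
def test_set_gc_policy() -> None:
    ...


# Gains e2e_slurm instead, since it actually verifies that checkpoint gc works.
@pytest.mark.e2e_slurm
def test_delete_checkpoints() -> None:
    ...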

@@ -121,6 +120,7 @@ def test_gc_checkpoints_lfs() -> None:


@pytest.mark.e2e_cpu
@pytest.mark.e2e_slurm
rb-determined-ai (Contributor, Author):

I'm pretty open to not even testing checkpoint gc on slurm.

afaict the only way it actually fails is if bind mounts break, and we probably test that sufficiently elsewhere.

But I added the test because we don't have a good way to expose the gc failure in general, so having one explicit test seems like the right choice.

@@ -12,7 +12,6 @@


@pytest.mark.e2e_gpu
@pytest.mark.e2e_slurm_gpu
rb-determined-ai (Contributor, Author):

This is testing a REST API, which does not care about the resource manager.

The only thing in the Python code that could be tested is "can you talk to a GPU", but I think there is basically zero chance that this test fails if GPU training succeeds.

Comment on lines -10 to -12
@pytest.mark.parallel
@pytest.mark.e2e_slurm_gpu
def test_pytorch_gradient_aggregation() -> None:
rb-determined-ai (Contributor, Author):

This test was written before we had parallel gpu tests. We didn't have a way to validate our gradient aggregation logic (which is ours, not just a library we call), other than an e2e test.

It never needed a slurm test.

This test has been obsoleted by multi-gpu unit testing.
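
As an illustration of the kind of unit-level check that replaces the e2e coverage, here is a minimal sketch (an assumption, not the project's actual unit test): averaging the gradients from two half-batches should match the gradient of the full batch.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()


def grads(batch_x: torch.Tensor, batch_y: torch.Tensor) -> list:
    model.zero_grad()
    loss_fn(model(batch_x), batch_y).backward()
    return [p.grad.clone() for p in model.parameters()]


# "Worker" gradients from two shards of the batch, averaged the way an
# all-reduce-style aggregation would combine them.
g1 = grads(data[:4], target[:4])
g2 = grads(data[4:], target[4:])
aggregated = [(a + b) / 2 for a, b in zip(g1, g2)]

# Reference gradients from the full batch in a single pass.
full = grads(data, target)
for agg, ref in zip(aggregated, full):
    assert torch.allclose(agg, ref, atol=1e-6)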

@@ -7,14 +7,13 @@

@pytest.mark.distributed
@pytest.mark.gpu_required
@pytest.mark.e2e_slurm_gpu
rb-determined-ai (Contributor, Author):

I can get behind testing our pytorch2 images (so I did not delete the whole test), but I see absolutely no reason why this needs to run on slurm.

def test_pytorch2_hf_language_modeling_distributed() -> None:
    sess = api_utils.user_session()
    test_dir = "hf_language_modeling"

    # Load the HF Trainer distributed example, switch it to the pytorch2
    # image, and use 4 slots per trial like the other distributed gpu tests.
    config = conf.load_config(conf.hf_trainer_examples_path(f"{test_dir}/distributed.yaml"))
    config = conf.set_pt2_image(config)
    config = conf.set_slots_per_trial(config, 4)
rb-determined-ai (Contributor, Author):

Bring this gpu test in line with the slots per trial common to our other distributed gpu tests.

@@ -47,7 +47,6 @@ def wait_for_gc_to_finish(sess: api.Session, experiment_ids: List[int]) -> None:


@pytest.mark.e2e_gpu
Contributor:

does this need to be gpu even?

Contributor:

same for test_gc_checkpoints here

Contributor:

same for test_s3_no_creds (though it's being skipped it seems)

rb-determined-ai (Contributor, Author):

that might be a question for @stoksc, I think maybe that's to run those tests on fewer clusters?

Contributor:

wouldn't e2e_cpu cause it to run on fewer clusters?

@dannysauer (Contributor) left a comment:

From an infra perspective, this is fine. Developers needed for the other parts.

@azhou-determined (Contributor):

not necessarily related to this PR, but i'm looking at cluster/test_logging.py specifically:
@pytest.mark.e2e_gpu # Note, e2e_gpu and not gpu_required hits k8s cpu tests.

kind of a weird mark? why are we running gpu tests if they're not required? seems like at least some of them are what the comment says, using it to actually run k8s CPU tests. since we have e2e_k8s, maybe those tests could be replaced with e2e_k8s and e2e_cpu? definitely test_k8s_init_containers, but maybe others too.

i guess the question is how many of these tests actually need GPU/need to be tested on specific architectures.
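
If that suggestion were adopted, the replacement might look something like this sketch (the current marks on test_k8s_init_containers are not shown in this PR, so this is only the suggested end state):

import pytest


# Suggested marking: run under the k8s and cpu suites rather than e2e_gpu.
@pytest.mark.e2e_k8s
@pytest.mark.e2e_cpu
def test_k8s_init_containers() -> None:
    ...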

@azhou-determined (Contributor) left a comment:

🔥

@rb-determined-ai merged commit 2594d90 into main on Oct 10, 2024
82 of 94 checks passed
@rb-determined-ai deleted the rb/rm-znode-ch1 branch on October 10, 2024 at 15:45