
chore: remove e2e_slurm_gpu series tests #10021

Merged · 1 commit into main · Oct 10, 2024

Conversation

rb-determined-ai (Contributor):

Note that there are nightly tests decorated with:

  • @e2e_slurm
  • skipif(not torch.cuda.is_available())

So we still have some GPU-specific slurm tests at this point. But those tests were not actually running as part of the e2e_slurm_gpu tests anyway.

This is part of a larger effort to get rid of our znode tests, which are notoriously unreliable.
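
For reference, a minimal sketch of that decoration pattern, with a made-up test name and body (the marks are the ones named above; nothing else here is taken from the repo):

import pytest
import torch


# Hypothetical nightly test: collected for slurm e2e runs, but skipped
# outright on machines without a CUDA device.
@pytest.mark.e2e_slurm
@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a GPU")
def test_nightly_gpu_training_on_slurm() -> None:
    ...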


codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 54.58%. Comparing base (a0cc818) to head (ef348df).
Report is 15 commits behind head on main.

Files with missing lines | Patch % | Lines
...ess/tests/experiment/pytorch/test_pytorch_trial.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10021      +/-   ##
==========================================
- Coverage   54.59%   54.58%   -0.01%     
==========================================
  Files        1259     1259              
  Lines      157245   157243       -2     
  Branches     3620     3620              
==========================================
- Hits        85843    85831      -12     
- Misses      71269    71279      +10     
  Partials      133      133              
Flag | Coverage Δ
backend | 45.33% <ø> (-0.03%) ⬇️
harness | 72.74% <0.00%> (+<0.01%) ⬆️
web | 54.34% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...ess/tests/experiment/pytorch/test_pytorch_trial.py | 89.21% <0.00%> (+0.28%) ⬆️

... and 4 files with indirect coverage changes


netlify bot commented Oct 4, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: ef348df
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/6706adc11bd4ff0008080919

@@ -47,7 +47,6 @@ def wait_for_gc_to_finish(sess: api.Session, experiment_ids: List[int]) -> None:


@pytest.mark.e2e_gpu
@pytest.mark.e2e_slurm_gpu
def test_set_gc_policy() -> None:
rb-determined-ai (Contributor, Author):

wut. Why was this ever a gpu test. Seriously.

Instead of demoting this test to e2e_slurm, I moved the e2e_slurm mark to test_delete_checkpoints, which actually tests that checkpoint gc works.

(I almost deleted test_set_gc_policy() last week just because it doesn't check the results of setting a gc policy; it only makes sure the CLI doesn't crash(!))
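
A rough sketch of the resulting mark placement (test bodies elided; the exact set of other marks on each test is an assumption, not copied from the diff):

import pytest


# Keeps e2e_gpu, loses e2e_slurm_gpu: this only exercises the CLI.
@pytest.mark.e2e_gpu
def test_set_gc_policy() -> None:
    ...


# Gains e2e_slurm instead, since it actually verifies that checkpoint gc works.
@pytest.mark.e2e_slurm
def test_delete_checkpoints() -> None:
    ...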

@@ -121,6 +120,7 @@ def test_gc_checkpoints_lfs() -> None:


@pytest.mark.e2e_cpu
@pytest.mark.e2e_slurm
rb-determined-ai (Contributor, Author):

I'm pretty open to not even testing checkpoint gc on slurm.

afaict the only way it actually fails is if bind mounts break, and we probably test that sufficiently elsewhere.

But I added the test because we don't have a good way to expose the gc failure in general, so having one explicit test seems like the right choice.

@@ -12,7 +12,6 @@


@pytest.mark.e2e_gpu
@pytest.mark.e2e_slurm_gpu
rb-determined-ai (Contributor, Author):

This is testing a REST API, which does not care about the resource manager.

The only thing in the Python code that could be tested is "can you talk to a GPU", but I think there is basically zero chance that this test fails if GPU training succeeds.

Comment on lines -10 to -12
@pytest.mark.parallel
@pytest.mark.e2e_slurm_gpu
def test_pytorch_gradient_aggregation() -> None:
rb-determined-ai (Contributor, Author):

This test was written before we had parallel gpu tests. We didn't have a way to validate our gradient aggregation logic (which is ours, not just a library we call), other than an e2e test.

It never needed a slurm test.

This test has been obsoleted by multi-gpu unit testing.
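
As an illustration of the kind of unit-level check that replaces the e2e coverage, here is a minimal sketch (an assumption, not the project's actual unit test): averaging the gradients from two half-batches should match the gradient of the full batch.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()


def grads(batch_x: torch.Tensor, batch_y: torch.Tensor) -> list:
    model.zero_grad()
    loss_fn(model(batch_x), batch_y).backward()
    return [p.grad.clone() for p in model.parameters()]


# "Worker" gradients from two shards of the batch, averaged the way an
# all-reduce-style aggregation would combine them.
g1 = grads(data[:4], target[:4])
g2 = grads(data[4:], target[4:])
aggregated = [(a + b) / 2 for a, b in zip(g1, g2)]

# Reference gradients from the full batch in a single pass.
full = grads(data, target)
for agg, ref in zip(aggregated, full):
    assert torch.allclose(agg, ref, atol=1e-6)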

@@ -7,14 +7,13 @@

@pytest.mark.distributed
@pytest.mark.gpu_required
@pytest.mark.e2e_slurm_gpu
rb-determined-ai (Contributor, Author):

I can get behind testing our pytorch2 images (so I did not delete the whole test), but I see absolutely no reason why this needs to run on slurm.

def test_pytorch2_hf_language_modeling_distributed() -> None:
    sess = api_utils.user_session()
    test_dir = "hf_language_modeling"

    # Load the HF Trainer distributed example, switch it to the pytorch2
    # image, and use 4 slots per trial like the other distributed gpu tests.
    config = conf.load_config(conf.hf_trainer_examples_path(f"{test_dir}/distributed.yaml"))
    config = conf.set_pt2_image(config)
    config = conf.set_slots_per_trial(config, 4)
rb-determined-ai (Contributor, Author):

Bring this gpu test in line with the slots per trial common to our other distributed gpu tests.

@@ -47,7 +47,6 @@ def wait_for_gc_to_finish(sess: api.Session, experiment_ids: List[int]) -> None:


@pytest.mark.e2e_gpu
Contributor:

does this need to be gpu even?

Contributor:

same for test_gc_checkpoints here

Contributor:

same for test_s3_no_creds (though it's being skipped it seems)

rb-determined-ai (Contributor, Author):

that might be a question for @stoksc, I think maybe that's to run those tests on fewer clusters?

Contributor:

wouldn't e2e_cpu cause it to run on fewer clusters?

@dannysauer (Contributor) left a comment:

From an infra perspective, this is fine. Developers needed for the other parts.

@azhou-determined (Contributor):

not necessarily related to this PR, but i'm looking at cluster/test_logging.py specifically:
@pytest.mark.e2e_gpu # Note, e2e_gpu and not gpu_required hits k8s cpu tests.

kind of a weird mark? why are we running gpu tests if they're not required? seems like at least some of them are what the comment says, using it to actually run k8s CPU tests. since we have e2e_k8s, maybe those tests could be replaced with e2e_k8s and e2e_cpu? definitely test_k8s_init_containers, but maybe others too.

i guess the question is how many of these tests actually need GPU/need to be tested on specific architectures.
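
If that suggestion were adopted, the replacement might look something like this sketch (the current marks on test_k8s_init_containers are not shown in this PR, so this is only the suggested end state):

import pytest


# Suggested marking: run under the k8s and cpu suites rather than e2e_gpu.
@pytest.mark.e2e_k8s
@pytest.mark.e2e_cpu
def test_k8s_init_containers() -> None:
    ...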

@azhou-determined (Contributor) left a comment:

🔥

@rb-determined-ai merged commit 2594d90 into main on Oct 10, 2024
82 of 94 checks passed
@rb-determined-ai deleted the rb/rm-znode-ch1 branch on October 10, 2024 at 15:45