ci: deflake test_disable_agent_zero_slots #10040

Merged
rb-determined-ai merged 1 commit into main from rb/flake
Oct 10, 2024

rb-determined-ai
Contributor

It turns out that k8s is just really slow sometimes.

Detail

Here are the logs from a flaky run. Notice:

  • It takes about 6m27s to pull the image
  • The init container takes about 14s
  • There are still about 3 seconds between when the container starts and when we see logs out of it
  • Based on the 30 second timeout, we are capturing these logs about 10 seconds after the sleep 180 has started, and nothing in the logs suggests the pod is dead

So yeah, k8s is just slow sometimes I guess.

tests/cluster/test_agent_disable.py::test_disable_agent_zero_slots starting at 2024-10-10 14:29:12
== begin task logs ==
[2024-10-10T14:29:14.845359Z]          || INFO: Scheduling Command (formally-humorous-moray) (id: a75c28dc-c993-4a85-8641-9b1a1effeb7a.1)
[2024-10-10T14:29:15.000000Z] a75c28dc || INFO: Job det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a: Created pod: det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4
[2024-10-10T14:29:15.467495Z]          || INFO: Command (formally-humorous-moray) was assigned to an agent
[2024-10-10T14:29:15.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Pod resources allocated.
[2024-10-10T14:29:15.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Pulling image "determinedai/pytorch-ngc-dev:0736b6d"
[2024-10-10T14:35:42.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Successfully pulled image "determinedai/pytorch-ngc-dev:0736b6d" in 6m26.832s (6m26.832s including waiting)
[2024-10-10T14:35:42.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Created container determined-init-container
[2024-10-10T14:35:42.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Started container determined-init-container
[2024-10-10T14:35:56.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Container image "determinedai/pytorch-ngc-dev:0736b6d" already present on machine
[2024-10-10T14:35:56.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Created container determined-container
[2024-10-10T14:35:56.000000Z] a75c28dc || INFO: Pod det-6904cfee-cmd-a75c28dc-c993-4a85-8641-9b1a1effeb7a-h6jp4: Started container determined-container
[2024-10-10T14:35:57.127835Z]          || INFO: Resources for Command (formally-humorous-moray) have started
[2024-10-10T14:36:00.367504Z] a75c28dc || DEPRECATION: devscripts 2.22.1ubuntu1 has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of devscripts or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
[2024-10-10T14:36:01.194807Z] a75c28dc || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2024-10-10T14:36:01.659255Z] a75c28dc || 
[2024-10-10T14:36:01.659362Z] a75c28dc || [notice] A new release of pip is available: 24.0 -> 24.2
[2024-10-10T14:36:01.659375Z] a75c28dc || [notice] To update, run: python3 -m pip install --upgrade pip
[2024-10-10T14:36:02.499951Z] a75c28dc || + test -f /run/determined/dynamic-tcd-startup-hook.sh
[2024-10-10T14:36:02.500041Z] a75c28dc || + source /run/determined/dynamic-tcd-startup-hook.sh
[2024-10-10T14:36:02.500059Z] a75c28dc || ++ echo hello from master tcd startup hook
[2024-10-10T14:36:02.500068Z] a75c28dc || + test -f startup-hook.sh
[2024-10-10T14:36:02.500076Z] a75c28dc || + set +x
[2024-10-10T14:36:02.500249Z] a75c28dc || hello from master tcd startup hook
Task log stream ended. To reopen log stream, run: det task logs -f a75c28dc-c993-4a85-8641-9b1a1effeb7a

== end task logs ==
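
The gaps between the key events can be checked directly from the log timestamps. Below is a minimal sketch, assuming only the bracketed timestamp format seen in the task logs; the stage names (`pull_start`, `first_output`, and so on) are labels made up for this example, not Determined or Kubernetes identifiers.

```python
from datetime import datetime

# Timestamps copied from the task logs above. The keys are hypothetical
# labels chosen for this sketch, not names used by Determined or k8s.
STAMPS = {
    "pull_start": "2024-10-10T14:29:15.000000Z",         # Pulling image
    "pull_done": "2024-10-10T14:35:42.000000Z",          # Successfully pulled image
    "init_started": "2024-10-10T14:35:42.000000Z",       # Started determined-init-container
    "main_started": "2024-10-10T14:35:56.000000Z",       # Started determined-container
    "resources_started": "2024-10-10T14:35:57.127835Z",  # Resources ... have started
    "first_output": "2024-10-10T14:36:00.367504Z",       # first line from the container
}


def parse(stamp: str) -> datetime:
    """Parse a task-log timestamp such as 2024-10-10T14:29:15.000000Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def elapsed(start: str, end: str) -> float:
    """Seconds between two labelled events in STAMPS."""
    return (parse(STAMPS[end]) - parse(STAMPS[start])).total_seconds()


if __name__ == "__main__":
    print(f"image pull:        {elapsed('pull_start', 'pull_done'):.1f}s")
    print(f"init container:    {elapsed('init_started', 'main_started'):.1f}s")
    print(f"start to first log: {elapsed('resources_started', 'first_output'):.1f}s")
```

With these timestamps the sketch reports 387 s (about 6m27s) for the image pull, 14 s for the init container, and roughly 3.2 s between the resources starting and the first line of container output.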

@rb-determined-ai rb-determined-ai requested a review from a team as a code owner October 10, 2024 21:47
@cla-bot cla-bot bot added the cla-signed label Oct 10, 2024

netlify bot commented Oct 10, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 95f6864
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/67084b6bc786aa00082bb561
😎 Deploy Preview https://deploy-preview-10040--determined-ui.netlify.app


codecov bot commented Oct 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.62%. Comparing base (2594d90) to head (95f6864).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #10040   +/-   ##
=======================================
  Coverage   54.62%   54.62%           
=======================================
  Files        1260     1260           
  Lines      157558   157558           
  Branches     3632     3631    -1     
=======================================
+ Hits        86071    86073    +2     
+ Misses      71353    71351    -2     
  Partials      134      134           
Flag Coverage Δ
backend 45.42% <ø> (+<0.01%) ⬆️
harness 72.74% <ø> (ø)
web 54.38% <ø> (ø)

Flags with carried forward coverage won't be shown.

see 3 files with indirect coverage changes

@rb-determined-ai rb-determined-ai merged commit b243c26 into main Oct 10, 2024
86 of 98 checks passed
@rb-determined-ai rb-determined-ai deleted the rb/flake branch October 10, 2024 22:30