fetching PIDs for timeout jobs for cleanup sometimes fails to kill processes #1315

Closed
ilya-da opened this issue Sep 14, 2024 · 2 comments

ilya-da commented Sep 14, 2024

Under some circumstances the Slurm epilog fails to clean up processes because of how it parses the output of nvidia-smi pmon.

From /var/log/slurm/prolog-epilog:

  • for i in $(nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v -)
  • logger -s -t slurm-epilog 'Killing residual GPU process Idx ...'
    <13>Sep 10 15:12:33 slurm-epilog: Killing residual GPU process Idx ...
  • kill -9 Idx                    <---- this is not a valid PID.
    /etc/slurm/epilog.d/50-exclusive-gpu: line 12: kill: Idx: arguments must be process or job IDs

The regular output parses fine, but if for some reason the output contains one more comment line before the process list, the loop picks up a non-PID line, because tail -n+3 only skips the first two lines:

root@hpc-hostname:~# nvidia-smi pmon -c 1
# gpu pid type sm mem enc dec command
# Idx # C/G % % % % name
0 - - - - - - -
1 - - - - - - -
2 - - - - - - -
3 - - - - - - -
4 - - - - - - -
5 - - - - - - -
6 - - - - - - -
7 - - - - - - -
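
One way to make the parsing independent of the number of header lines is to drop the `tail -n+3` offset and keep only rows whose pid column is purely numeric. The sketch below is a minimal illustration under that assumption, not the fix proposed for this issue; the helper name `gpu_pids` is hypothetical and does not come from the epilog script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the existing epilog script): reads
# `nvidia-smi pmon` output on stdin and prints only the values of the
# second column that look like PIDs. Comment lines ("# gpu ...",
# "# Idx ...") and idle rows ("0 - - ...") are skipped regardless of
# how many header lines precede the process list.
gpu_pids() {
    awk '$1 !~ /^#/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Sketch of how the epilog loop could use it (assumed, untested on a node):
#   for pid in $(nvidia-smi pmon -c 1 | gpu_pids); do
#       logger -s -t slurm-epilog "Killing residual GPU process $pid ..."
#       kill -9 "$pid"
#   done
```

With this filter a stray extra comment line can no longer leak `Idx` into the `kill` call, since anything that is not a run of digits is discarded before the loop sees it.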

ilya-da commented Sep 14, 2024

#1316 is a proposed solution.

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.
