-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
push: Avoid a race on the SSH server socket #15
Conversation
fcf681c
to
7ccffe4
Compare
Sigh, and today CI fails on Arch with |
7ccffe4
to
69acc13
Compare
69acc13
to
ced437c
Compare
It seems that in some rare cases the ssh master does not actually exits and thus with this patch ostree-push end up waiting indefinitely. I updated the patch to have a timeout and try again with |
ced437c
to
0d78c97
Compare
Sometimes the SSH server is slow to exit so the tempdir cleanup manages to see its socket file before the server deletes it, and then fails when trying to unlink it since it is already gone. Wait for the SSH server to exit to avoid the issue. And do not trust it to exit on SIGTERM to avoid getting stuck waiting indefinitely. For instance, running the testsuite on Debian Buster I frequently get this traceback: def test_refs(self, source_repo, dest_repo, sshd, ssh_options, tmp_files_path, capfd): random_commit(source_repo, tmp_files_path, 'test1') random_commit(source_repo, tmp_files_path, 'test2') self.push_refs(source_repo, dest_repo, sshd, ssh_options, capfd, refs=['test1']) self.push_refs(source_repo, dest_repo, sshd, ssh_options, capfd, > refs=['test2']) tests/test_push.py:263: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ tests/test_push.py:223: in push_refs ssh_options=ssh_options, command='dumpenv') otpush/push.py:346: in push_refs ssh.run(cmd) /usr/lib/python3.7/tempfile.py:944: in __exit__ self.cleanup() /usr/lib/python3.7/tempfile.py:948: in cleanup _rmtree(self.name) /usr/lib/python3.7/shutil.py:491: in rmtree _rmtree_safe_fd(fd, path, onerror) /usr/lib/python3.7/shutil.py:449: in _rmtree_safe_fd onerror(os.unlink, fullname, sys.exc_info()) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ topfd = 26, path = '/tmp/ostree-push-673fumwk', onerror = <function rmtree.<locals>.onerror at 0x7f569eb728c8> def _rmtree_safe_fd(topfd, path, onerror): try: with os.scandir(topfd) as scandir_it: entries = list(scandir_it) except OSError as err: err.filename = path onerror(os.scandir, path, sys.exc_info()) return for entry in entries: fullname = os.path.join(path, entry.name) try: is_dir = entry.is_dir(follow_symlinks=False) if is_dir: orig_st = entry.stat(follow_symlinks=False) is_dir = stat.S_ISDIR(orig_st.st_mode) except OSError: is_dir = False if is_dir: try: dirfd = os.open(entry.name, os.O_RDONLY, dir_fd=topfd) except OSError: onerror(os.open, fullname, sys.exc_info()) else: try: if os.path.samestat(orig_st, os.fstat(dirfd)): _rmtree_safe_fd(dirfd, fullname, onerror) try: os.rmdir(entry.name, dir_fd=topfd) except OSError: onerror(os.rmdir, fullname, sys.exc_info()) else: try: # This can only happen if someone replaces # a directory with a symlink after the call to # os.scandir or stat.S_ISDIR above. raise OSError("Cannot call rmtree on a symbolic " "link") except OSError: onerror(os.path.islink, fullname, sys.exc_info()) finally: os.close(dirfd) else: try: > os.unlink(entry.name, dir_fd=topfd) E FileNotFoundError: [Errno 2] No such file or directory: 'socket' /usr/lib/python3.7/shutil.py:447: FileNotFoundError Signed-off-by: Emanuele Aina <[email protected]>
0d78c97
to
45f359a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I think I was under the impression that terminate
would wait for the process to exit.
Sorry to leave you hanging on this. I'll make the small s/debug/error/
change and merge.
self.master_proc.kill() | ||
if self.wait_for_exit(wait_timeout): | ||
return | ||
logger.debug('Failed to stop the SSH master process %d', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be an error
level message. If the SSH server can't be killed with SIGKILL
and is therefore a zombie, I think you'd want to see that message loudly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't push changes to your branch, so can you make the logger.error
change? I'll just push the change manually in a day or so if you don't get to it.
I put the change into #18 since I couldn't change it on your branch. Closing this. |
Sometimes the SSH server is slow to exit so the tempdir cleanup manages to see its socket file before the server deletes it, and then fails when trying to unlink it since it is already gone.
Wait for the SSH server to exit to avoid the issue.
For instance, running the testsuite on Debian Buster I frequently get this traceback: