Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

push: Avoid a race on the SSH server socket #15

Closed
wants to merge 1 commit into from

Conversation

em-
Copy link
Contributor

@em- em- commented Jul 14, 2023

Sometimes the SSH server is slow to exit so the tempdir cleanup manages to see its socket file before the server deletes it, and then fails when trying to unlink it since it is already gone.

Wait for the SSH server to exit to avoid the issue.

For instance, running the testsuite on Debian Buster I frequently get this traceback:

  def test_refs(self, source_repo, dest_repo, sshd, ssh_options,
                tmp_files_path, capfd):
      random_commit(source_repo, tmp_files_path, 'test1')
      random_commit(source_repo, tmp_files_path, 'test2')
      self.push_refs(source_repo, dest_repo, sshd, ssh_options, capfd,
                     refs=['test1'])
      self.push_refs(source_repo, dest_repo, sshd, ssh_options, capfd,
>                    refs=['test2'])

tests/test_push.py:263:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_push.py:223: in push_refs
    ssh_options=ssh_options, command='dumpenv')
otpush/push.py:346: in push_refs
    ssh.run(cmd)
/usr/lib/python3.7/tempfile.py:944: in __exit__
    self.cleanup()
/usr/lib/python3.7/tempfile.py:948: in cleanup
    _rmtree(self.name)
/usr/lib/python3.7/shutil.py:491: in rmtree
    _rmtree_safe_fd(fd, path, onerror)
/usr/lib/python3.7/shutil.py:449: in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

topfd = 26, path = '/tmp/ostree-push-673fumwk', onerror = <function rmtree.<locals>.onerror at 0x7f569eb728c8>

    def _rmtree_safe_fd(topfd, path, onerror):
        try:
            with os.scandir(topfd) as scandir_it:
                entries = list(scandir_it)
        except OSError as err:
            err.filename = path
            onerror(os.scandir, path, sys.exc_info())
            return
        for entry in entries:
            fullname = os.path.join(path, entry.name)
            try:
                is_dir = entry.is_dir(follow_symlinks=False)
                if is_dir:
                    orig_st = entry.stat(follow_symlinks=False)
                    is_dir = stat.S_ISDIR(orig_st.st_mode)
            except OSError:
                is_dir = False
            if is_dir:
                try:
                    dirfd = os.open(entry.name, os.O_RDONLY, dir_fd=topfd)
                except OSError:
                    onerror(os.open, fullname, sys.exc_info())
                else:
                    try:
                        if os.path.samestat(orig_st, os.fstat(dirfd)):
                            _rmtree_safe_fd(dirfd, fullname, onerror)
                            try:
                                os.rmdir(entry.name, dir_fd=topfd)
                            except OSError:
                                onerror(os.rmdir, fullname, sys.exc_info())
                        else:
                            try:
                                # This can only happen if someone replaces
                                # a directory with a symlink after the call to
                                # os.scandir or stat.S_ISDIR above.
                                raise OSError("Cannot call rmtree on a symbolic "
                                              "link")
                            except OSError:
                                onerror(os.path.islink, fullname, sys.exc_info())
                    finally:
                        os.close(dirfd)
            else:
                try:
>                   os.unlink(entry.name, dir_fd=topfd)
E                   FileNotFoundError: [Errno 2] No such file or directory: 'socket'

/usr/lib/python3.7/shutil.py:447: FileNotFoundError

@em-
Copy link
Contributor Author

em- commented Sep 11, 2023

Sigh, and today CI fails on Arch with Failed to build PyGObject. :(

@em-
Copy link
Contributor Author

em- commented Oct 18, 2023

It seems that in some rare cases the ssh master does not actually exits and thus with this patch ostree-push end up waiting indefinitely. I updated the patch to have a timeout and try again with SIGKILL before going on.

Sometimes the SSH server is slow to exit so the tempdir cleanup manages
to see its socket file before the server deletes it, and then fails when
trying to unlink it since it is already gone.

Wait for the SSH server to exit to avoid the issue. And do not trust it
to exit on SIGTERM to avoid getting stuck waiting indefinitely.

For instance, running the testsuite on Debian Buster I frequently get
this traceback:

      def test_refs(self, source_repo, dest_repo, sshd, ssh_options,
                    tmp_files_path, capfd):
          random_commit(source_repo, tmp_files_path, 'test1')
          random_commit(source_repo, tmp_files_path, 'test2')
          self.push_refs(source_repo, dest_repo, sshd, ssh_options, capfd,
                         refs=['test1'])
          self.push_refs(source_repo, dest_repo, sshd, ssh_options, capfd,
    >                    refs=['test2'])

    tests/test_push.py:263:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    tests/test_push.py:223: in push_refs
        ssh_options=ssh_options, command='dumpenv')
    otpush/push.py:346: in push_refs
        ssh.run(cmd)
    /usr/lib/python3.7/tempfile.py:944: in __exit__
        self.cleanup()
    /usr/lib/python3.7/tempfile.py:948: in cleanup
        _rmtree(self.name)
    /usr/lib/python3.7/shutil.py:491: in rmtree
        _rmtree_safe_fd(fd, path, onerror)
    /usr/lib/python3.7/shutil.py:449: in _rmtree_safe_fd
        onerror(os.unlink, fullname, sys.exc_info())
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    topfd = 26, path = '/tmp/ostree-push-673fumwk', onerror = <function rmtree.<locals>.onerror at 0x7f569eb728c8>

        def _rmtree_safe_fd(topfd, path, onerror):
            try:
                with os.scandir(topfd) as scandir_it:
                    entries = list(scandir_it)
            except OSError as err:
                err.filename = path
                onerror(os.scandir, path, sys.exc_info())
                return
            for entry in entries:
                fullname = os.path.join(path, entry.name)
                try:
                    is_dir = entry.is_dir(follow_symlinks=False)
                    if is_dir:
                        orig_st = entry.stat(follow_symlinks=False)
                        is_dir = stat.S_ISDIR(orig_st.st_mode)
                except OSError:
                    is_dir = False
                if is_dir:
                    try:
                        dirfd = os.open(entry.name, os.O_RDONLY, dir_fd=topfd)
                    except OSError:
                        onerror(os.open, fullname, sys.exc_info())
                    else:
                        try:
                            if os.path.samestat(orig_st, os.fstat(dirfd)):
                                _rmtree_safe_fd(dirfd, fullname, onerror)
                                try:
                                    os.rmdir(entry.name, dir_fd=topfd)
                                except OSError:
                                    onerror(os.rmdir, fullname, sys.exc_info())
                            else:
                                try:
                                    # This can only happen if someone replaces
                                    # a directory with a symlink after the call to
                                    # os.scandir or stat.S_ISDIR above.
                                    raise OSError("Cannot call rmtree on a symbolic "
                                                  "link")
                                except OSError:
                                    onerror(os.path.islink, fullname, sys.exc_info())
                        finally:
                            os.close(dirfd)
                else:
                    try:
    >                   os.unlink(entry.name, dir_fd=topfd)
    E                   FileNotFoundError: [Errno 2] No such file or directory: 'socket'

    /usr/lib/python3.7/shutil.py:447: FileNotFoundError

Signed-off-by: Emanuele Aina <[email protected]>
Copy link
Owner

@dbnicholson dbnicholson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I think I was under the impression that terminate would wait for the process to exit.

Sorry to leave you hanging on this. I'll make the small s/debug/error/ change and merge.

self.master_proc.kill()
if self.wait_for_exit(wait_timeout):
return
logger.debug('Failed to stop the SSH master process %d',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be an error level message. If the SSH server can't be killed with SIGKILL and is therefore a zombie, I think you'd want to see that message loudly.

Copy link
Owner

@dbnicholson dbnicholson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't push changes to your branch, so can you make the logger.error change? I'll just push the change manually in a day or so if you don't get to it.

@dbnicholson
Copy link
Owner

I put the change into #18 since I couldn't change it on your branch. Closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants