Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd-target-api stuck, tcmu-runner crash / iSCSI Login negotiation failed #267

Open
discostur opened this issue Sep 26, 2022 · 0 comments
Open

Comments

@discostur
Copy link

Hi,

from time to time i see that my docker tcmu-runner container is exiting and restarting. After that, i get lots of

Sep 25 12:39:52 ceph-osd02 kernel: Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
Sep 25 12:39:52 ceph-osd02 kernel: iSCSI Login negotiation failed.
Sep 25 12:39:52 ceph-osd02 kernel: Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
Sep 25 12:39:52 ceph-osd02 kernel: iSCSI Login negotiation failed.

errors on the failed tcmu-runner node. So the tcmu-runner container has restarted, is running but clients are not able to login via iscsi. The other 3x tcmu-runner nodes are running like expected. On the failed node i'm not able to restart rbd-target-api because it seems to be stuck. The only thing i can do is to reboot the whole node.

I checked all logs and it seems to occur from a slow OSD (deep scrub was running at the moment the trouble begun):

Sep 25 12:39:42 ceph-mon01 journal: cluster 2022-09-25T12:39:41.500225+0200 osd.39 (osd.39) 4190 : cluster [WRN] 4 slow requests (by type [ 'delayed' : 4 ] most affected pool [ 'ovirt' : 4 ])
Sep 25 12:39:42 ceph-mon01 docker: cluster 2022-09-25T12:39:41.250374+0200 mgr.ceph-mon01 (mgr.40357165) 8331237 : cluster [DBG] pgmap v5468170: 2370 pgs: 1 active+clean+scrubbing, 2 active+clean+laggy, 4 active+clean+scrubbing+deep, 2363 active+clean; 14 TiB data, 45 TiB used, 110 TiB / 155 TiB avail; 2.6 MiB/s rd, 10 MiB/s wr, 883 op/s
Sep 25 12:39:42 ceph-mon01 docker: cluster 2022-09-25T12:39:41.500225+0200 osd.39 (osd.39) 4190 : cluster [WRN] 4 slow requests (by type [ 'delayed' : 4 ] most affected pool [ 'ovirt' : 4 ])

tcmu-runner logs from slow IO on osd.39 till reboot:

2022-09-25 03:46:36.558 83 [CRIT] create_file_output:485: log file path now is '/var/log/tcmu-runner/tcmu-runner.log'
2022-09-25 12:39:40.136 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:40.137 83 [ERROR] __tcmu_notify_conn_lost:214 rbd/ovirt.lun1: Handler connection lost (lock state 5)
2022-09-25 12:39:48.872 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:48.883 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:48.884 83 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 12:39:48.885 83 [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 12:40:04.916 83 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/3857887088}
2022-09-25 12:40:05.048 83 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/3857887088}
2022-09-25 12:40:05.049 83 [INFO] tgt_port_grp_recovery_work_fn:272: Enabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 12:40:06.892 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 5
2022-09-25 12:40:06.893 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 12:40:06.911 83 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 12:40:06.912 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.912 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.913 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.913 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.913 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.923 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:06.941 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.107 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.107 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.108 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.108 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.108 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.109 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.518 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.518 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.518 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 12:40:07.519 83 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 13:56:26.504 83 [CRIT] main:1425: Exiting...
2022-09-25 13:56:26.612 83 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun3: appended blocklist entry: {192.168.90.32:0/3847743111}
2022-09-25 13:56:26.635 83 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun2: appended blocklist entry: {192.168.90.32:0/1152438333}
2022-09-25 13:59:45.551 82 [CRIT] main:1302: Starting...
2022-09-25 13:59:45.593 82 [INFO] load_our_module:576: Inserted module 'target_core_user'
2022-09-25 13:59:45.594 82 [INFO] tcmur_register_handler:92: Handler fbo is registered
2022-09-25 13:59:45.596 82 [INFO] tcmur_register_handler:92: Handler zbc is registered
2022-09-25 13:59:45.616 82 [INFO] tcmur_register_handler:92: Handler glfs is registered
2022-09-25 13:59:45.616 82 [INFO] tcmur_register_handler:92: Handler qcow is registered
2022-09-25 13:59:45.676 82 [INFO] tcmur_register_handler:92: Handler rbd is registered
2022-09-25 14:00:53.917 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:00:54.643 82 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun1: Acquired exclusive lock.
2022-09-25 14:00:54.643 82 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun1: Write lock acquisition successful
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 5
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 14:00:57.362 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 14:00:57.362 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 4
2022-09-25 14:00:57.367 82 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/3722889083}
2022-09-25 14:00:57.402 82 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/3722889083}
2022-09-25 14:00:58.716 82 [INFO] tcmu_acquire_dev_lock:482 rbd/ovirt.lun1: Lock acquisition unsuccessful
2022-09-25 14:00:58.719 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:00:59.737 82 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun1: Acquired exclusive lock.
2022-09-25 14:00:59.737 82 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun1: Write lock acquisition successful
2022-09-25 14:01:00.784 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 5
2022-09-25 14:01:00.784 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 14:01:00.784 82 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun1: Async lock drop. Old state 1
2022-09-25 14:01:00.790 82 [INFO] alua_implicit_transition:592 rbd/ovirt.lun1: Starting write lock acquisition operation.
2022-09-25 14:01:00.795 82 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/128875194}
2022-09-25 14:01:00.843 82 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/128875194}
2022-09-25 14:01:02.844 82 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun1: Acquired exclusive lock.
2022-09-25 14:01:02.844 82 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun1: Write lock acquisition successful
2022-09-25 14:03:01.333 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:03:01.333 82 [ERROR] __tcmu_notify_conn_lost:214 rbd/ovirt.lun1: Handler connection lost (lock state 5)
2022-09-25 14:03:02.687 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:03:05.365 82 [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 14:03:05.387 82 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun1: appended blocklist entry: {192.168.90.32:0/543220842}
2022-09-25 14:03:05.448 82 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun1: removing blocklist entry: {192.168.90.32:0/543220842}
2022-09-25 14:03:05.449 82 [INFO] tgt_port_grp_recovery_work_fn:272: Enabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 14:04:56.278 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:04:56.278 82 [ERROR] __tcmu_notify_conn_lost:214 rbd/ovirt.lun1: Handler connection lost (lock state 5)
2022-09-25 14:05:07.307 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:05:11.951 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:05:18.861 82 [ERROR] tcmu_rbd_handle_timedout_cmd:1226 rbd/ovirt.lun1: Timing out cmd.
2022-09-25 14:05:18.862 82 [INFO] tgt_port_grp_recovery_work_fn:246: Disabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.
2022-09-25 14:05:19.046 82 [INFO] tgt_port_grp_recovery_work_fn:272: Enabled iscsi/iqn.2019-03.com.redhat.iscsi-gw:ceph-igw/tpgt_2.

For me it looks like tcmu-runner isn't able to send IO requests to osd.39 and then it fails. After that, target-api is stuck and iscsi clients try to login but fail. Is there anything i can do against such situations?

It is not only that i have to reboot the exporter node, on ovirt / client site i have really slow IO and VMs are pausing.

I found two other issues which seem to be the same direction (maybe other issue but tcmu-runner fails and target-api is stuck):

ceph/ceph-iscsi-cli#102
#169

Versions:

  • ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
  • centos 7.9
  • kernel 3.10.0-1160.66.1.el7.x86_64
  • docker deployment (latest-pacific)
  • tcmu-runner 1.5.4

OSD Config:

[osd]
osd_client_watch_timeout = 15
osd_heartbeat_grace = 20
osd_heartbeat_interval = 5

Target-CLI:

  • osd_op_timeout=30

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant